Add AI bots to user-agent regex #148665
|
Pinging @elastic/es-distributed (Team:Distributed) |
|
Hi @mhl-b, I've created a changelog YAML for you. |
|
masseyke
left a comment
Looks good to me, but it would be worth having someone from Logstash take a look to make sure we stay consistent with the Logstash user agent filter.
|
Maybe @mashhurs could take a look -- he's reviewed this processor before. |
mashhurs
left a comment
@mhl-b, the Logstash useragent-filter plugin uses ua-parser rule sets. However, AI bot/agent user agent strings are not there yet. There is an open PR that adds support for them, and your changes align with it. Feature-wise, I have tested and reviewed your changes and have no concerns.
I am curious why the rule sets in ES are maintained/copied instead of being consumed from the upstream source. It also looks like the ES regexes.yml is not synced regularly with ua-parser, so Logstash and ES useragent results might differ.
Maybe @masseyke can tell how it started. This code only recently came under the distributed team's umbrella and I'm new to this. |
The user_agent processor misclassified AI crawler UA strings in two ways:
URL-suffix false positives. Crawlers like ChatGPT-User and meta-externalagent embed a +https://... URL whose path contains a bot/crawl token. The generic matcher picked up that URL token instead of the bot name, producing values like "com/bot" or "crawler".
No-keyword bots. Crawlers like MistralAI-User and Claude-User have no bot/spider/crawl in their name, so all generic patterns fell through to "Other".
Fix: add 9 explicit named entries for AI crawlers (OpenAI, Anthropic, Perplexity, Mistral, Meta, Cohere) before the generic matchers, and harden the generic [Bb]ot pattern with a (?<![./]) lookbehind to prevent matching tokens inside embedded URLs.
Tests: new data-driven testBotAgents() loads 34 cases from test-bot-agents.yml covering the regressions above, all explicit AI crawler entries, generic-matcher spot-checks, and pinned edge cases.
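The two failure modes and the fix described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual rules: the three patterns are simplified, hypothetical stand-ins for entries in regexes.yml, the `classify` helper is invented for the example, and the user-agent strings are abbreviated.

```python
import re

# Hypothetical, simplified stand-ins for entries in regexes.yml.
# The real patterns are more involved; only the (?<![./]) lookbehind
# is taken from the PR description.
EXPLICIT_AI = re.compile(r"(ChatGPT-User|Claude-User|MistralAI-User|meta-externalagent)")
NAIVE_GENERIC = re.compile(r"([A-Za-z0-9]+/?[Bb]ot)")
HARDENED_GENERIC = re.compile(r"([A-Za-z0-9]+/?(?<![./])[Bb]ot)")


def classify(user_agent, rules):
    """Return the capture of the first rule that matches, else "Other"."""
    for rule in rules:
        match = rule.search(user_agent)
        if match:
            return match.group(1)
    return "Other"


# Abbreviated, illustrative user-agent strings.
CHATGPT_UA = "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)"
MISTRAL_UA = "Mozilla/5.0 (compatible; MistralAI-User/1.0)"
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# URL-suffix false positive: the generic rule latches onto the embedded URL.
print(classify(CHATGPT_UA, [NAIVE_GENERIC]))                  # com/bot
# The lookbehind rejects a "bot" token preceded by "." or "/".
print(classify(CHATGPT_UA, [HARDENED_GENERIC]))               # Other
# Explicit entries are tried first, so the crawler name wins.
print(classify(CHATGPT_UA, [EXPLICIT_AI, HARDENED_GENERIC]))  # ChatGPT-User
# No-keyword bot: only an explicit entry can name it.
print(classify(MISTRAL_UA, [HARDENED_GENERIC]))               # Other
print(classify(MISTRAL_UA, [EXPLICIT_AI, HARDENED_GENERIC]))  # MistralAI-User
# Ordinary bot names still match the hardened generic rule.
print(classify(GOOGLEBOT_UA, [EXPLICIT_AI, HARDENED_GENERIC]))  # Googlebot
```

Rule order matters in this sketch for the same reason it does in the PR: the first matching rule wins, so the explicit entries have to come before the generic matcher.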