Add AI bots to user-agent regex #148665

Merged
mhl-b merged 2 commits into elastic:main from mhl-b:user-agent-fix on May 14, 2026

Conversation

@mhl-b

@mhl-b mhl-b commented May 9, 2026

Contributor

The user_agent processor misclassified AI crawler UA strings in two ways:

URL-suffix false positives. Crawlers like ChatGPT-User and meta-externalagent embed a +https://... URL whose path contains a bot/crawl token. The generic matcher picked up that URL token instead of the bot name, producing values like "com/bot" or "crawler".

No-keyword bots. Crawlers like MistralAI-User and Claude-User have no bot/spider/crawl in their name, so all generic patterns fell through to "Other".

Fix: add 9 explicit named entries for AI crawlers (OpenAI, Anthropic, Perplexity, Mistral, Meta, Cohere) before the generic matchers, and harden the generic [Bb]ot pattern with a (?<![./]) lookbehind to prevent matching tokens inside embedded URLs.
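
The two failure modes and the fix can be reproduced in a few lines. This is a minimal Python sketch, not the processor's actual code: the generic patterns are simplified stand-ins for the real regexes.yml entries, the UA strings are illustrative samples, and `classify`, `GENERIC_OLD`, `GENERIC_NEW`, and `FIXED_RULES` are hypothetical names. Only the `(?<![./])` lookbehind and the explicit-entries-first ordering are taken from the description above.

```python
import re

# Simplified stand-in for the old generic matcher (assumption, not the real rule).
GENERIC_OLD = re.compile(r"([A-Za-z0-9/]*[Bb]ot)")

# Same matcher, hardened so "bot" directly after "." or "/" is rejected.
GENERIC_NEW = re.compile(r"([A-Za-z0-9/]*(?<![./])[Bb]ot)")

# Illustrative explicit entries; the PR adds nine, only some are named in the text.
EXPLICIT_AI_RULES = [
    re.compile(r"(ChatGPT-User)"),
    re.compile(r"(Claude-User)"),
    re.compile(r"(MistralAI-User)"),
    re.compile(r"(meta-externalagent)"),
]

# Explicit entries are tried before the generic matcher.
FIXED_RULES = EXPLICIT_AI_RULES + [GENERIC_NEW]


def classify(ua, rules):
    """Return the first capture group of the first matching rule, else "Other"."""
    for rule in rules:
        match = rule.search(ua)
        if match:
            return match.group(1)
    return "Other"


# Sample UA strings for illustration only.
CHATGPT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
    "compatible; ChatGPT-User/1.0; +https://openai.com/bot"
)
CLAUDE_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; Claude-User/1.0; +Claude-User@anthropic.com)"
)

if __name__ == "__main__":
    print(classify(CHATGPT_UA, [GENERIC_OLD]))  # com/bot (URL-suffix false positive)
    print(classify(CLAUDE_UA, [GENERIC_OLD]))   # Other (no bot keyword in the name)
    print(classify(CHATGPT_UA, [GENERIC_NEW]))  # Other (URL token no longer matches)
    print(classify(CHATGPT_UA, FIXED_RULES))    # ChatGPT-User
    print(classify(CLAUDE_UA, FIXED_RULES))     # Claude-User
```

The lookbehind alone only removes the bad value; the explicit entries are what produce the correct name, which is why the sketch applies both.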

Tests: new data-driven testBotAgents() loads 34 cases from test-bot-agents.yml covering the regressions above, all explicit AI crawler entries, generic-matcher spot-checks, and pinned edge cases.
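
A data-driven test of this shape might look like the sketch below (Python for brevity; the actual test is testBotAgents() reading test-bot-agents.yml). The case fields `user_agent` and `name`, the inline `CASES` list standing in for the YAML file, and the simplified `RULES` are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the processor under test:
# explicit names first, then a hardened generic matcher.
RULES = [
    re.compile(r"(ChatGPT-User)"),
    re.compile(r"(Claude-User)"),
    re.compile(r"([A-Za-z0-9/]*(?<![./])[Bb]ot)"),
]


def classify(ua):
    """Return the first capture group of the first matching rule, else "Other"."""
    for rule in RULES:
        match = rule.search(ua)
        if match:
            return match.group(1)
    return "Other"


# Inline stand-in for the YAML case file; field names are assumed.
CASES = [
    {"user_agent": "Mozilla/5.0; compatible; ChatGPT-User/1.0; +https://openai.com/bot",
     "name": "ChatGPT-User"},
    {"user_agent": "Mozilla/5.0 (compatible; Claude-User/1.0; +Claude-User@anthropic.com)",
     "name": "Claude-User"},
    {"user_agent": "ExampleBot/1.0 (+https://example.com/bot)",
     "name": "ExampleBot"},
    {"user_agent": "curl/8.0.1",
     "name": "Other"},
]


def failures(cases):
    """Return (user_agent, expected, actual) for every case that does not match."""
    mismatches = []
    for case in cases:
        actual = classify(case["user_agent"])
        if actual != case["name"]:
            mismatches.append((case["user_agent"], case["name"], actual))
    return mismatches


if __name__ == "__main__":
    for user_agent, expected, actual in failures(CASES):
        print(f"FAIL {user_agent!r}: expected {expected!r}, got {actual!r}")
```

Keeping the cases as data means a new crawler can be covered by adding one entry rather than a new test method.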

@mhl-b mhl-b added the >enhancement, :Distributed/Ingest Node (Execution or management of Ingest Pipelines), and Team:Distributed (Meta label for distributed team) labels May 9, 2026
@elasticsearchmachine

Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine

Collaborator

Hi @mhl-b, I've created a changelog YAML for you.

@github-actions

github-actions Bot commented May 9, 2026

Contributor

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

@masseyke masseyke requested review from a team and masseyke May 11, 2026 17:43

@masseyke masseyke left a comment

Member

Looks good to me, but it would be good to have someone from the Logstash team take a look to make sure we stay consistent with the Logstash useragent filter.

@masseyke

Member

Maybe @mashhurs could take a look -- he's reviewed this processor before.

@mashhurs mashhurs left a comment

Contributor

@mhl-b, the Logstash useragent filter plugin uses the ua-parser rule sets. However, AI bot/agent user-agent strings are not there yet. There is an open PR that adds support for them, and your changes align with it. Feature-wise, I have tested and reviewed your changes and have no concerns.

I am curious why the rule sets in ES are maintained as a copy instead of being consumed from the upstream source. It also looks like the ES regexes.yml is not synced regularly with ua-parser, so Logstash and ES user-agent results might differ.

@mhl-b

mhl-b commented May 14, 2026

Contributor Author

I am curious why the rule sets in ES are maintained as a copy instead of being consumed from the upstream source.

Maybe @masseyke can tell how it started. This code only recently became part of the distributed team's umbrella, and I'm new to it.

@mhl-b mhl-b merged commit 18eff55 into elastic:main May 14, 2026
37 checks passed

Labels

:Distributed/Ingest Node (Execution or management of Ingest Pipelines)
>enhancement
Team:Distributed (Meta label for distributed team)
v9.5.0

4 participants