feat: add Braintrust-based single-turn LLM-as-judge eval framework CLOUDP-367319 #1141
nirinchev
left a comment
At a high level, it looks reasonable, though I have some questions. One of them: my understanding was that with Braintrust we can define evals in the UI, and we'd then need to do some work to translate those into test cases that we run. The approach here instead hardcodes the evals in the MCP project and uses Braintrust only to visualize the results. Am I misunderstanding the value prop of Braintrust, or should we aim to support this dynamic evals use case in some shape or form?
      const __dirname = fileURLToPath(import.meta.url);

    - export const ROOT_DIR = path.join(__dirname, "..", "..", "..", "..");
    + export const ROOT_DIR = process.cwd();
Why did we change this? It will now depend on where the test is run from, which can cause issues.
Fair, reverted it 🫡
Context: I made this change because the Braintrust bundler could not process the original. I now pre-bundle the script before pushing it, which works.
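A minimal sketch of the point raised above: resolving the root from the module's own URL gives the same result regardless of the working directory, whereas `process.cwd()` does not. The helper name `resolveRootDir` and the `levelsUp` parameter are hypothetical and are not taken from the PR.

```typescript
import * as path from "node:path";
import { fileURLToPath } from "node:url";

/**
 * Resolve a directory relative to the module's own location rather than the
 * process working directory.
 *
 * `moduleUrl` is expected to be the calling module's `import.meta.url`;
 * `levelsUp` is how many directories separate that module from the root.
 * Both names are assumptions made for this sketch.
 */
export function resolveRootDir(moduleUrl: string, levelsUp: number): string {
    // fileURLToPath yields the module *file* path; dirname gives its directory.
    const moduleDir = path.dirname(fileURLToPath(moduleUrl));
    const ups: string[] = new Array<string>(levelsUp).fill("..");
    return path.resolve(moduleDir, ...ups);
}

// Hypothetical usage inside an ESM module three directories below the root:
// export const ROOT_DIR = resolveRootDir(import.meta.url, 3);
```

The trade-off mentioned in the reply still applies: a bundler may rewrite or reject `import.meta.url`, which is presumably why pre-bundling was needed.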
    const mflixMovies = {
        collection: "movies",
        documents: "tests/accuracy/test-data-dumps/mflix.movies-with-plot.json",
Not fully opposed to this, but should we set this up to use the default dataset instead?
The main problem with the default sample dataset is its large size, which leads the LLM to use arbitrary pagination strategies for queries. This makes it difficult to write evaluation tests with deterministic assertions.
To address this, we use a much smaller movies dataset: a 30-document subset from the Atlas sample data's Mflix movies collection. This is the same selection used in the MCP Server accuracy tests.
        },
    },
    assertions:
        "The assistant is expected to return at least 1 document, the first returned result should be the document with id 'fbf30e42-ae6d-4775-bb3e-c5c127ddea06' from 'movies' collection.",
[q] should we be more specific with the assertions? E.g. it seems to me that the assistant using find and then manually processing the documents to find the desired one will pass, even though it didn't follow the instructions. Should we be evaluating for tool usage and argument shapes or are we fine with treating the MCP server as a black box and as long as we get the desired results, we don't care how the model got to them?
That's a great question. Our goal in the eval suite is to evaluate whether the assistant achieves the correct end result for the user, for the most part without prescribing exact tool calls or argument formats. This gives us flexibility as MongoDB query syntax and MCP tools evolve over time. For example, a recent proposal aimed to add specialized lexical or vector search tools.
That said, there are scenarios where certain tool usage matters for performance reasons. We're planning to add constraints to the LLM-judge prompt so it considers tool choices and argument patterns when scoring. For instance, we might want to penalize use of the $regex operator in aggregate pipelines, or require that a search index was used. We expect these requirements to be enforced via the LLM-judge prompt, which will influence the assigned score accordingly.
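To make the plan above concrete, here is a rough sketch of how outcome assertions and optional tool-usage constraints could be combined into one judge prompt. The `JudgeCriteria` shape, the `buildJudgePrompt` name, and the 0 to 1 score range are assumptions for illustration, not the PR's actual schema.

```typescript
/** Hypothetical shape: what the judge is told to check for one eval case. */
interface JudgeCriteria {
    /** Outcome assertions, e.g. which document must be returned first. */
    assertions: string;
    /** Optional tool-usage constraints that lower the score when violated. */
    toolConstraints?: string[];
}

/** Build the judge instructions; constraints are appended only when present. */
export function buildJudgePrompt(criteria: JudgeCriteria): string {
    const sections = [
        // The 0..1 range is an assumption of this sketch.
        "Score the assistant's answer between 0 and 1.",
        `Expected outcome:\n${criteria.assertions}`,
    ];
    const constraints = criteria.toolConstraints ?? [];
    if (constraints.length > 0) {
        const bullets = constraints.map((c) => `- ${c}`).join("\n");
        sections.push(`Tool-usage constraints (penalize violations):\n${bullets}`);
    }
    return sections.join("\n\n");
}

const prompt = buildJudgePrompt({
    assertions: "The first returned result is the expected document from the 'movies' collection.",
    toolConstraints: [
        "Do not use the $regex operator inside aggregate pipelines.",
        "A search index must have been used.",
    ],
});
```

Keeping the constraints optional means most cases stay black-box, and only the cases where tool choice matters for performance carry extra rules.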
    // initialization when multiple Braintrust tasks start concurrently before the first setup completes.
    function createLazyInfrastructure(
        clusterConfig: MongoClusterConfiguration
    ): [getInfra: () => Promise<EvalInfrastructure>, closeInfra: () => Promise<void>] {
[nit] it'd be more idiomatic to return an object rather than an array here.
Fair! I've done a major refactor of this code. It should be cleaner now. 🤞
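For reference, a sketch of what the nit could look like: an object with named members, plus a cached promise so that concurrent callers share a single initialization. `createLazyResource` and its parameters are hypothetical names; this is not the refactored code from the PR.

```typescript
/** Handle returned to callers: named members instead of a positional tuple. */
interface LazyResource<T> {
    get: () => Promise<T>;
    close: () => Promise<void>;
}

export function createLazyResource<T>(
    setup: () => Promise<T>,
    teardown: (resource: T) => Promise<void>
): LazyResource<T> {
    // Cache the promise, not the resolved value, so callers that arrive while
    // setup is still running share the same in-flight initialization.
    let pending: Promise<T> | undefined;

    return {
        get: () => {
            if (!pending) {
                pending = setup().catch((error: unknown) => {
                    pending = undefined; // allow a retry after a failed setup
                    throw error;
                });
            }
            return pending;
        },
        close: async () => {
            if (!pending) {
                return;
            }
            const current = pending;
            pending = undefined;
            await teardown(await current);
        },
    };
}
```

Call sites then read `const { get, close } = createLazyResource(...)`, so adding a third member later does not break positional destructuring.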
        return;
    }

    const [getInfra, closeInfra] = createLazyInfrastructure(clusterConfig);
This creates a separate cluster for each run, but the cluster itself is reused by models, right? Should we create clusters per test suite instead?
Great point. I have reworked the evaluation logic so we no longer use a test container for every run. Instead, database lifecycle management is now handled externally by npm scripts (eval:db-start).
    }

    function braintrustNoSendLogs(): boolean {
        return !process.env.BRAINTRUST_API_KEY;
Does it even make sense to run the test suites if we don't have an API key here? What would the outcome be?
Unfortunately, a Braintrust key is currently required to run the tests. After the rework, there's an even tighter coupling to Braintrust, as we use the Braintrust gateway as the model provider.
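Since the key is mandatory, one option is to fail fast with a clear message instead of silently running without it. A small sketch follows; `requireEnv` is a hypothetical helper and the PR may handle this differently.

```typescript
type Env = Record<string, string | undefined>;

/** Return the variable's value, or throw a descriptive error when it is unset or blank. */
export function requireEnv(env: Env, name: string): string {
    const value = env[name];
    if (value === undefined || value.trim() === "") {
        throw new Error(`${name} is required to run the eval suite but is not set.`);
    }
    return value;
}

// Hypothetical usage at the top of the eval entry point:
// const apiKey = requireEnv(process.env, "BRAINTRUST_API_KEY");
```

Passing the environment in as an argument keeps the helper easy to test without mutating `process.env`.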
This PR has gone 30 days without any activity and meets the project's definition of "stale". This will be auto-closed if there is no new activity over the next 30 days. If the issue is still relevant and active, you can simply comment with a "bump" to keep it open, or add the label "not_stale". Thanks for keeping our repository healthy!
Great feedback, thank you! I've revised the approach so that prompts (datasets) are stored in Braintrust, and our eval job pulls those dynamically from Braintrust to run them.
Pull request overview
Adds a Braintrust-backed, single-turn “LLM-as-judge” evaluation framework under tests/eval/ for running dataset-driven accuracy evals against the MongoDB MCP tools (in-process via an in-memory MCP transport), plus scripts for local DB lifecycle, schema generation, and Braintrust bundling/push.
Changes:
- Introduces a dataset-driven eval runner (tests/eval/mongodb.eval.ts) with per-case temp DB seeding, single-turn agent execution, and an LLM judge that scores via synthetic helper tools.
- Adds supporting eval libraries for MCP in-memory transport, seeding (including search/vector index readiness waiting), judge tooling, and schema/type definitions.
- Adds scripts and package metadata for running the eval locally and bundling/pushing it to Braintrust.
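
The index readiness waiting mentioned in the list above is typically a poll-with-timeout loop. A generic sketch, assuming a caller-supplied `isReady` check (for example, a query against index status); `waitUntilReady` and its options are hypothetical names, not the PR's implementation.

```typescript
interface WaitOptions {
    timeoutMs: number;
    intervalMs: number;
}

/** Poll `isReady` until it resolves to true or the timeout elapses. */
export async function waitUntilReady(
    isReady: () => Promise<boolean>,
    { timeoutMs, intervalMs }: WaitOptions
): Promise<void> {
    const deadline = Date.now() + timeoutMs;
    for (;;) {
        if (await isReady()) {
            return;
        }
        // Give up if the next sleep would carry us past the deadline.
        if (Date.now() + intervalMs > deadline) {
            throw new Error(`Not ready after ${timeoutMs} ms`);
        }
        await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
}
```

A bounded wait matters here because a search or vector index that never becomes queryable would otherwise hang the whole eval run.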
Reviewed changes
Copilot reviewed 22 out of 25 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tsconfig.build.json | Adjusts TS path mappings (note: trailing whitespace introduced). |
| pnpm-workspace.yaml | Allows braintrust builds in the workspace. |
| package.json | Adds eval scripts and devDependencies (braintrust, esbuild). |
| .vscode/launch.json | Adds a “Debug Eval” launch configuration. |
| .gitignore | Ignores Braintrust artifacts and eval bundle output. |
| tests/eval/mongodb.eval.ts | Braintrust Eval() entry: dataset, parameters, seed/run/judge pipeline, cleanup reporter. |
| tests/eval/lib/mcp.ts | In-memory MCP server/client wiring + tool wrapper for AI SDK. |
| tests/eval/lib/shared.ts | Shared Mongo/MCP singletons + temp DB registry/teardown. |
| tests/eval/lib/seeding.ts | Temp DB seeding + classic/search/vector index creation and readiness waiting. |
| tests/eval/lib/user.ts | Single-turn agent runner using AI SDK generateText with MCP tools. |
| tests/eval/lib/judge.ts | LLM-as-judge runner + tool filtering + synthetic score submission. |
| tests/eval/lib/scoring.ts | Braintrust scorer pulling the judge verdict from task output. |
| tests/eval/lib/datasetTypes.ts | Zod schemas/types for dataset input/expected/output and scorer args. |
| tests/eval/lib/datasetHelpers.ts | Bundled seed document registry + db_seed parsing helpers. |
| tests/eval/lib/tool/getConversation.ts | Synthetic tool exposing a serialized conversation transcript. |
| tests/eval/lib/tool/getResponse.ts | Synthetic tool exposing the final assistant response text. |
| tests/eval/lib/tool/submitScore.ts | Synthetic tool capturing/verifying judge score submission. |
| tests/eval/scripts/evalDb.sh | Local mongodb-atlas-local container start/stop helper. |
| tests/eval/scripts/generateSchemas.ts | Emits JSON Schemas from Zod types for Braintrust dataset schemas. |
| tests/eval/scripts/bundleEval.ts | Custom esbuild bundle step with stubs/aliases for Braintrust push. |
| tests/eval/scripts/bundleEval/stub.mjs | Generic stub for optional native/desktop deps during bundling. |
| tests/eval/scripts/bundleEval/osDnsNativeStub.cjs | Stub implementation for os-dns-native via Node’s dns. |
| tests/eval/dbseed/mflix.movies.json | Seed dataset JSON for eval cases. |
| tests/eval/dbseed/mflix.movies-with-plot-embedding.json | Seed dataset JSON including embeddings for vector/search evals. |
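
Read together, the files above describe a seed, run, judge, cleanup flow per case. The following is only a sketch of that flow under stated assumptions: the `CaseHooks` interface and `runCase` function are hypothetical, and the try/finally cleanup is a simplification (the overview mentions a temp DB registry and a cleanup reporter instead).

```typescript
/** Hypothetical hooks; the real framework wires these to MongoDB, MCP and the LLMs. */
interface CaseHooks {
    seed: (dbName: string) => Promise<void>;
    runAgent: (dbName: string, prompt: string) => Promise<string>;
    judge: (response: string, criteria: string) => Promise<number>;
    dropDatabase: (dbName: string) => Promise<void>;
}

interface CaseResult {
    response: string;
    /** undefined when the case defines no judge criteria. */
    score: number | undefined;
}

export async function runCase(
    hooks: CaseHooks,
    dbName: string, // e.g. a per-case temp name such as `eval_<uuid>`
    prompt: string,
    criteria?: string
): Promise<CaseResult> {
    try {
        await hooks.seed(dbName);
        const response = await hooks.runAgent(dbName, prompt);
        const score = criteria ? await hooks.judge(response, criteria) : undefined;
        return { response, score };
    } finally {
        // Drop the temp database even when seeding, the agent, or the judge throws.
        await hooks.dropDatabase(dbName);
    }
}
```

Injecting the steps as functions keeps the orchestration testable without a database or model provider.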
Going to bed now 😴 would be great to wake up and see some feedback from Nikola! 🤩
himanshusinghs
left a comment
Overall - looks good to me! I left small nits and a question.
        }
    }

    if (inner.length === 0) continue;

Suggested change:

    - if (inner.length === 0) continue;
    + if (inner.length === 0) {
    +     continue;
    + }
Probably should enable prettier to do this.
    const BLACK_LISTED_PREFIXES = ["create", "drop", "delete", "update", "insert"];
    return Object.fromEntries(
        Object.entries(tools).filter(([name]) => {
            const lower = name.toLowerCase();
            return !BLACK_LISTED_PREFIXES.some((prefix) => lower.startsWith(prefix));
        })
    );
This is brittle, as at least some mutating tools don't start with these prefixes. A cleaner approach would be to give the judge a separate MCP server that is configured as readOnly: true.
That's a great suggestion!
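To illustrate why the name-based filter is brittle, the sketch below compares it with an allowlist on declared behavior. The `ToolInfo` shape, the `operationType` field, and the tool names are assumptions for illustration only; the suggestion in the review is to rely on a read-only MCP server instead of filtering at all.

```typescript
type OperationType = "read" | "metadata" | "create" | "update" | "delete";

/** Hypothetical tool descriptor; the field names are assumptions for illustration. */
interface ToolInfo {
    name: string;
    operationType: OperationType;
}

const MUTATING_PREFIXES = ["create", "drop", "delete", "update", "insert"];

/** Name-based filter, mirroring the quoted snippet. */
export function filterByPrefix(tools: ToolInfo[]): ToolInfo[] {
    return tools.filter((tool) => {
        const lower = tool.name.toLowerCase();
        return !MUTATING_PREFIXES.some((prefix) => lower.startsWith(prefix));
    });
}

/** Allowlist on declared behavior instead of on the tool name. */
export function filterReadOnly(tools: ToolInfo[]): ToolInfo[] {
    return tools.filter((tool) => tool.operationType === "read" || tool.operationType === "metadata");
}

// Illustrative tool list; the third entry mutates data but matches no prefix.
const tools: ToolInfo[] = [
    { name: "find", operationType: "read" },
    { name: "insert-many", operationType: "create" },
    { name: "rename-collection", operationType: "update" },
];

console.log(filterByPrefix(tools).map((t) => t.name)); // [ 'find', 'rename-collection' ]
console.log(filterReadOnly(tools).map((t) => t.name)); // [ 'find' ]
```

A server-level read-only setting goes one step further, because the mutating tools are never registered and no client-side filter has to be kept in sync.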
    const criteria = hooks.expected?.llm_judge;
    if (criteria) {
        judge = await judgeUsingLLM({
            model,
This uses the same model for testing and for evaluation. We could consider using different ones to minimize any biases the model under evaluation has.
Fair, it's pretty easy to make this configurable.
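One possible shape for that configuration, as a sketch: an optional judge model that falls back to the model under test. The `judgeModel` parameter name and the model identifiers in the usage comment are hypothetical.

```typescript
/** Hypothetical eval parameters; `judgeModel` is an assumed name. */
interface EvalParams {
    model: string;
    judgeModel?: string;
}

export function resolveModels(params: EvalParams): { agentModel: string; judgeModel: string } {
    return {
        agentModel: params.model,
        // Fall back to the model under test when no separate judge is configured.
        judgeModel: params.judgeModel ?? params.model,
    };
}

// Hypothetical usage:
// const { agentModel, judgeModel } = resolveModels({ model: "model-a", judgeModel: "model-b" });
```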
     * @param connectionString - The MongoDB connection string.
     * @returns The MCP client.
     */
    export async function getMcpClient(connectionString: string): Promise<McpClient> {
The connection string is ignored after the first call - as far as I can tell, this is okay for the current design, but is brittle and can lead to non-obvious errors further down the line. Consider caching the factories in a dictionary to ensure that consumers get the correct client.
Fair, I prefer enforcing a single connection string if you don't mind. I will create a global connection string and use it in all of the shared getMcpServer, getReadOnlyMcpServer, and getMongoDbClient tools.
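For comparison, the dictionary approach from the review comment could look roughly like this. `createKeyedCache` is a hypothetical helper, and the sketch assumes the factory never returns `undefined`.

```typescript
/** Memoize a factory per connection string so each one gets its own instance. */
export function createKeyedCache<T>(
    factory: (connectionString: string) => T
): (connectionString: string) => T {
    const instances = new Map<string, T>();
    return (connectionString) => {
        let instance = instances.get(connectionString);
        if (instance === undefined) {
            instance = factory(connectionString);
            instances.set(connectionString, instance);
        }
        return instance;
    };
}

// Hypothetical usage: with an async factory, T is a Promise, so concurrent
// callers for the same connection string also share one in-flight creation.
// const getClient = createKeyedCache((cs) => connectMcpClient(cs));
```

Enforcing a single global connection string, as the reply proposes, avoids the map entirely at the cost of not supporting more than one cluster per process.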
I isolated these Evals into a separate directory ( |
Thanks for the good feedback! Applied all your suggestions @nirinchev 🫡
I don't know why I see irrelevant code changes in this PR on those commits where I merged |
Co-authored-by: Cursor <cursoragent@cursor.com>
🎫 Ticket
CLOUDP-367319
📝 Description
Adds a self-contained accuracy eval framework for the MongoDB MCP tools under tests/eval/, backed by Braintrust for scoring & experiment tracking. 🧪
Each case runs a single-turn pipeline, per model:
- … eval_<uuid>), loads bundled documents, and creates any indexes the case requests (incl. Atlas Search / Vector Search, waiting for readiness).
- … llm_judge criteria, using read-only MCP tools plus synthetic helpers (get_conversation, get_response, submit_score).
Key design choices
- … initDataset), with input/expected schemas generated from zod types (pnpm eval:generate-schemas) so they can be attached to the dataset in the Braintrust UI. This unblocks the "define evals in the UI" workflow raised in earlier review.
- connectionString, model, and systemContext are Braintrust Eval parameters (overridable via UI, BT_EVAL_PARAMS_JSON, or env).
- pnpm eval:db-start / eval:db-stop run a healthy mongodb-atlas-local container locally (with an ngrok tip for remote/sandboxed runs if needed).
- pnpm eval:push pre-bundles the eval and pushes it as a Braintrust sandboxed Eval function that can be used in Braintrust Playground for convenient dataset refinements.
Layout
| File | Description |
|---|---|
| tests/eval/mongodb.eval.ts | Eval() entry: passing parameters, dataset, task logic, scoring |
| tests/eval/lib/datasetTypes.ts | … input/expected/output schemas and types for the eval |
| tests/eval/lib/user.ts | … generateText) |
| tests/eval/lib/judge.ts | … generateText) |
| tests/eval/lib/mcp.ts / shared.ts | … |
| tests/eval/lib/seeding.ts | … |
| tests/eval/lib/tool/* | … get_conversation, get_response, submit_score) |
| tests/eval/scripts/* | … |

Scripts
🧪 Documentation and Testing
- pnpm eval:run against a local mongodb-atlas-local instance ✅
- pnpm eval:push bundles + uploads the eval to Braintrust 🚀
- pnpm eval:generate-schemas produces the dataset input/expected schemas 🗂️
✍️ Note
- BRAINTRUST_API_KEY is the only environment variable required to run the eval; Braintrust does not currently support offline evaluation.
- … eval:debug script.
🚀 Follow-up Work