feat: add Braintrust-based single-turn LLM-as-judge eval framework CLOUDP-367319 by nima-taheri-mongodb · Pull Request #1141 · mongodb-js/mongodb-mcp-server

nima-taheri-mongodb · 2026-05-04T19:11:18Z

🎫 Ticket

📝 Description

Adds a self-contained accuracy eval framework for the MongoDB MCP tools under tests/eval/, backed by Braintrust for scoring & experiment tracking. 🧪

Each case runs a single-turn pipeline, per model:

🌱 Seed — spins up an isolated, per-case temp database (eval_<uuid>), loads bundled documents, and creates any indexes the case requests (incl. Atlas Search / Vector Search, waiting for readiness).
🤖 Run — the agent under test answers the prompt end-to-end using the MongoDB MCP tools (served in-process over an in-memory transport — no Docker-in-Docker, no stdio child).
⚖️ Judge — an LLM-as-judge scores the result against the case's llm_judge criteria, using read-only MCP tools plus synthetic helpers (get_conversation, get_response, submit_score).
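
The three stages above can be sketched as a small orchestration function. This is an illustration only, not the PR's code: `EvalCase`, `Stages`, `runCase`, and the stage signatures are hypothetical names, and the `drop` cleanup step is an assumption about how the temp database gets removed when a stage throws.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical shapes; the real types live in tests/eval/lib/datasetTypes.ts.
interface EvalCase {
    prompt: string;
    llmJudge?: string; // judge criteria, if the case defines any
}

interface Stages {
    seed: (dbName: string, evalCase: EvalCase) => Promise<void>;
    run: (dbName: string, prompt: string) => Promise<string>;
    judge: (dbName: string, response: string, criteria: string) => Promise<number>;
    drop: (dbName: string) => Promise<void>; // assumed cleanup step
}

interface CaseResult {
    dbName: string;
    response: string;
    score: number | null;
}

// Seed -> Run -> Judge for one case; the temp database is dropped in all cases.
async function runCase(evalCase: EvalCase, stages: Stages): Promise<CaseResult> {
    const dbName = `eval_${randomUUID()}`;
    try {
        await stages.seed(dbName, evalCase);
        const response = await stages.run(dbName, evalCase.prompt);
        const score = evalCase.llmJudge
            ? await stages.judge(dbName, response, evalCase.llmJudge)
            : null;
        return { dbName, response, score };
    } finally {
        await stages.drop(dbName);
    }
}
```

Passing the stages in as functions keeps the sketch independent of any particular MCP or model client; a case without judge criteria simply yields a `null` score.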

Key design choices

🗂️ Dataset-driven — cases come from a Braintrust dataset (initDataset), with input / expected schemas generated from zod types (pnpm eval:generate-schemas) so they can be attached to the dataset in the Braintrust UI. This unblocks the "define evals in the UI" workflow raised in earlier review.
🎛️ Parameterised — connectionString, model, and systemContext are Braintrust Eval parameters (overridable via UI, BT_EVAL_PARAMS_JSON, or env).
🧬 Single-turn only — The agent gets one user turn and must finish autonomously.
♻️ DB lifecycle is external — the eval never manages containers. pnpm eval:db-start / eval:db-stop run a healthy mongodb-atlas-local container locally (with an ngrok tip for remote/sandboxed runs if needed).
📦 Deployable — pnpm eval:push pre-bundles the eval and pushes it as a Braintrust sandboxed Eval function that can be used in Braintrust Playground for convenient dataset refinements.
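
As an illustration of the parameter overrides mentioned above, here is one way the resolution could look. Only `BT_EVAL_PARAMS_JSON` and the three parameter names come from the description; the precedence order, the per-parameter environment variable names (`EVAL_CONNECTION_STRING` and so on), the JSON shape, and `resolveParams` itself are assumptions made for this sketch.

```typescript
interface EvalParams {
    connectionString: string;
    model: string;
    systemContext: string;
}

type Env = Record<string, string | undefined>;

// Hypothetical per-parameter environment variable names.
const ENV_NAMES: Record<keyof EvalParams, string> = {
    connectionString: "EVAL_CONNECTION_STRING",
    model: "EVAL_MODEL",
    systemContext: "EVAL_SYSTEM_CONTEXT",
};

// Assumed precedence: BT_EVAL_PARAMS_JSON > individual env vars > defaults.
function resolveParams(defaults: EvalParams, env: Env): EvalParams {
    const resolved: EvalParams = { ...defaults };
    const keys = Object.keys(ENV_NAMES) as (keyof EvalParams)[];

    for (const key of keys) {
        const value = env[ENV_NAMES[key]];
        if (value !== undefined) {
            resolved[key] = value;
        }
    }

    const raw = env["BT_EVAL_PARAMS_JSON"];
    if (raw) {
        // Assumed shape: a flat JSON object keyed by parameter name.
        const parsed = JSON.parse(raw) as Record<string, unknown>;
        for (const key of keys) {
            const value = parsed[key];
            if (typeof value === "string") {
                resolved[key] = value;
            }
        }
    }
    return resolved;
}
```

Taking `env` as an argument rather than reading `process.env` directly keeps the function easy to exercise in isolation.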

Layout

Path	Role
`tests/eval/mongodb.eval.ts`	Braintrust `Eval()` entry — passing parameters, dataset, task logic, scoring
`tests/eval/lib/datasetTypes.ts`	zod `input` / `expected` / `output` schemas and types for the eval
`tests/eval/lib/user.ts`	runs the single agent turn (via Vercel AI SDK `generateText`)
`tests/eval/lib/judge.ts`	LLM-as-judge over read-only MCP + synthetic tools (via Vercel AI SDK `generateText`)
`tests/eval/lib/mcp.ts` / `shared.ts`	in-process MCP client + Mongo client / temp-db registry
`tests/eval/lib/seeding.ts`	seeds collections, awaits search-index readiness
`tests/eval/lib/tool/*`	synthetic judge tools (`get_conversation`, `get_response`, `submit_score`)
`tests/eval/scripts/*`	DB lifecycle, schema generation, bundler for Braintrust push

Scripts

pnpm eval:db-start         # start local mongodb-atlas-local (waits until healthy and detaches)
pnpm eval:db-stop          # tear down the mongodb-atlas-local container
pnpm eval:run              # bt eval tests/eval/mongodb.eval.ts (runs the eval on current machine)
pnpm eval:debug            # tsx tests/eval/mongodb.eval.ts (runs the eval through direct TS entry, easier for debugging)
pnpm eval:push             # bundle + push as a Braintrust sandboxed Eval function (to be used in Braintrust Playground for dataset refinements)
pnpm eval:serve            # bt eval --dev tests/eval/mongodb.eval.ts (runs the eval as Braintrust dev server, could be used as a "remote eval" in Braintrust Playground)
pnpm eval:generate-schemas # emit dataset JSON schemas (to be applied to the Braintrust dataset input/expected schemas)

🧪 Documentation and Testing

Ran pnpm eval:run against a local mongodb-atlas-local instance ✅
Verified pnpm eval:push bundles + uploads the eval to Braintrust 🚀
pnpm eval:generate-schemas produces the dataset input/expected schemas 🗂️

✍️ Note

BRAINTRUST_API_KEY is the only environment variable required to run the eval; Braintrust does not currently support offline evaluation.
- The key is used to access the Braintrust AI gateway, a model-agnostic gateway for all the models (replacing the Azure key that was previously needed for running tests).
- It is also used to send the evaluation results to Braintrust as an experiment.
It's also possible to run the eval locally without the Braintrust API key by using the eval:debug script.

🚀 Follow-up Work

Integrate this into the post-merge CI pipeline
- This will include setting the Braintrust API key in the GitHub project settings
Update the Vercel AI SDK to the latest canary version (otherwise at least the Gemini model has bugs and does not work)

nirinchev

On a high-level, it looks reasonable, though I have some questions. Another question - my understanding was that with Braintrust, we are able to define evals in the UI and then we'd need to do some work to translate those into test cases that we run. Instead, the approach here is that we hardcode the evals in the MCP project and then only use braintrust for visualization of the results. Am I misunderstanding the value prop of Braintrust or should we aim to support this dynamic evals use case in some shape and form?

nirinchev · 2026-05-08T09:11:56Z

-const __dirname = fileURLToPath(import.meta.url);
-
-export const ROOT_DIR = path.join(__dirname, "..", "..", "..", "..");
+export const ROOT_DIR = process.cwd();


Why did we change this - this will now depend on where the test is run from, which can cause issues.

Fair, reverted it 🫡

Context: I made this change because the Braintrust bundler could not process the original code. Now I pre-bundle the script before pushing it, which works.
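
For reference, the difference under discussion is that `process.cwd()` depends on where the command is launched, while a path derived from the module's own URL does not. A minimal sketch of the latter follows; `resolveRootDir` and its `levelsUp` argument are hypothetical, and the example URL in the comment is made up.

```typescript
import * as path from "node:path";
import { fileURLToPath } from "node:url";

// Resolve the directory `levelsUp` levels above the directory that contains
// the module. `moduleUrl` would normally be `import.meta.url`.
function resolveRootDir(moduleUrl: string, levelsUp: number): string {
    const moduleDir = path.dirname(fileURLToPath(moduleUrl));
    const ups = Array.from({ length: levelsUp }, () => "..");
    return path.resolve(moduleDir, ...ups);
}

// Illustrative call with a made-up POSIX file URL:
// resolveRootDir("file:///repo/tests/eval/lib/shared.ts", 3)
```

In a module this would be called as `resolveRootDir(import.meta.url, n)`, with `n` matching the file's depth in the repository. Note that `fileURLToPath` returns the file path itself, so the `path.dirname` step matters when counting levels.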

nirinchev · 2026-05-08T09:13:15Z

+
+const mflixMovies = {
+    collection: "movies",
+    documents: "tests/accuracy/test-data-dumps/mflix.movies-with-plot.json",


Not fully opposed to this, but should we instead set this up to use the default dataset instead?

The main problem with the default sample dataset is its large size, which leads the LLM to use arbitrary pagination strategies for queries. This makes it difficult to write evaluation tests with deterministic assertions.

To address this, we use a much smaller movies dataset: a 30-document subset from the Atlas sample data's Mflix movies collection. This is the same selection used in the MCP Server accuracy tests.

nirinchev · 2026-05-08T09:27:18Z

+                },
+            },
+            assertions:
+                "The assistant is expected to return at least 1 document, the first returned result should be the document with id 'fbf30e42-ae6d-4775-bb3e-c5c127ddea06' from 'movies' collection.",


[q] should we be more specific with the assertions? E.g. it seems to me that the assistant using find and then manually processing the documents to find the desired one will pass, even though it didn't follow the instructions. Should we be evaluating for tool usage and argument shapes or are we fine with treating the MCP server as a black box and as long as we get the desired results, we don't care how the model got to them?

That's a great question. Our goal in the Eval suite is to evaluate whether the assistant achieves the correct end result for the user, without prescribing exact tool calls or argument formats for the most part. This gives us flexibility as MongoDB query syntax and MCP tools evolve over time—for example, a recent proposal had aimed to add specialized lexical or vector search tools.

That said, there are scenarios where certain tool usage matters for performance reasons. We're planning to add constraints to the LLM-judge prompt so it considers tool choices and argument patterns when scoring. For instance, we might want to penalize use of the $regex operator if used in aggregate pipelines, or require that a search index should have been used. We expect these kinds of requirements to be enforced via the LLM-judge prompt, which will influence the assigned score accordingly.
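
A sketch of how such constraints might be appended to the judge criteria. `buildJudgeCriteria` is a hypothetical helper, and the wording of the constraint text is illustrative rather than taken from the PR.

```typescript
// Hypothetical helper: combines a case's assertions with suite-wide constraints
// into the criteria text handed to the LLM judge.
function buildJudgeCriteria(assertions: string, constraints: readonly string[]): string {
    if (constraints.length === 0) {
        return assertions;
    }
    const bullets = constraints.map((c) => `- ${c}`).join("\n");
    return `${assertions}\n\nAlso take tool usage into account when scoring:\n${bullets}`;
}

const criteria = buildJudgeCriteria(
    "The assistant is expected to return at least 1 document.",
    [
        "Penalize use of the $regex operator inside aggregate pipelines.",
        "Require that a search index was used to answer the query.",
    ]
);
```

Keeping the constraints separate from each case's assertions means they can be changed in one place as the tools evolve.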

nirinchev · 2026-05-08T09:31:05Z

+// initialization when multiple Braintrust tasks start concurrently before the first setup completes.
+function createLazyInfrastructure(
+    clusterConfig: MongoClusterConfiguration
+): [getInfra: () => Promise<EvalInfrastructure>, closeInfra: () => Promise<void>] {


[nit] it'd be more idiomatic to return an object rather than array here.

Fair! I've done a major refactor of this code. It should be cleaner now. 🤞
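
For illustration, an object-returning lazy holder that also covers the concurrent-initialization case mentioned in the code comment could look like the sketch below. `Lazy` and `createLazy` are hypothetical names; this is not the refactored code from the PR.

```typescript
// Hypothetical lazy holder: the first `get()` starts initialization and every
// concurrent caller shares the same in-flight promise.
interface Lazy<T> {
    get: () => Promise<T>;
    close: () => Promise<void>;
}

function createLazy<T>(init: () => Promise<T>, dispose: (value: T) => Promise<void>): Lazy<T> {
    let pending: Promise<T> | undefined;
    return {
        get: () => {
            if (!pending) {
                // Clear the cached promise on failure so a later call can retry.
                pending = init().catch((err: unknown) => {
                    pending = undefined;
                    throw err;
                });
            }
            return pending;
        },
        close: async () => {
            if (!pending) {
                return; // never initialized, nothing to dispose
            }
            const current = pending;
            pending = undefined;
            await dispose(await current);
        },
    };
}
```

Callers can then destructure by name, for example `const { get: getInfra, close: closeInfra } = createLazy(...)`, instead of relying on tuple positions.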

nirinchev · 2026-05-08T09:43:27Z

+        return;
+    }
+
+    const [getInfra, closeInfra] = createLazyInfrastructure(clusterConfig);


This creates a separate cluster for each run, but the cluster itself is reused by models, right? Should we instead create clusters per test suite instead?

Great point. I have reworked the evaluation logic so we no longer use a test container for every run. Instead, database lifecycle management is now handled externally by npm scripts (eval:db-start).

nirinchev · 2026-05-08T09:46:06Z

+}
+
+function braintrustNoSendLogs(): boolean {
+    return !process.env.BRAINTRUST_API_KEY;


Does it even make sense to run the test suites if we don't have an API key here? What would the outcome be?

Unfortunately, a Braintrust key is currently required to run the tests. After the rework, there's an even tighter coupling to Braintrust, as we use the Braintrust gateway as the model provider.

github-actions · 2026-06-08T00:52:32Z

This PR has gone 30 days without any activity and meets the project's definition of "stale". This will be auto-closed if there is no new activity over the next 30 days. If the issue is still relevant and active, you can simply comment with a "bump" to keep it open, or add the label "not_stale". Thanks for keeping our repository healthy!

nima-taheri-mongodb · 2026-06-10T05:36:56Z

> On a high-level, it looks reasonable, though I have some questions. Another question - my understanding was that with Braintrust, we are able to define evals in the UI and then we'd need to do some work to translate those into test cases that we run. Instead, the approach here is that we hardcode the evals in the MCP project and then only use braintrust for visualization of the results. Am I misunderstanding the value prop of Braintrust or should we aim to support this dynamic evals use case in some shape and form?

Great feedback, thank you! I've revised the approach so that prompts (datasets) are stored in Braintrust, and our eval job pulls those dynamically from Braintrust to run them.

Copilot

Pull request overview

Adds a Braintrust-backed, single-turn “LLM-as-judge” evaluation framework under tests/eval/ for running dataset-driven accuracy evals against the MongoDB MCP tools (in-process via an in-memory MCP transport), plus scripts for local DB lifecycle, schema generation, and Braintrust bundling/push.

Changes:

Introduces a dataset-driven eval runner (tests/eval/mongodb.eval.ts) with per-case temp DB seeding, single-turn agent execution, and an LLM judge that scores via synthetic helper tools.
Adds supporting eval libraries for MCP in-memory transport, seeding (including search/vector index readiness waiting), judge tooling, and schema/type definitions.
Adds scripts and package metadata for running the eval locally and bundling/pushing it to Braintrust.

Reviewed changes

Copilot reviewed 22 out of 25 changed files in this pull request and generated 9 comments.

File	Description
tsconfig.build.json	Adjusts TS path mappings (note: trailing whitespace introduced).
pnpm-workspace.yaml	Allows `braintrust` builds in the workspace.
package.json	Adds eval scripts and devDependencies (`braintrust`, `esbuild`).
.vscode/launch.json	Adds a “Debug Eval” launch configuration.
.gitignore	Ignores Braintrust artifacts and eval bundle output.
tests/eval/mongodb.eval.ts	Braintrust `Eval()` entry: dataset, parameters, seed/run/judge pipeline, cleanup reporter.
tests/eval/lib/mcp.ts	In-memory MCP server/client wiring + tool wrapper for AI SDK.
tests/eval/lib/shared.ts	Shared Mongo/MCP singletons + temp DB registry/teardown.
tests/eval/lib/seeding.ts	Temp DB seeding + classic/search/vector index creation and readiness waiting.
tests/eval/lib/user.ts	Single-turn agent runner using AI SDK `generateText` with MCP tools.
tests/eval/lib/judge.ts	LLM-as-judge runner + tool filtering + synthetic score submission.
tests/eval/lib/scoring.ts	Braintrust scorer pulling the judge verdict from task output.
tests/eval/lib/datasetTypes.ts	Zod schemas/types for dataset `input`/`expected`/`output` and scorer args.
tests/eval/lib/datasetHelpers.ts	Bundled seed document registry + db_seed parsing helpers.
tests/eval/lib/tool/getConversation.ts	Synthetic tool exposing a serialized conversation transcript.
tests/eval/lib/tool/getResponse.ts	Synthetic tool exposing the final assistant response text.
tests/eval/lib/tool/submitScore.ts	Synthetic tool capturing/verifying judge score submission.
tests/eval/scripts/evalDb.sh	Local `mongodb-atlas-local` container start/stop helper.
tests/eval/scripts/generateSchemas.ts	Emits JSON Schemas from Zod types for Braintrust dataset schemas.
tests/eval/scripts/bundleEval.ts	Custom esbuild bundle step with stubs/aliases for Braintrust push.
tests/eval/scripts/bundleEval/stub.mjs	Generic stub for optional native/desktop deps during bundling.
tests/eval/scripts/bundleEval/osDnsNativeStub.cjs	Stub implementation for `os-dns-native` via Node’s `dns`.
tests/eval/dbseed/mflix.movies.json	Seed dataset JSON for eval cases.
tests/eval/dbseed/mflix.movies-with-plot-embedding.json	Seed dataset JSON including embeddings for vector/search evals.

nima-taheri-mongodb · 2026-06-10T06:09:18Z

Going to bed now 😴 would be great to wake up and see some feedback from Nikola! 🤩

himanshusinghs

Overall - looks good to me! I left small nits and a question.

himanshusinghs · 2026-06-12T08:50:19Z

+            }
+        }
+
+        if (inner.length === 0) continue;


Suggested change

-        if (inner.length === 0) continue;
+        if (inner.length === 0) {
+            continue;
+        }

Probably should enable prettier to do this.

nirinchev

Overall seems like a solid start. I have a couple of suggestions that I think we should try and address but those are not major blockers. We should also make sure to fix the tests.

nirinchev · 2026-06-16T08:31:47Z

+    const BLACK_LISTED_PREFIXES = ["create", "drop", "delete", "update", "insert"];
+    return Object.fromEntries(
+        Object.entries(tools).filter(([name]) => {
+            const lower = name.toLowerCase();
+            return !BLACK_LISTED_PREFIXES.some((prefix) => lower.startsWith(prefix));
+        })
+    );


This is brittle as at least some mutating tools don't start with these prefixes. A cleaner approach would be to get the judge a separate MCP server that is configured as readOnly: true.

That's a great suggestion!
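
To illustrate the brittleness: a mutating tool whose name does not begin with one of the listed prefixes slips through a prefix filter. The sketch below uses made-up tool descriptors and a hypothetical `mutates` flag to contrast name-based filtering with filtering on declared metadata. A separate server configured as `readOnly: true`, as suggested in the review, moves that decision into the server itself.

```typescript
// Made-up tool descriptors for illustration; `mutates` is a hypothetical flag.
interface ToolInfo {
    name: string;
    mutates: boolean;
}

const tools: ToolInfo[] = [
    { name: "find", mutates: false },
    { name: "drop-collection", mutates: true },
    { name: "rename-collection", mutates: true }, // no blocked prefix, but it mutates
];

const BLOCKED_PREFIXES = ["create", "drop", "delete", "update", "insert"];

// Prefix-based filter, mirroring the reviewed snippet.
function filterByPrefix(all: ToolInfo[]): string[] {
    return all
        .filter((t) => !BLOCKED_PREFIXES.some((p) => t.name.toLowerCase().startsWith(p)))
        .map((t) => t.name);
}

// Metadata-based filter: relies on what the tool declares, not on its name.
function filterByMetadata(all: ToolInfo[]): string[] {
    return all.filter((t) => !t.mutates).map((t) => t.name);
}
```

With these descriptors, the prefix filter keeps `rename-collection` while the metadata filter keeps only `find`.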

nirinchev · 2026-06-16T08:41:57Z

+                const criteria = hooks.expected?.llm_judge;
+                if (criteria) {
+                    judge = await judgeUsingLLM({
+                        model,


This is using the same model for testing and then evaluation - we could consider using different ones to minimize any biases the model under evaluation has.

Fair, it's pretty easy to make this configurable.

nirinchev · 2026-06-16T08:44:25Z

+ * @param connectionString - The MongoDB connection string.
+ * @returns The MCP client.
+ */
+export async function getMcpClient(connectionString: string): Promise<McpClient> {


The connection string is ignored after the first call - as far as I can tell, this is okay for the current design, but is brittle and can lead to non-obvious errors further down the line. Consider caching the factories in a dictionary to ensure that consumers get the correct client.

Fair. I'd prefer to enforce a single connection string, if you don't mind. I'll create a global connection string and use it in all of the shared getMcpServer, getReadOnlyMcpServer, and getMongoDbClient helpers.
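
One way to enforce a single connection string is to fail fast when a second, different value is supplied, rather than silently ignoring it. The sketch below uses hypothetical names (`setConnectionString`, `getConnectionString`); the PR's actual implementation may differ.

```typescript
// Hypothetical module-level holder enforcing one connection string per process.
let sharedConnectionString: string | undefined;

function setConnectionString(value: string): void {
    if (sharedConnectionString !== undefined && sharedConnectionString !== value) {
        throw new Error("Connection string already set to a different value");
    }
    sharedConnectionString = value;
}

function getConnectionString(): string {
    if (sharedConnectionString === undefined) {
        throw new Error("Connection string has not been set");
    }
    return sharedConnectionString;
}
```

Shared factories would then read the value through `getConnectionString()` instead of accepting it as a parameter, which removes the "ignored after the first call" ambiguity.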

nima-taheri-mongodb · 2026-06-17T05:22:18Z

> We should also make sure to fix the tests.

I isolated these evals in a separate directory (tests/eval) and did not make any changes to the existing unit/accuracy tests. The evals are also not integrated into CI. I don't expect any CI failures or issues caused by this PR.

nima-taheri-mongodb · 2026-06-17T05:27:05Z

Thanks for the good feedback! Applied all your suggestions @nirinchev 🫡

nima-taheri-mongodb · 2026-06-17T05:31:22Z

I don't know why I see irrelevant code changes in this PR on the commits where I merged the main branch into mine 🤔 Let me rebase on main and remove those merge commits.

These are the new commits that address Nikola's comments:
- feat: use readOnly mcp-tool
- feat: allow configuring judgeModel separately
- feat: use a single shared connection string among other resources

Co-authored-by: Cursor <cursoragent@cursor.com>

nima-taheri-mongodb changed the title ~~feat: initial poc~~ CLOUDP-367319 add Braintrust-based LLM accuracy evaluation framework May 4, 2026

nima-taheri-mongodb force-pushed the cloudp-367319_braintrust_llm-as-judge_poc branch from 9c37bd6 to e42edfa Compare May 4, 2026 19:32

github-actions Bot added the type: chore label May 4, 2026

nirinchev reviewed May 8, 2026

github-actions Bot added the no-pr-activity label Jun 8, 2026

nima-taheri-mongodb changed the title ~~CLOUDP-367319 add Braintrust-based LLM accuracy evaluation framework~~ CLOUDP-367319 add Braintrust-based single-turn LLM-as-judge eval framework Jun 10, 2026

nima-taheri-mongodb force-pushed the cloudp-367319_braintrust_llm-as-judge_poc branch from e0c64e4 to 8ab4d53 Compare June 10, 2026 05:15

nima-taheri-mongodb marked this pull request as ready for review June 10, 2026 05:42

nima-taheri-mongodb requested a review from a team as a code owner June 10, 2026 05:42

nima-taheri-mongodb requested review from Copilot and cveticm and removed request for a team June 10, 2026 05:42

Copilot started reviewing on behalf of nima-taheri-mongodb June 10, 2026 05:42

nima-taheri-mongodb requested a review from nirinchev June 10, 2026 05:42

nima-taheri-mongodb changed the title ~~CLOUDP-367319 add Braintrust-based single-turn LLM-as-judge eval framework~~ feat: add Braintrust-based single-turn LLM-as-judge eval framework Jun 10, 2026

nima-taheri-mongodb changed the title ~~feat: add Braintrust-based single-turn LLM-as-judge eval framework~~ feat: add Braintrust-based single-turn LLM-as-judge eval framework CLOUDP-367319 Jun 10, 2026

nima-taheri-mongodb changed the title ~~feat: add Braintrust-based single-turn LLM-as-judge eval framework CLOUDP-367319~~ feat: add Braintrust-based single-turn LLM-as-judge eval framework CLOUDP-367319 Jun 10, 2026

github-actions Bot added type: feature and removed type: chore labels Jun 10, 2026

Copilot AI reviewed Jun 10, 2026

nima-taheri-mongodb added the not_stale label Jun 10, 2026

himanshusinghs approved these changes Jun 12, 2026

nima-taheri-mongodb requested a review from himanshusinghs June 15, 2026 01:59

himanshusinghs approved these changes Jun 15, 2026

nirinchev reviewed Jun 16, 2026

nima-taheri-mongodb enabled auto-merge (squash) June 16, 2026 13:22

nima-taheri-mongodb disabled auto-merge June 16, 2026 13:23

nima-taheri-mongodb requested a review from nirinchev June 17, 2026 05:26

nima-taheri-mongodb and others added 14 commits June 16, 2026 22:44

feat: initial poc

53a9e36

add movies-with-plot-embedding sample dataset

883af95

minor changes

319c26c

move eval to tests folder

87f61dc

big update

c4d9efc

review: copilot

452fc00

chore: rename dbseed -> dbSeed to match import casing

cfc7059

Co-authored-by: Cursor <cursoragent@cursor.com>

stub more native libraries

b920296

feat: workaround BT dev server issue

a3e507e

fix: better singleton factory for shared resources

24d88bb

refactor: review comments

74ad65a

feat: use readOnly mcp-tool

f4d64c3

feat: allow configuring judgeModel separately

3f6771c

feat: use a single shared connection string among other resources

6a8ae5d

nima-taheri-mongodb force-pushed the cloudp-367319_braintrust_llm-as-judge_poc branch from 4c2b392 to 6a8ae5d Compare June 17, 2026 05:52

nima-taheri-mongodb added 2 commits June 17, 2026 13:50

fix: prettier, dependencies checks

597dcfd

Merge branch 'main' into cloudp-367319_braintrust_llm-as-judge_poc

882fdbf

nima-taheri-mongodb enabled auto-merge (squash) June 17, 2026 23:23

nima-taheri-mongodb disabled auto-merge June 17, 2026 23:24

nima-taheri-mongodb merged commit df39438 into main Jun 18, 2026
40 of 42 checks passed

nima-taheri-mongodb deleted the cloudp-367319_braintrust_llm-as-judge_poc branch June 18, 2026 18:04
