
feat: add Braintrust-based single-turn LLM-as-judge eval framework CLOUDP-367319#1141

Merged
nima-taheri-mongodb merged 16 commits into main from cloudp-367319_braintrust_llm-as-judge_poc on Jun 18, 2026

@nima-taheri-mongodb nima-taheri-mongodb commented May 4, 2026

Collaborator

🎫 Ticket

CLOUDP-367319

📝 Description

Adds a self-contained accuracy eval framework for the MongoDB MCP tools under tests/eval/, backed by Braintrust for scoring & experiment tracking. 🧪

Each case runs a single-turn pipeline, per model:

  1. 🌱 Seed — spins up an isolated, per-case temp database (eval_<uuid>), loads bundled documents, and creates any indexes the case requests (incl. Atlas Search / Vector Search, waiting for readiness).
  2. 🤖 Run — the agent under test answers the prompt end-to-end using the MongoDB MCP tools (served in-process over an in-memory transport — no Docker-in-Docker, no stdio child).
  3. ⚖️ Judge — an LLM-as-judge scores the result against the case's llm_judge criteria, using read-only MCP tools plus synthetic helpers (get_conversation, get_response, submit_score).
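The three steps above can be sketched as one function per case. This is a minimal sketch under stated assumptions, not the PR's implementation: `EvalCase`, `CaseSteps`, `runCase`, and `tempDbName` are hypothetical names, the steps are injected as plain async functions, and dropping the temp database in a `finally` block is an assumption about cleanup.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical shapes for illustration only.
interface EvalCase {
    prompt: string;
    llmJudge?: string; // criteria the judge scores against
}

interface CaseSteps {
    seed: (dbName: string) => Promise<void>;
    run: (prompt: string, dbName: string) => Promise<string>;
    judge: (criteria: string, response: string, dbName: string) => Promise<number>;
    dropDb: (dbName: string) => Promise<void>;
}

// Per-case database name in the eval_<uuid> form described above.
function tempDbName(): string {
    return `eval_${randomUUID()}`;
}

async function runCase(
    evalCase: EvalCase,
    steps: CaseSteps
): Promise<{ response: string; score: number | null }> {
    const dbName = tempDbName();
    try {
        await steps.seed(dbName); // 1. Seed
        const response = await steps.run(evalCase.prompt, dbName); // 2. Run (single turn)
        // 3. Judge, only when the case defines criteria
        const score = evalCase.llmJudge
            ? await steps.judge(evalCase.llmJudge, response, dbName)
            : null;
        return { response, score };
    } finally {
        // Assumption: the temp database is dropped even when a step throws.
        await steps.dropDb(dbName);
    }
}
```

Injecting the steps keeps the orchestration testable without a database or a model provider.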

Key design choices

  • 🗂️ Dataset-driven — cases come from a Braintrust dataset (initDataset), with input / expected schemas generated from zod types (pnpm eval:generate-schemas) so they can be attached to the dataset in the Braintrust UI. This unblocks the "define evals in the UI" workflow raised in earlier review.
  • 🎛️ Parameterised: connectionString, model, and systemContext are Braintrust Eval parameters (overridable via UI, BT_EVAL_PARAMS_JSON, or env).
  • 🧬 Single-turn only — The agent gets one user turn and must finish autonomously.
  • ♻️ DB lifecycle is external — the eval never manages containers. pnpm eval:db-start / eval:db-stop run a healthy mongodb-atlas-local container locally (with an ngrok tip for remote/sandboxed runs if needed).
  • 📦 Deployable: pnpm eval:push pre-bundles the eval and pushes it as a Braintrust sandboxed Eval function that can be used in the Braintrust Playground for convenient dataset refinement.
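To illustrate the parameter layering idea, here is a minimal sketch that merges defaults with a JSON object read from BT_EVAL_PARAMS_JSON. How Braintrust itself applies these overrides is not shown in this PR; `EvalParams`, `resolveParams`, and the rule of ignoring unknown or non-string values are assumptions made for the example.

```typescript
// Hypothetical parameter shape, using the three parameter names from the description.
interface EvalParams {
    connectionString: string;
    model: string;
    systemContext: string;
}

type Env = Record<string, string | undefined>;

// Merge defaults with the JSON object in BT_EVAL_PARAMS_JSON, if present.
// Assumption: unknown keys and non-string values are ignored rather than rejected.
function resolveParams(defaults: EvalParams, env: Env): EvalParams {
    const raw = env["BT_EVAL_PARAMS_JSON"];
    if (!raw) {
        return { ...defaults };
    }
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
        throw new Error("BT_EVAL_PARAMS_JSON must be a JSON object");
    }
    const overrides = parsed as Record<string, unknown>;
    const pick = (key: keyof EvalParams): string => {
        const value = overrides[key];
        return typeof value === "string" ? value : defaults[key];
    };
    return {
        connectionString: pick("connectionString"),
        model: pick("model"),
        systemContext: pick("systemContext"),
    };
}
```

Passing the environment in as an argument, rather than reading process.env directly, keeps the function easy to exercise in isolation.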

Layout

| Path | Role |
| --- | --- |
| tests/eval/mongodb.eval.ts | Braintrust Eval() entry: passing parameters, dataset, task logic, scoring |
| tests/eval/lib/datasetTypes.ts | zod input / expected / output schemas and types for the eval |
| tests/eval/lib/user.ts | runs the single agent turn (via Vercel AI SDK generateText) |
| tests/eval/lib/judge.ts | LLM-as-judge over read-only MCP + synthetic tools (via Vercel AI SDK generateText) |
| tests/eval/lib/mcp.ts / shared.ts | in-process MCP client + Mongo client / temp-db registry |
| tests/eval/lib/seeding.ts | seeds collections, awaits search-index readiness |
| tests/eval/lib/tool/* | synthetic judge tools (get_conversation, get_response, submit_score) |
| tests/eval/scripts/* | DB lifecycle, schema generation, bundler for Braintrust push |

Scripts

pnpm eval:db-start         # start local mongodb-atlas-local (waits until healthy and detaches)
pnpm eval:db-stop          # tear down the mongodb-atlas-local container
pnpm eval:run              # bt eval tests/eval/mongodb.eval.ts (runs the eval on current machine)
pnpm eval:debug            # tsx tests/eval/mongodb.eval.ts (runs the eval through direct TS entry, easier for debugging)
pnpm eval:push             # bundle + push as a Braintrust sandboxed Eval function (to be used in Braintrust Playground for dataset refinements)
pnpm eval:serve            # bt eval --dev tests/eval/mongodb.eval.ts (runs the eval as Braintrust dev server, could be used as a "remote eval" in Braintrust Playground)
pnpm eval:generate-schemas # emit dataset JSON schemas (to be applied to the Braintrust dataset input/expected schemas)

🧪 Documentation and Testing

  • Ran pnpm eval:run against a local mongodb-atlas-local instance ✅
  • Verified pnpm eval:push bundles + uploads the eval to Braintrust 🚀
  • pnpm eval:generate-schemas produces the dataset input/expected schemas 🗂️

✍️ Note

  • BRAINTRUST_API_KEY is the only environment variable required to run the eval; Braintrust does not currently support offline evaluation.
    • The key is used to access the Braintrust AI gateway, a model-agnostic gateway for all the models (replacing the Azure key that was previously needed for running tests).
    • It is also used to send the evaluation results to Braintrust as an experiment.
  • It's also possible to run the eval locally without the Braintrust API key, by using the eval:debug script.

🚀 Follow-up Work

  • Integrate this into the post-merge CI pipeline
    • This will include setting the Braintrust API key in the GitHub project settings
  • Update the Vercel AI SDK to the latest canary version (otherwise at least the Gemini model has bugs and does not work)

@nima-taheri-mongodb nima-taheri-mongodb changed the title feat: initial poc CLOUDP-367319 add Braintrust-based LLM accuracy evaluation framework May 4, 2026
@nima-taheri-mongodb nima-taheri-mongodb force-pushed the cloudp-367319_braintrust_llm-as-judge_poc branch from 9c37bd6 to e42edfa Compare May 4, 2026 19:32

@nirinchev nirinchev left a comment

Collaborator

On a high-level, it looks reasonable, though I have some questions. Another question - my understanding was that with Braintrust, we are able to define evals in the UI and then we'd need to do some work to translate those into test cases that we run. Instead, the approach here is that we hardcode the evals in the MCP project and then only use braintrust for visualization of the results. Am I misunderstanding the value prop of Braintrust or should we aim to support this dynamic evals use case in some shape and form?

Comment thread tests/accuracy/sdk/constants.ts Outdated
const __dirname = fileURLToPath(import.meta.url);

- export const ROOT_DIR = path.join(__dirname, "..", "..", "..", "..");
+ export const ROOT_DIR = process.cwd();

Collaborator

Why did we change this - this will now depend on where the test is run from, which can cause issues.

Collaborator Author

Fair, reverted it 🫡

Context: I made this change because the Braintrust bundler could not process this. I now pre-bundle the script before pushing it, which works.
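For reference, a root directory can be derived from the module's own location so it does not depend on where the process was started. The sketch below is illustrative only: `rootDirFrom` and its `levelsUp` argument are hypothetical, and the right number of levels depends on the repository layout.

```typescript
import * as path from "node:path";
import { fileURLToPath } from "node:url";

// Resolve the directory `levelsUp` levels above the directory that contains
// the module identified by `moduleUrl` (for example, import.meta.url).
function rootDirFrom(moduleUrl: string, levelsUp: number): string {
    const moduleDir = path.dirname(fileURLToPath(moduleUrl));
    const parents = Array.from({ length: levelsUp }, () => "..");
    return path.resolve(moduleDir, ...parents);
}

// Usage inside an ES module (the level count here is only an example):
// const ROOT_DIR = rootDirFrom(import.meta.url, 3);
```

Unlike process.cwd(), the result stays the same regardless of the directory the test runner is launched from.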


const mflixMovies = {
collection: "movies",
documents: "tests/accuracy/test-data-dumps/mflix.movies-with-plot.json",

Collaborator

Not fully opposed to this, but should we set this up to use the default dataset instead?

Collaborator Author

The main problem with the default sample dataset is its large size, which leads the LLM to use arbitrary pagination strategies for queries. This makes it difficult to write evaluation tests with deterministic assertions.

To address this, we use a much smaller movies dataset: a 30-document subset from the Atlas sample data's Mflix movies collection. This is the same selection used in the MCP Server accuracy tests.

},
},
assertions:
"The assistant is expected to return at least 1 document, the first returned result should be the document with id 'fbf30e42-ae6d-4775-bb3e-c5c127ddea06' from 'movies' collection.",

Collaborator

[q] should we be more specific with the assertions? E.g. it seems to me that the assistant using find and then manually processing the documents to find the desired one will pass, even though it didn't follow the instructions. Should we be evaluating for tool usage and argument shapes or are we fine with treating the MCP server as a black box and as long as we get the desired results, we don't care how the model got to them?

Collaborator Author

That's a great question. Our goal in the Eval suite is to evaluate whether the assistant achieves the correct end result for the user, without prescribing exact tool calls or argument formats for the most part. This gives us flexibility as MongoDB query syntax and MCP tools evolve over time—for example, a recent proposal had aimed to add specialized lexical or vector search tools.

That said, there are scenarios where certain tool usage matters for performance reasons. We're planning to add constraints to the LLM-judge prompt so it considers tool choices and argument patterns when scoring. For instance, we might want to penalize use of the $regex operator if used in aggregate pipelines, or require that a search index should have been used. We expect these kinds of requirements to be enforced via the LLM-judge prompt, which will influence the assigned score accordingly.
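One way such constraints could be layered onto a case's criteria is sketched below. This is purely illustrative: `buildJudgePrompt` and the prompt wording are hypothetical and are not taken from the PR's judge implementation.

```typescript
// Combine a case's judge criteria with optional tool-usage constraints.
// The wording of the prompt is an assumption made for this example.
function buildJudgePrompt(criteria: string, constraints: string[] = []): string {
    const lines = [
        "Score the assistant's result against the following criteria:",
        criteria.trim(),
    ];
    if (constraints.length > 0) {
        lines.push("", "Also take these tool-usage constraints into account when scoring:");
        for (const constraint of constraints) {
            lines.push(`- ${constraint}`);
        }
    }
    return lines.join("\n");
}

// Example with the two constraints mentioned above.
const examplePrompt = buildJudgePrompt("The assistant returns at least 1 document.", [
    "Penalize use of the $regex operator in aggregate pipelines.",
    "A search index should have been used.",
]);
```

Keeping constraints separate from the criteria would let shared performance rules apply to many cases without editing each case's assertion text.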

// initialization when multiple Braintrust tasks start concurrently before the first setup completes.
function createLazyInfrastructure(
clusterConfig: MongoClusterConfiguration
): [getInfra: () => Promise<EvalInfrastructure>, closeInfra: () => Promise<void>] {

Collaborator

[nit] it'd be more idiomatic to return an object rather than array here.

Collaborator Author

Fair! I've done a major refactor of this code. It should be cleaner now. 🤞
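For context, the object-returning shape suggested above, combined with lazy initialization that is safe under concurrent callers, could look like the following. This is a minimal sketch, not the refactored code from the PR: `LazyResource` and `createLazyResource` are hypothetical names.

```typescript
interface LazyResource<T> {
    get: () => Promise<T>;
    close: () => Promise<void>;
}

function createLazyResource<T>(
    init: () => Promise<T>,
    dispose: (value: T) => Promise<void>
): LazyResource<T> {
    let pending: Promise<T> | undefined;
    return {
        // Concurrent callers share one in-flight promise, so init runs
        // at most once between creation and close().
        get: () => {
            if (!pending) {
                pending = init();
            }
            return pending;
        },
        // No-op when nothing was ever initialized.
        close: async () => {
            if (!pending) {
                return;
            }
            const current = pending;
            pending = undefined;
            await dispose(await current);
        },
    };
}
```

Call sites then read `const { get, close } = createLazyResource(...)`, which avoids relying on tuple position. Note that this sketch also caches a rejected promise, so a failed init is not retried until close() is called.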

return;
}

const [getInfra, closeInfra] = createLazyInfrastructure(clusterConfig);

Collaborator

This creates a separate cluster for each run, but the cluster itself is reused by models, right? Should we create clusters per test suite instead?

Collaborator Author

Great point. I have reworked the evaluation logic so we no longer use a test container for every run. Instead, database lifecycle management is now handled externally by npm scripts (eval:db-start).

}

function braintrustNoSendLogs(): boolean {
return !process.env.BRAINTRUST_API_KEY;

Collaborator

Does it even make sense to run the test suites if we don't have an API key here? What would the outcome be?

Collaborator Author

Unfortunately, a Braintrust key is currently required to run the tests. After the rework, there's an even tighter coupling to Braintrust, as we use the Braintrust gateway as the model provider.

@github-actions

github-actions Bot commented Jun 8, 2026

Contributor

This PR has gone 30 days without any activity and meets the project's definition of "stale". This will be auto-closed if there is no new activity over the next 30 days. If the issue is still relevant and active, you can simply comment with a "bump" to keep it open, or add the label "not_stale". Thanks for keeping our repository healthy!

@nima-taheri-mongodb nima-taheri-mongodb changed the title CLOUDP-367319 add Braintrust-based LLM accuracy evaluation framework CLOUDP-367319 add Braintrust-based single-turn LLM-as-judge eval framework Jun 10, 2026
@nima-taheri-mongodb nima-taheri-mongodb force-pushed the cloudp-367319_braintrust_llm-as-judge_poc branch from e0c64e4 to 8ab4d53 Compare June 10, 2026 05:15
@nima-taheri-mongodb

Collaborator Author

> On a high-level, it looks reasonable, though I have some questions. Another question - my understanding was that with Braintrust, we are able to define evals in the UI and then we'd need to do some work to translate those into test cases that we run. Instead, the approach here is that we hardcode the evals in the MCP project and then only use braintrust for visualization of the results. Am I misunderstanding the value prop of Braintrust or should we aim to support this dynamic evals use case in some shape and form?

Great feedback, thank you! I've revised the approach so that prompts (datasets) are stored in Braintrust, and our eval job pulls those dynamically from Braintrust to run them.

@nima-taheri-mongodb nima-taheri-mongodb marked this pull request as ready for review June 10, 2026 05:42
@nima-taheri-mongodb nima-taheri-mongodb requested a review from a team as a code owner June 10, 2026 05:42
@nima-taheri-mongodb nima-taheri-mongodb requested review from Copilot and cveticm and removed request for a team June 10, 2026 05:42
@nima-taheri-mongodb nima-taheri-mongodb changed the title CLOUDP-367319 add Braintrust-based single-turn LLM-as-judge eval framework feat: add Braintrust-based single-turn LLM-as-judge eval framework Jun 10, 2026
@nima-taheri-mongodb nima-taheri-mongodb changed the title feat: add Braintrust-based single-turn LLM-as-judge eval framework feat: add Braintrust-based single-turn LLM-as-judge eval framework CLOUDP-367319 Jun 10, 2026
@nima-taheri-mongodb nima-taheri-mongodb changed the title feat: add Braintrust-based single-turn LLM-as-judge eval framework CLOUDP-367319 feat: add Braintrust-based single-turn LLM-as-judge eval framework CLOUDP-367319 Jun 10, 2026

Copilot AI left a comment

Contributor

Pull request overview

Adds a Braintrust-backed, single-turn “LLM-as-judge” evaluation framework under tests/eval/ for running dataset-driven accuracy evals against the MongoDB MCP tools (in-process via an in-memory MCP transport), plus scripts for local DB lifecycle, schema generation, and Braintrust bundling/push.

Changes:

  • Introduces a dataset-driven eval runner (tests/eval/mongodb.eval.ts) with per-case temp DB seeding, single-turn agent execution, and an LLM judge that scores via synthetic helper tools.
  • Adds supporting eval libraries for MCP in-memory transport, seeding (including search/vector index readiness waiting), judge tooling, and schema/type definitions.
  • Adds scripts and package metadata for running the eval locally and bundling/pushing it to Braintrust.

Reviewed changes

Copilot reviewed 22 out of 25 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| tsconfig.build.json | Adjusts TS path mappings (note: trailing whitespace introduced). |
| pnpm-workspace.yaml | Allows braintrust builds in the workspace. |
| package.json | Adds eval scripts and devDependencies (braintrust, esbuild). |
| .vscode/launch.json | Adds a “Debug Eval” launch configuration. |
| .gitignore | Ignores Braintrust artifacts and eval bundle output. |
| tests/eval/mongodb.eval.ts | Braintrust Eval() entry: dataset, parameters, seed/run/judge pipeline, cleanup reporter. |
| tests/eval/lib/mcp.ts | In-memory MCP server/client wiring + tool wrapper for AI SDK. |
| tests/eval/lib/shared.ts | Shared Mongo/MCP singletons + temp DB registry/teardown. |
| tests/eval/lib/seeding.ts | Temp DB seeding + classic/search/vector index creation and readiness waiting. |
| tests/eval/lib/user.ts | Single-turn agent runner using AI SDK generateText with MCP tools. |
| tests/eval/lib/judge.ts | LLM-as-judge runner + tool filtering + synthetic score submission. |
| tests/eval/lib/scoring.ts | Braintrust scorer pulling the judge verdict from task output. |
| tests/eval/lib/datasetTypes.ts | Zod schemas/types for dataset input/expected/output and scorer args. |
| tests/eval/lib/datasetHelpers.ts | Bundled seed document registry + db_seed parsing helpers. |
| tests/eval/lib/tool/getConversation.ts | Synthetic tool exposing a serialized conversation transcript. |
| tests/eval/lib/tool/getResponse.ts | Synthetic tool exposing the final assistant response text. |
| tests/eval/lib/tool/submitScore.ts | Synthetic tool capturing/verifying judge score submission. |
| tests/eval/scripts/evalDb.sh | Local mongodb-atlas-local container start/stop helper. |
| tests/eval/scripts/generateSchemas.ts | Emits JSON Schemas from Zod types for Braintrust dataset schemas. |
| tests/eval/scripts/bundleEval.ts | Custom esbuild bundle step with stubs/aliases for Braintrust push. |
| tests/eval/scripts/bundleEval/stub.mjs | Generic stub for optional native/desktop deps during bundling. |
| tests/eval/scripts/bundleEval/osDnsNativeStub.cjs | Stub implementation for os-dns-native via Node’s dns. |
| tests/eval/dbseed/mflix.movies.json | Seed dataset JSON for eval cases. |
| tests/eval/dbseed/mflix.movies-with-plot-embedding.json | Seed dataset JSON including embeddings for vector/search evals. |

Comment thread tests/eval/lib/datasetHelpers.ts
Comment thread tests/eval/lib/datasetHelpers.ts
Comment thread tests/eval/scripts/generateSchemas.ts Outdated
Comment thread tests/eval/scripts/evalDb.sh Outdated
Comment thread tests/eval/scripts/evalDb.sh Outdated
Comment thread tests/eval/lib/mcp.ts
Comment thread tsconfig.build.json Outdated
Comment thread tests/eval/lib/tool/getConversation.ts Outdated
Comment thread tests/eval/lib/tool/getConversation.ts Outdated
@nima-taheri-mongodb

Collaborator Author

Going to bed now 😴 would be great to wake up and see some feedback from Nikola! 🤩

@himanshusinghs himanshusinghs left a comment

Collaborator

Overall - looks good to me! I left small nits and a question.

Comment thread tests/eval/lib/tool/getConversation.ts Outdated
Comment thread tests/eval/lib/tool/getConversation.ts Outdated
Comment thread tests/eval/lib/tool/getConversation.ts Outdated
}
}

if (inner.length === 0) continue;

Collaborator

Suggested change
- if (inner.length === 0) continue;
+ if (inner.length === 0) {
+     continue;
+ }

Probably should enable prettier to do this.

Comment thread tests/eval/lib/tool/submitScore.ts
Comment thread tests/eval/lib/seeding.ts
Comment thread tests/eval/mongodb.eval.ts Outdated
Comment thread tsconfig.build.json

@nirinchev nirinchev left a comment

Collaborator

Overall seems like a solid start. I have a couple of suggestions that I think we should try and address but those are not major blockers. We should also make sure to fix the tests.

Comment thread tests/eval/lib/judge.ts Outdated
Comment on lines +21 to +27
const BLACK_LISTED_PREFIXES = ["create", "drop", "delete", "update", "insert"];
return Object.fromEntries(
    Object.entries(tools).filter(([name]) => {
        const lower = name.toLowerCase();
        return !BLACK_LISTED_PREFIXES.some((prefix) => lower.startsWith(prefix));
    })
);

Collaborator

This is brittle, as at least some mutating tools don't start with these prefixes. A cleaner approach would be to give the judge a separate MCP server that is configured as readOnly: true.

Collaborator Author

That's a great suggestion!
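To make the brittleness concrete, the sketch below compares the prefix filter with a filter on declared capability. It is illustrative only: the `ToolInfo` shape, its `readOnly` flag, and the tool names are assumptions for the example. The fix discussed in this thread is a separate read-only MCP server, not a client-side filter.

```typescript
// Hypothetical tool descriptor: `readOnly` stands in for whatever metadata
// the server exposes about a tool.
interface ToolInfo {
    name: string;
    readOnly: boolean;
}

const DENYLISTED_PREFIXES = ["create", "drop", "delete", "update", "insert"];

// The name-based approach from the snippet under review.
function filterByPrefix(tools: ToolInfo[]): ToolInfo[] {
    return tools.filter((tool) => {
        const lower = tool.name.toLowerCase();
        return !DENYLISTED_PREFIXES.some((prefix) => lower.startsWith(prefix));
    });
}

// Filter on declared capability instead of on the name.
function filterByCapability(tools: ToolInfo[]): ToolInfo[] {
    return tools.filter((tool) => tool.readOnly);
}

// Illustrative tool names, not an authoritative list of the server's tools.
const sampleTools: ToolInfo[] = [
    { name: "find", readOnly: true },
    { name: "insert-many", readOnly: false },
    { name: "rename-collection", readOnly: false },
];

const byPrefix = filterByPrefix(sampleTools).map((tool) => tool.name);
const byCapability = filterByCapability(sampleTools).map((tool) => tool.name);
```

Here `byPrefix` still contains the mutating "rename-collection" tool because its name matches no denylisted prefix, while `byCapability` contains only "find".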

Comment thread tests/eval/mongodb.eval.ts Outdated
const criteria = hooks.expected?.llm_judge;
if (criteria) {
judge = await judgeUsingLLM({
model,

Collaborator

This is using the same model for testing and then evaluation - we could consider using different ones to minimize any biases the model under evaluation has.

Collaborator Author

fair, it's pretty easy to make this configurable.

Comment thread tests/eval/lib/shared.ts Outdated
* @param connectionString - The MongoDB connection string.
* @returns The MCP client.
*/
export async function getMcpClient(connectionString: string): Promise<McpClient> {

Collaborator

The connection string is ignored after the first call - as far as I can tell, this is okay for the current design, but is brittle and can lead to non-obvious errors further down the line. Consider caching the factories in a dictionary to ensure that consumers get the correct client.

Collaborator Author

Fair, I'd prefer enforcing a single connection string, if you don't mind. I'll create a global connection string and use it in all the shared helpers (getMcpServer, getReadOnlyMcpServer, getMongoDbClient).
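For comparison, the dictionary-based caching suggested in the review could be sketched as follows. This is not what the PR ended up doing (it enforces one shared connection string instead); `createClientCache` and the generic client type are hypothetical.

```typescript
// Memoize one client per connection string, so a call with a different
// string gets its own client instead of silently reusing the first one.
function createClientCache<C>(
    factory: (connectionString: string) => C
): (connectionString: string) => C {
    const clients = new Map<string, C>();
    return (connectionString: string): C => {
        const cached = clients.get(connectionString);
        if (cached !== undefined) {
            return cached;
        }
        const created = factory(connectionString);
        clients.set(connectionString, created);
        return created;
    };
}
```

Enforcing a single global connection string is simpler when the eval only ever targets one deployment; the cache only pays off if several deployments are used in one process.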

@nima-taheri-mongodb nima-taheri-mongodb enabled auto-merge (squash) June 16, 2026 13:22
@nima-taheri-mongodb

Collaborator Author

> We should also make sure to fix the tests.

I isolated these evals into a separate directory (tests/eval) and did not change any existing unit/accuracy tests. The evals are also not integrated into CI. I don't expect any CI failures caused by this PR.

@nima-taheri-mongodb

Collaborator Author

Thanks for the good feedback! Applied all your suggestions @nirinchev 🫡

@nima-taheri-mongodb

nima-taheri-mongodb commented Jun 17, 2026

Collaborator Author

I don't know why I see irrelevant code changes in this PR on the commits where I merged the main branch into mine 🤔 Let me rebase on main and remove those merge commits.

These are the new commits that address Nikola's comments:
  • feat: use readOnly mcp-tool
  • feat: allow configuring judgeModel separately
  • feat: use a single shared connection string among other resources

@nima-taheri-mongodb nima-taheri-mongodb force-pushed the cloudp-367319_braintrust_llm-as-judge_poc branch from 4c2b392 to 6a8ae5d Compare June 17, 2026 05:52
@nima-taheri-mongodb nima-taheri-mongodb enabled auto-merge (squash) June 17, 2026 23:23
@nima-taheri-mongodb nima-taheri-mongodb merged commit df39438 into main Jun 18, 2026
40 of 42 checks passed
@nima-taheri-mongodb nima-taheri-mongodb deleted the cloudp-367319_braintrust_llm-as-judge_poc branch June 18, 2026 18:04
4 participants