How to Use AI Agents to Run LLM Benchmarks: A Custom Evaluation Framework

Instead of relying on public benchmarks, you can build custom AI evaluation systems using agents. Here's how one developer built a gravity-well benchmark.

MindStudio Team

Why Public LLM Benchmarks Don’t Tell You What You Need to Know

If you’ve spent any time evaluating large language models, you’ve probably noticed the gap between benchmark scores and real-world performance. A model that aces MMLU or HumanEval can still fumble on the specific tasks your application actually needs — structured output generation, domain-specific reasoning, consistent JSON formatting under edge cases.

Public LLM benchmarks are useful for broad comparisons. But the moment you’re building something real, they stop being sufficient. What you actually need is a custom evaluation framework — one that tests the models you’re considering against your tasks, your prompts, and your quality criteria.

The good news: AI agents are a natural fit for running these evaluations. They can generate test cases, call multiple models in parallel, score outputs, and log results — all without constant human intervention. This post walks through how to build a custom LLM benchmark using a multi-agent approach, including a worked example based on a physics-style “gravity-well” benchmark one developer built to test spatial reasoning.


The Problem with Off-the-Shelf Benchmarks

Standard benchmarks like MMLU, GSM8K, and HellaSwag measure specific, standardized capabilities. They’re reproducible and useful for tracking progress across model generations. But they have some significant limitations for applied teams.

They measure what’s easy to measure, not what matters for your use case.

Most public benchmarks test factual recall, basic math, or natural language understanding in general-purpose formats. If your application needs a model to generate valid YAML, follow a specific output schema, or reason about cause-and-effect in a particular domain, those benchmarks won’t tell you much.

Leakage is a real problem.

There’s growing evidence that popular benchmarks have leaked into the training data of many frontier models. When a model scores 90% on a benchmark it’s seen variations of during training, that score tells you more about memorization than capability.

They’re static.

Benchmarks capture a snapshot. Your prompts evolve. Your use cases change. Public benchmarks don’t adapt with you.

Building a custom evaluation system solves all three problems — but the manual work involved can be prohibitive. That’s where agents come in.


What a Custom LLM Evaluation Framework Actually Includes

Before getting into agent architecture, it helps to understand what a complete evaluation system needs to do. At minimum, you need:

Test Case Generation

You need a diverse set of inputs that cover normal cases, edge cases, and adversarial inputs. These can come from real production data, synthetic generation, or a mix of both.

For a reasoning task, that might mean:

  • Simple cases with obvious answers
  • Cases that require multi-step inference
  • Ambiguous inputs where the model should express uncertainty
  • Inputs that look like they have obvious answers but don’t

Model Runners

The framework needs to call each model being evaluated with the same inputs, under comparable conditions. This includes temperature settings, system prompts, and retry logic for failures.

Output Scoring

This is where most custom frameworks get interesting. You have a few options:

  • Rule-based scoring: Does the output contain the expected string? Is the JSON valid? Did the model follow the format?
  • LLM-as-judge: Use another model to score the outputs against criteria.
  • Human-in-the-loop: Sample outputs for human review, especially for subjective quality.

Most real systems combine all three.
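
To make that combination concrete, here is a minimal sketch in Python. The call_judge_model() function is a stand-in for whichever judge-model API you use (an assumption, not a real SDK); the rule-based layer checks JSON validity and key-level correctness, and a small random sample of outputs is flagged for human review.

```python
import json
import random

def call_judge_model(prompt: str) -> str:
    """Stand-in for a call to whichever judge model you use (not a real SDK call)."""
    raise NotImplementedError

def combined_score(raw_output: str, expected: dict, human_sample_rate: float = 0.05) -> dict:
    result = {"rule_score": 0.0, "judge_score": None, "needs_human_review": False}

    # Layer 1: rule-based checks. Does the output even parse?
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return result  # invalid JSON: fail fast, nothing to judge

    # Simple correctness rule: expected keys present with expected values
    matches = sum(1 for k, v in expected.items() if parsed.get(k) == v)
    result["rule_score"] = matches / max(len(expected), 1)

    # Layer 2: LLM-as-judge for qualities the rules can't capture
    judge_prompt = (
        "Score the following answer from 1 to 5 for reasoning quality. "
        f"Expected: {json.dumps(expected)}\nAnswer: {raw_output}\nReply with a single digit."
    )
    result["judge_score"] = int(call_judge_model(judge_prompt).strip()[0])

    # Layer 3: flag a random sample of outputs for human review
    result["needs_human_review"] = random.random() < human_sample_rate
    return result
```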

Result Aggregation and Storage

You need to store raw outputs, scores, and metadata so you can analyze results across runs, compare models over time, and trace failures back to specific inputs.

Reporting

A summary view that makes results actionable — ideally broken down by category, difficulty level, or failure mode.


The Gravity-Well Benchmark: A Case Study in Domain-Specific Testing

One developer built what they called a “gravity-well” benchmark to test spatial and physical reasoning in LLMs. The premise is simple but clever: describe a simulated environment with objects of different masses and ask the model to predict trajectories, relative forces, or the behavior of a test particle.

The benchmark tests several things at once:

  • Can the model hold multiple spatial relationships in working memory?
  • Does it apply consistent reasoning about forces across different configurations?
  • Does it know when to express uncertainty?
  • Can it format outputs correctly under constraints?

None of these are well-captured by standard benchmarks. HumanEval tests code generation. GSM8K tests grade-school arithmetic. But applying gravitational principles to a described environment while maintaining spatial consistency? That’s a gap.

Why This Approach Works

The gravity-well setup has a few properties that make it a good benchmark template:

Scalable difficulty. You can create trivial cases (two objects, obvious outcome) or complex ones (five objects with competing gravitational effects, asymmetric initial conditions). This gives you a difficulty curve to test against.

Objective scoring. Physics has right answers. You can compute the expected result and check the model’s output against it, with tolerance ranges for approximations.

Resistant to memorization. The specific configurations are generated fresh, so models can’t recall them from training data.

Domain-transferable. The same framework structure works for any domain with computable ground truth — chemistry, logistics, financial modeling.
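
To make the objective-scoring property concrete, here is a minimal sketch (an illustration, not the developer's actual code): it generates a fresh configuration of point masses, computes the net gravitational acceleration on a test particle as the ground truth, and checks a model's numeric answer against it with a relative tolerance.

```python
import math
import random

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2

def random_configuration(n_bodies: int) -> list[dict]:
    """Generate a fresh scenario each run, so it cannot be recalled from training data."""
    return [
        {"mass": random.uniform(1e20, 1e24),                                   # kg
         "pos": (random.uniform(-1e6, 1e6), random.uniform(-1e6, 1e6))}        # meters
        for _ in range(n_bodies)
    ]

def net_acceleration(bodies: list[dict], particle_pos: tuple[float, float]) -> tuple[float, float]:
    """Ground truth: vector sum of gravitational accelerations on a test particle."""
    ax = ay = 0.0
    for b in bodies:
        dx = b["pos"][0] - particle_pos[0]
        dy = b["pos"][1] - particle_pos[1]
        r = math.hypot(dx, dy)
        a = G * b["mass"] / r ** 2     # acceleration magnitude toward this body
        ax += a * dx / r
        ay += a * dy / r
    return ax, ay

def within_tolerance(model_value: float, expected: float, rel_tol: float = 0.05) -> bool:
    """Pass if the model's number is within 5% of the computed ground truth."""
    return math.isclose(model_value, expected, rel_tol=rel_tol)
```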

Building the Benchmark with Agents

Here’s how the developer structured the agent system:

Agent 1: Test Case Generator. Takes a difficulty parameter and generates a set of gravity-well configurations. Outputs structured JSON describing each scenario.

Agent 2: Model Runner. Takes a test case, formats it into the appropriate prompt, and calls each target model. Handles retries, logs latency, and captures raw outputs.

Agent 3: Scorer. Parses each model’s output, computes the correct answer using a physics formula, and assigns a score. For free-form outputs, a secondary LLM-as-judge call scores reasoning quality.

Agent 4: Aggregator. Collects all scored results, groups by model and difficulty tier, and writes a structured report.

This four-agent pipeline ran end-to-end on a batch of 200 test cases across six models in under 30 minutes — something that would have taken days to do manually.
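In code, the pipeline is essentially a loop over test cases and models with a scoring and aggregation pass at the end. The sketch below is illustrative only; generate_test_cases, run_model, score_output, and aggregate are hypothetical helpers standing in for the four agents.

```python
def run_benchmark(models: list[str], tiers: list[str], cases_per_tier: int) -> dict:
    # Agent 1: generate fresh test cases for each difficulty tier
    test_cases = [
        case
        for tier in tiers
        for case in generate_test_cases(tier, cases_per_tier)    # hypothetical helper
    ]

    results = []
    for case in test_cases:
        for model in models:
            # Agent 2: call the model with a consistent prompt and settings
            raw_output = run_model(model, case)                  # hypothetical helper

            # Agent 3: compute ground truth and score the output
            score = score_output(case, raw_output)               # hypothetical helper

            results.append({"case_id": case["id"], "model": model,
                            "output": raw_output, "score": score})

    # Agent 4: group by model and difficulty tier, then write the report
    return aggregate(results)                                    # hypothetical helper
```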


How to Build a Multi-Agent Evaluation System

Here’s a practical walkthrough for building your own custom benchmark system using agents.

Step 1: Define What You’re Actually Testing

Start with your specific use case, not a generic capability. Write down:

  • What does a good output look like?
  • What are the most common failure modes you care about?
  • Can you compute a ground truth, or do you need LLM-as-judge?

If you can’t define “correct” clearly, your benchmark will produce noisy, hard-to-interpret results.

Step 2: Build Your Test Case Library

Generate or curate at least 50–100 test cases per evaluation category. Include:

  • Easy cases (~30%): The model should get these right. Used to filter out completely broken behavior.
  • Medium cases (~50%): This is where meaningful differentiation happens.
  • Hard cases (~20%): Edge cases, adversarial inputs, or high-complexity scenarios.

Store test cases in a structured format (JSON or a spreadsheet) with fields for input, expected output, difficulty tier, and category tags.
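
For example, a single gravity-well test case record might look like the following (field names are illustrative; use whatever your scoring logic expects):

```json
{
  "id": "gw-042",
  "category": "gravity_well",
  "difficulty": "easy",
  "input": "Two bodies of equal mass (6e24 kg) sit at x=0 (body A) and x=1e6 m (body B). A test particle is at x=2e5 m. Does it initially accelerate toward A or toward B?",
  "expected_output": {"answer": "A"},
  "tags": ["two-body", "qualitative"]
}
```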

Step 3: Set Up Your Model Runner Agent

The runner agent’s job is simple but important: it needs to be consistent. Use identical prompts, identical temperature settings, and identical context windows across all models. Any variation introduces noise.

Key decisions to make:

  • Will you test with the same system prompt across all models, or model-specific system prompts?
  • Will you run each test case once, or multiple times to account for temperature variance?
  • How will you handle model-specific quirks (like different context limits or output format preferences)?

Running each case 3–5 times with a non-zero temperature and averaging scores gives more reliable results than a single-shot run.
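
Here is a minimal runner sketch along those lines. call_model(), build_prompt(), and score_output() are hypothetical stand-ins for your model API wrapper, your prompt template, and the Step 4 scorer; the point is that every model gets the same prompt, temperature, and retry behavior.

```python
import statistics
import time

def call_model(model: str, prompt: str, temperature: float) -> str:
    """Stand-in for your actual model API call (not a real SDK call)."""
    raise NotImplementedError

def run_case(model: str, case: dict, n_runs: int = 3, temperature: float = 0.7,
             max_retries: int = 2) -> dict:
    """Run one test case n_runs times with identical settings and average the scores."""
    scores, latencies, outputs = [], [], []
    prompt = build_prompt(case)             # hypothetical: one template shared by every model

    for _ in range(n_runs):
        for attempt in range(max_retries + 1):
            try:
                start = time.monotonic()
                output = call_model(model, prompt, temperature)
                latencies.append((time.monotonic() - start) * 1000)  # ms
                break
            except Exception:
                if attempt == max_retries:
                    output = ""             # record the failure instead of crashing the run
        outputs.append(output)
        scores.append(score_output(case, output))   # hypothetical scorer from Step 4

    return {
        "model": model,
        "case_id": case["id"],
        "mean_score": statistics.mean(scores),
        "mean_latency_ms": statistics.mean(latencies) if latencies else None,
        "outputs": outputs,
    }
```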

Step 4: Design Your Scoring Logic

Match your scoring method to your output type:

For structured outputs (JSON, YAML, code):

  • Validity check: Does it parse?
  • Correctness check: Do the values match expected output?
  • Format compliance: Does it follow the schema?
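
A sketch of those three checks for JSON output (the expected values and required keys come from your test case record; the function is illustrative, not a library API):

```python
import json

def score_structured(raw_output: str, expected: dict, required_keys: set[str]) -> dict:
    scores = {"valid": 0, "correct": 0, "format": 0}

    # 1. Validity: does it parse at all?
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return scores
    scores["valid"] = 1

    # 2. Format compliance: are all required keys present?
    if required_keys.issubset(parsed.keys()):
        scores["format"] = 1

    # 3. Correctness: do the values match the expected output?
    if all(parsed.get(k) == v for k, v in expected.items()):
        scores["correct"] = 1

    return scores
```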

For open-ended reasoning:

  • LLM-as-judge with a rubric (accuracy, completeness, coherence, brevity)
  • Have your judge model score on a 1–5 scale with a brief rationale
  • Run judge scoring with a stronger model than the ones being evaluated when possible
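
One way to phrase the judge call, again using the hypothetical call_judge_model() wrapper from earlier. The rubric wording below is an illustration you would adapt to your task, not a canonical prompt.

```python
import json

JUDGE_RUBRIC = """You are grading another model's answer.
Score each criterion from 1 (poor) to 5 (excellent):
- accuracy: is the conclusion correct?
- completeness: does it address every part of the question?
- coherence: is the reasoning easy to follow?
- brevity: is it free of padding?

Question: {question}
Reference notes: {reference}
Answer to grade: {answer}

Reply as JSON: {{"accuracy": n, "completeness": n, "coherence": n, "brevity": n, "rationale": "one sentence"}}"""

def judge(question: str, reference: str, answer: str) -> dict:
    prompt = JUDGE_RUBRIC.format(question=question, reference=reference, answer=answer)
    return json.loads(call_judge_model(prompt))   # hypothetical wrapper around your judge model
```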

For factual recall:

  • Exact match or fuzzy string match
  • Normalize formatting before comparison
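
Normalization can be as small as the sketch below; difflib from the Python standard library handles the fuzzy comparison.

```python
import re
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    text = text.strip().lower()
    text = re.sub(r"[^\w\s]", "", text)      # drop punctuation
    return re.sub(r"\s+", " ", text)         # collapse whitespace

def fuzzy_match(output: str, expected: str, threshold: float = 0.9) -> bool:
    return SequenceMatcher(None, normalize(output), normalize(expected)).ratio() >= threshold
```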

Step 5: Log Everything

Every model call, every output, every score. You want to be able to:

  • Reproduce any result
  • Trace a failure back to the specific input
  • Compare runs across time as models update or prompts change

A simple schema: {test_case_id, model, run_id, timestamp, input, raw_output, parsed_output, score, scorer_notes, latency_ms}
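
Appending one JSON record per line to a log file is enough to start. A sketch using that schema (the example values are placeholders):

```python
import json
import time
import uuid

def log_result(path: str, record: dict) -> None:
    """Append one evaluation record per line so runs are easy to replay and diff."""
    record.setdefault("run_id", str(uuid.uuid4()))
    record.setdefault("timestamp", time.time())
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example record matching the schema above (values are placeholders)
log_result("eval_log.jsonl", {
    "test_case_id": "gw-042",
    "model": "model-a",
    "input": "...",
    "raw_output": "...",
    "parsed_output": {"answer": "A"},
    "score": 1.0,
    "scorer_notes": "exact match",
    "latency_ms": 812,
})
```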

Step 6: Build the Reporting Layer

Your aggregator agent should produce at minimum:

  • Overall score per model
  • Score breakdown by difficulty tier
  • Score breakdown by category
  • Pass/fail rate on structured output tasks
  • Average latency per model
  • Failure mode summary (common error types)

A sortable table in a shared doc or Airtable view is enough to start. You don’t need a dashboard on day one.
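
A minimal aggregator sketch that builds the per-model, per-tier breakdown from the JSONL log. It assumes a difficulty field was logged with each record; adapt the grouping keys to whatever metadata you actually store.

```python
import json
from collections import defaultdict
from statistics import mean

def build_report(log_path: str) -> dict:
    groups = defaultdict(list)
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            r = json.loads(line)
            groups[(r["model"], r.get("difficulty", "unknown"))].append(r)

    report = {}
    for (model, tier), records in sorted(groups.items()):
        report[f"{model} / {tier}"] = {
            "mean_score": round(mean(r["score"] for r in records), 3),
            "pass_rate": sum(1 for r in records if r["score"] >= 1.0) / len(records),  # full-credit rate
            "mean_latency_ms": round(mean(r["latency_ms"] for r in records), 1),
            "n": len(records),
        }
    return report
```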


Common Pitfalls in Custom LLM Evaluation

Even well-designed evaluation systems run into predictable problems. Here are the ones worth knowing about before you start.

Prompt Sensitivity Without Normalization

Small prompt changes can produce large score swings. If you’re comparing models, make sure you’ve normalized prompts as much as possible — or explicitly test prompt robustness as part of your benchmark.

Judge Model Bias

LLM-as-judge approaches have a known issue: models tend to prefer outputs that look like their own style. If you’re using GPT-4 as a judge, it may systematically favor GPT-4 outputs. Use multiple judge models or cross-validate against human ratings on a sample.

Overfitting the Benchmark to One Model

It’s tempting to iterate your benchmark until your preferred model scores well. Resist this. Your test cases should be derived from real use cases, not from model behavior.

Ignoring Latency and Cost

A model that scores 5% better but costs 3x more and is twice as slow may not be the right choice. Include cost-per-query and p95 latency in your reporting from the start.
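
Both numbers fall straight out of the logs. A small sketch, assuming you also record token counts and know each model's per-1K-token prices:

```python
def p95(values: list[float]) -> float:
    """95th-percentile latency: sort and take the value 95% of the way through."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

def cost_per_query(prompt_tokens: int, completion_tokens: int,
                   price_in_per_1k: float, price_out_per_1k: float) -> float:
    return prompt_tokens / 1000 * price_in_per_1k + completion_tokens / 1000 * price_out_per_1k
```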

Not Versioning Your Benchmarks

If you change your test cases or scoring logic between runs, the results aren’t comparable. Version your benchmarks the same way you version code.
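
One lightweight approach is to hash the test set and scoring code and stamp that hash into every run's metadata, so two runs are only compared when the hashes match. A sketch (file names are illustrative):

```python
import hashlib

def benchmark_version(*paths: str) -> str:
    """Content hash of the test cases and scorer; runs are comparable only if it matches."""
    h = hashlib.sha256()
    for path in sorted(paths):
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()[:12]

version = benchmark_version("test_cases.json", "scorer.py")   # illustrative file names
```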


Where MindStudio Fits for Teams Building Evaluation Pipelines

Building a multi-agent evaluation system from scratch requires stitching together API calls, retry logic, result storage, and reporting — a lot of infrastructure before you even get to the interesting part.

MindStudio’s visual builder lets you construct this entire pipeline without managing that infrastructure yourself. You can build each evaluation agent as a separate workflow, chain them together, and run the whole system on a schedule or via webhook.

Practically, this means you can:

  • Build a test case generator agent that outputs structured JSON to an Airtable base
  • Set up a model runner agent that pulls from that base, calls multiple models (from MindStudio’s library of 200+ models, including Claude, GPT, Gemini, and others), and writes results back
  • Connect a scorer agent that runs LLM-as-judge evaluation against a rubric
  • Trigger a reporting agent on a schedule that compiles results to a Google Sheet or Notion doc

No separate API keys. No separate accounts for each model. The multi-model access alone removes a significant chunk of setup time when you’re running comparative benchmarks.

For teams that want to go deeper, MindStudio also supports custom JavaScript and Python functions — useful if your scoring logic involves computing physics formulas, parsing domain-specific formats, or running validations that can’t be expressed as prompts.

You can start building for free at mindstudio.ai.

If you’re looking to go further with agent orchestration, check out how MindStudio handles multi-agent workflow building and AI automation for business processes.


Frequently Asked Questions

What is LLM benchmarking and why does it matter?

LLM benchmarking is the process of systematically evaluating language models against a set of defined tasks and quality criteria. It matters because different models have different strengths — a benchmark tells you which model is actually best for your specific use case, rather than which one has the highest general-purpose score. Public benchmarks provide a starting point, but custom benchmarks give you task-specific signal that’s far more actionable.

How do AI agents help automate LLM evaluation?

AI agents can automate the repetitive, structured work involved in running evaluations: generating test inputs, calling multiple models with identical prompts, parsing and scoring outputs, and aggregating results. A multi-agent system can run hundreds of test cases across multiple models in parallel and log everything automatically — work that would otherwise require hours of manual effort per evaluation cycle.

What is LLM-as-judge and when should you use it?

LLM-as-judge is a technique where you use a language model to score the outputs of other language models against a rubric. It’s most useful when outputs are open-ended and don’t have a single correct answer — like evaluating reasoning quality, coherence, or helpfulness. The main risk is judge bias: models tend to prefer outputs that match their own style. Mitigation strategies include using multiple judge models, calibrating against human ratings on a sample, and making your scoring rubric explicit.

How many test cases do you need for a reliable benchmark?

For most use cases, 100–500 test cases per evaluation category gives you statistically meaningful results. Fewer than 50 makes it hard to draw confident conclusions, especially if some test cases have high variance. For high-stakes decisions (like choosing a model for a production system), aim for 200+ cases per category and run each multiple times if you’re using non-zero temperature settings.

What’s the difference between a custom benchmark and a public one?

Public benchmarks (like MMLU, HumanEval, or BIG-Bench) are standardized, widely used, and good for comparing models at a general capability level. Custom benchmarks are designed around your specific tasks, prompts, and quality criteria. They’re less useful for broad comparisons but far more useful for making deployment decisions. The ideal approach is to use public benchmarks as a first filter and custom benchmarks for final evaluation.

Can you benchmark smaller or open-source models the same way?

Yes. The same framework applies regardless of whether you’re evaluating closed-source frontier models or open-source models you’re running locally. The main difference is infrastructure: locally hosted models may have different latency profiles and rate limits. If you’re benchmarking open-source models, include inference cost (compute + hosting) in your reporting alongside accuracy metrics — it’s often a significant factor in the final model choice.


Key Takeaways

  • Public benchmarks measure general capability. Custom benchmarks measure whether a model works for your specific tasks — and that gap matters.
  • A multi-agent evaluation system (test case generator → model runner → scorer → aggregator) can automate the bulk of the evaluation work.
  • Domain-specific benchmarks like the gravity-well approach are effective because they’re computable, scalable in difficulty, and resistant to training data leakage.
  • LLM-as-judge scoring is powerful for open-ended outputs but needs calibration to avoid systematic bias.
  • Log everything from the start — raw outputs, scores, latency, and run metadata — so you can reproduce results and compare across time.

If you want to build your own evaluation pipeline without managing the infrastructure, MindStudio gives you a visual way to wire together multi-model agents, connect to data stores, and run automated workflows. It’s free to start and takes a fraction of the setup time of a custom-coded system.
