What Is the AutoResearch Eval Loop? How to Score AI Skill Quality with Binary Tests
Learn how to apply Karpathy's AutoResearch pattern to Claude Code skills using binary yes/no evals to score and improve output quality automatically.
Why Most Teams Don’t Actually Know If Their AI Skills Are Working
You ship a Claude Code skill. You test it a few times, the outputs look decent, and you move on. A week later, edge cases pile up, performance degrades, and you have no data to explain why — or how to fix it.
This is the default state for most teams building AI agents. They evaluate by eyeballing outputs, which works for a proof of concept and fails everywhere else.
The AutoResearch eval loop is a structured alternative. It applies binary yes/no tests to AI skill outputs, scores them automatically, and gives you a repeatable process for improving quality. Andrej Karpathy has pointed to evals as one of the highest-leverage things any AI developer can build. This article shows you exactly what that means in practice — and how to apply it to Claude Code skills.
What the AutoResearch Eval Loop Actually Is
The term “eval loop” comes from a simple idea: if you can measure quality, you can improve it. The AutoResearch pattern extends this by making the measurement fully automated — no humans reading outputs and deciding whether they seem okay.
The loop has four stages:
- Generate — Run your AI skill against a set of test inputs
- Evaluate — Apply binary tests to each output
- Score — Calculate what percentage of tests pass
- Iterate — Change something (the prompt, the model, the tool configuration), then run the loop again
“Auto” refers to automating all four stages so you can run hundreds of experiments without doing manual review. “Research” refers to the exploratory nature of the work — you’re not just verifying that something works, you’re actively discovering what configuration produces the best results.
Why Karpathy Keeps Talking About Evals
Karpathy’s core argument is that most AI developers underinvest in measurement. They spend hours tuning prompts based on vibes and then wonder why the behavior is inconsistent in production.
His point: the eval IS the product. Once you have a reliable way to measure output quality, prompt engineering becomes systematic instead of intuitive. You run an experiment, check the score, make a change, run it again. That process compounds over time in a way that gut-feel testing never does.
For Claude Code specifically, this matters because skills vary enormously in complexity. A simple “extract this data from a document” skill might work fine with minimal tuning. A skill that reasons across multiple steps, calls external tools, and formats output for downstream use needs much more rigorous evaluation.
Why Binary Tests Beat Scoring Rubrics
When teams first think about evaluating AI outputs, they often reach for rubrics: score this response from 1 to 5 on accuracy, coherence, and relevance. Rubrics feel thorough. They’re actually a mess.
The problems are predictable:
- Subjectivity — A “3” on accuracy means different things to different reviewers
- Aggregation — How do you combine a 4 on accuracy and a 2 on coherence into a useful signal?
- Automation — Complex rubrics require human judgment, which breaks the automation requirement
- Drift — Standards shift as examples accumulate and reviewers get fatigued
Binary tests sidestep all of this. Each test asks a single yes/no question about the output. The answer is either true or false. There’s no ambiguity.
What a Binary Test Looks Like
A binary test for a Claude Code skill that summarizes support tickets might look like this:
- Does the summary contain fewer than 100 words? (Yes/No)
- Does the summary mention the ticket’s resolution status? (Yes/No)
- Is the summary free of the words “apologize” or “sorry”? (Yes/No) — if your team has a style rule
- Does the output match the expected JSON schema? (Yes/No)
- Does the summary avoid introducing facts not present in the source ticket? (Yes/No)
Run 50 tickets through the skill and apply all five tests to each output. If 230 out of 250 total tests pass, your skill scores 92%. Now you have a number you can track, compare, and improve.
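The mechanics of this scoring step are simple enough to sketch in a few lines. The test functions below are hypothetical implementations of three of the tests above; the point is the shape of the loop, not these particular predicates:

```python
# Hypothetical binary tests for the ticket-summary skill; each returns True/False.
def under_100_words(output: str) -> bool:
    return len(output.split()) < 100

def mentions_resolution_status(output: str) -> bool:
    return any(word in output.lower() for word in ("resolved", "unresolved", "pending"))

def avoids_apologies(output: str) -> bool:
    return "apologize" not in output.lower() and "sorry" not in output.lower()

TESTS = [under_100_words, mentions_resolution_status, avoids_apologies]

def score(outputs: list[str]) -> float:
    """Return the pass rate across all (output, test) pairs."""
    results = [test(out) for out in outputs for test in TESTS]
    return sum(results) / len(results)

outputs = ["Ticket resolved after password reset.", "Sorry, the issue is pending."]
print(f"{score(outputs):.0%}")  # 5 of 6 tests pass -> 83%
```

Each test is a pure function of the output, so adding a test is adding a function to the list — the aggregation never changes.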
Using a Second Model as an Evaluator
Some binary tests are easy to write programmatically: checking word count, validating JSON structure, verifying that required fields are present. Others require judgment — like “does this output contain hallucinated information?”
For those, you can use a second AI model as the evaluator. Prompt it with the original input, the skill output, and a single yes/no question. This is sometimes called LLM-as-judge. When you constrain the judge to binary answers and design questions carefully, it’s surprisingly reliable — and still fully automated.
The key is keeping each question narrow. “Is this output high quality?” is too broad. “Does this output mention a specific dollar amount that wasn’t in the source document?” is answerable.
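As a minimal sketch of the judge pattern: the function below assumes a `call_model(prompt)` helper that wraps whatever model API you use — the helper and the prompt wording are illustrative, not a specific SDK:

```python
def judge(source: str, output: str, question: str, call_model) -> bool:
    """Ask a second model one narrow yes/no question about an output."""
    prompt = (
        "You are an evaluator. Answer with exactly one word: YES or NO.\n\n"
        f"Source document:\n{source}\n\n"
        f"Skill output:\n{output}\n\n"
        f"Question: {question}"
    )
    answer = call_model(prompt).strip().upper()
    return answer.startswith("YES")

# Example with a stubbed model call standing in for a real API:
fake_model = lambda prompt: "NO"
passed = judge(
    source="Customer reported a login failure. No refund was discussed.",
    output="We refunded the customer $50.",
    question="Does the output mention a dollar amount that wasn't in the source?",
    call_model=fake_model,
)  # passed is False here because the stub answered NO
```

Constraining the judge to a single word keeps the result parseable and the test binary — exactly the property that makes the loop automatable.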
Designing Binary Tests for Claude Code Skills
Good binary tests share a few properties: they’re specific, they’re grounded in observable output characteristics, and they’re tied to real failure modes you’ve seen or can anticipate.
Start With Your Failure Modes
Before writing a single test, list the ways your skill breaks. Look at past bad outputs. Talk to the people who use the skill. Common failure modes for Claude Code skills include:
- Format violations — Output doesn’t match the expected structure
- Scope creep — The skill does more (or less) than requested
- Hallucination — Output introduces information not present in the input
- Instruction following failures — The skill ignores explicit constraints (e.g., “respond only in English”)
- Truncation — Output cuts off before completing the task
- Verbosity — Output is complete but buried in unnecessary text
Each failure mode maps naturally to a binary test.
Define Your Test Suite Size
For a typical Claude Code skill, start with 20–30 test cases and 3–6 binary tests per case. That gives you 60–180 individual test evaluations per loop run. Enough to get meaningful signal, manageable enough to iterate quickly.
As the skill matures and you identify more subtle failure modes, expand the suite. Don’t start with a massive test suite — you’ll spend more time maintaining it than running it.
Version Your Tests
Your binary tests will evolve. When you fix one failure mode, you might expose another. Keep test files versioned alongside your skill configurations so you can track how score changes correspond to what you changed.
A simple naming convention helps: eval_v1_summary_skill.json, eval_v2_summary_skill.json. When a score drops unexpectedly, you can diff the test files and the skill configuration to pinpoint the cause.
Building the Eval Pipeline: Step by Step
Here’s how to implement an AutoResearch eval loop for a Claude Code skill from scratch.
Step 1: Define the Skill and Its Inputs
Start with a clear, written definition of what the skill is supposed to do. This sounds obvious but is frequently skipped. If you can’t write it in two sentences, the skill is probably underspecified.
Example: “This skill takes a raw support ticket (plain text) and returns a structured JSON summary containing: title (string), resolution_status (enum: resolved/unresolved/pending), and summary (string, max 80 words).”
Collect 20–30 real or realistic test inputs. These should include edge cases — short tickets, very long ones, tickets in unusual formats, tickets with ambiguous resolution status.
Step 2: Write Binary Tests for Each Quality Dimension
Map each dimension of quality (format, accuracy, completeness, constraint adherence) to specific tests. Write them before running any outputs — this prevents you from unconsciously designing tests around outputs you’ve already seen.
For the support ticket skill above:
Format tests:
- Does the output parse as valid JSON? (Yes/No)
- Does the JSON contain all three required fields? (Yes/No)
- Is the resolution_status value one of the three valid enum options? (Yes/No)
- Is the summary field 80 words or fewer? (Yes/No)
Accuracy tests (LLM-as-judge):
- Does the resolution_status match what a human reviewer would assign based on the ticket text? (Yes/No)
- Does the summary avoid introducing any facts not present in the original ticket? (Yes/No)
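The format tests translate directly into code. This sketch assumes the raw skill output arrives as a string; the field names match the skill definition above:

```python
import json

VALID_STATUSES = {"resolved", "unresolved", "pending"}
REQUIRED_FIELDS = {"title", "resolution_status", "summary"}

def run_format_tests(raw_output: str) -> dict[str, bool]:
    """Apply the four binary format tests to one skill output."""
    results = {"valid_json": False, "has_required_fields": False,
               "valid_status": False, "summary_length_ok": False}
    try:
        data = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return results  # the remaining tests can't pass without valid JSON
    results["valid_json"] = True
    results["has_required_fields"] = REQUIRED_FIELDS <= set(data)
    results["valid_status"] = data.get("resolution_status") in VALID_STATUSES
    summary = data.get("summary", "")
    results["summary_length_ok"] = isinstance(summary, str) and len(summary.split()) <= 80
    return results
```

Note that a failed JSON parse short-circuits the rest — a useful convention, since downstream tests on unparseable output would only add noise.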
Step 3: Run the Skill and Collect Outputs
Run Claude Code with your skill definition against all 20–30 test inputs. Store the outputs alongside the inputs. At this stage, don’t read the outputs closely — you’re about to score them systematically.
Step 4: Apply Tests and Calculate Scores
Run each binary test against each output. For programmatic tests, write simple scripts. For LLM-as-judge tests, batch the evaluations with a structured prompt that returns a JSON object with boolean fields.
Aggregate scores by test category:
- Format score: 87/100 tests pass
- Accuracy score: 74/100 tests pass
This disaggregated view is more useful than a single combined number. If format is near-perfect but accuracy is struggling, you know exactly where to focus.
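One simple way to keep that disaggregated view, assuming each test result is tagged with its category:

```python
from collections import defaultdict

def aggregate(results: list[tuple[str, bool]]) -> dict[str, str]:
    """Group (category, passed) pairs into per-category pass counts."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {cat: f"{p}/{t}" for cat, (p, t) in totals.items()}

results = [("format", True), ("format", True), ("format", False),
           ("accuracy", True), ("accuracy", False)]
print(aggregate(results))  # {'format': '2/3', 'accuracy': '1/2'}
```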
Step 5: Iterate and Re-Score
Make a single change to the skill — update the system prompt, adjust the output schema, change the model version, add a clarifying instruction. Run the loop again. Compare scores before and after.
The discipline of changing one thing at a time is what makes the loop useful. If you change three things and the score goes up, you don’t know which change did the work. If it goes down, you don’t know which change caused the regression.
Step 6: Set a Threshold and Automate
Once your skill hits an acceptable score (many teams target 85–90% pass rate across all tests), add the eval loop to CI/CD. Run it automatically whenever the skill configuration changes. Any score drop below the threshold triggers a review before the change is deployed.
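In CI, the gate can be as simple as a script that exits nonzero when the pass rate falls below the threshold. The 0.85 threshold and the hard-coded score below are placeholders — in a real pipeline the score would come from the eval run's output:

```python
import sys

THRESHOLD = 0.85  # placeholder; tune per skill

def gate(pass_rate: float, threshold: float = THRESHOLD) -> int:
    """Return a process exit code: 0 if the eval passes, 1 otherwise."""
    if pass_rate < threshold:
        print(f"FAIL: pass rate {pass_rate:.0%} below threshold {threshold:.0%}")
        return 1
    print(f"OK: pass rate {pass_rate:.0%}")
    return 0

if __name__ == "__main__":
    # In a real pipeline this would be read from the eval run's results file.
    sys.exit(gate(0.92))
```

Any CI system that fails a job on a nonzero exit code can use this directly as the review trigger.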
This is the “AutoResearch” in action — continuous, automated measurement that keeps quality from degrading silently over time.
Common Mistakes When Running Eval Loops
Writing Tests Too Late
If you write binary tests after you’ve already seen a bunch of outputs, you’ll unconsciously write tests that the current outputs pass. The eval becomes self-confirming rather than genuinely challenging. Write tests before generating outputs, based on the skill definition and anticipated failure modes.
Using Too Few Test Cases
Twenty test cases might feel like a lot when you’re building the suite. It’s usually not enough to catch distribution-level problems — cases that only appear when the skill encounters a specific input pattern. Expand your suite incrementally as you find new failure modes.
Ignoring the LLM-as-Judge Prompt Quality
If you use a second model to evaluate outputs, the quality of that evaluation prompt matters a lot. A vague evaluator prompt produces inconsistent results that undermine the whole loop. Be as specific with your judge prompt as you are with your skill prompt. Test the judge itself on a small set of cases where you know the correct answer.
Optimizing for Tests Instead of Real Performance
It’s possible to get a high eval score by overfitting your skill to the test suite — particularly if the test cases aren’t representative of real inputs. Periodically refresh your test suite with new cases from production. The goal is to measure real-world quality, not eval performance in isolation.
How MindStudio Agent Skills Fit Into This Pattern
If you’re building Claude Code agents that call external capabilities — searching the web, sending emails, generating images, running workflows — you’re already dealing with a layer of complexity that pure eval loops don’t always capture well.
MindStudio’s Agent Skills Plugin (@mindstudio-ai/agent on npm) is an SDK that lets Claude Code call 120+ typed capabilities as simple method calls: agent.searchGoogle(), agent.sendEmail(), agent.runWorkflow(). Each method handles auth, rate limiting, and retries in the background.
The eval loop applies directly here. When Claude Code uses a MindStudio skill to complete a task, you can evaluate whether:
- The right skill was called for the task (Yes/No)
- The skill was called with the correct parameters (Yes/No)
- The output from the skill was used correctly in the response (Yes/No)
- The final output meets format and accuracy requirements (Yes/No)
This gives you end-to-end visibility into agent behavior — not just whether the final output looks right, but whether the agent’s reasoning and tool usage were correct.
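The first two checks in that list can be sketched against a log of the agent's tool calls. The log format here is hypothetical — whatever structure your agent runtime emits, the tests reduce to the same membership checks:

```python
def right_skill_called(tool_calls: list[dict], expected_skill: str) -> bool:
    """Binary test: was the expected skill invoked at least once?"""
    return any(call["name"] == expected_skill for call in tool_calls)

def called_with_params(tool_calls: list[dict], skill: str, required: dict) -> bool:
    """Binary test: did some call to the skill include all required parameters?"""
    return any(
        call["name"] == skill and required.items() <= call.get("params", {}).items()
        for call in tool_calls
    )

log = [{"name": "searchGoogle", "params": {"query": "refund policy"}}]
right_skill_called(log, "searchGoogle")                              # True
called_with_params(log, "searchGoogle", {"query": "refund policy"})  # True
```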
You can also use MindStudio itself to orchestrate the eval loop. Build a workflow that runs your test inputs through Claude Code, collects outputs, applies evaluations, and logs scores to a spreadsheet or dashboard. The visual workflow builder makes this straightforward without requiring a separate eval infrastructure setup.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
What is an AutoResearch eval loop?
An AutoResearch eval loop is an automated quality measurement cycle for AI outputs. It generates outputs from an AI skill, applies defined tests to evaluate those outputs, scores the results, and feeds that score back into an iteration process. The goal is to replace ad hoc, subjective output review with repeatable, data-driven quality assessment.
What makes binary tests better than other evaluation methods for AI skills?
Binary yes/no tests are specific, easy to automate, and produce unambiguous results. Unlike 1–5 scoring rubrics, they don’t require human judgment calls and aggregate cleanly into a pass rate. They also force you to define quality criteria precisely, which itself surfaces assumptions about what “good” actually means for a given skill.
How many binary tests do I need for a reliable eval?
A minimum viable eval suite for a single Claude Code skill is typically 20–30 test inputs with 3–6 binary tests each. That produces 60–180 individual test evaluations per loop run — enough to detect meaningful quality differences when you make changes. Expand the suite as you discover new failure modes.
Can I use Claude or another LLM to run the evaluations?
Yes. This is called LLM-as-judge. For tests that require semantic judgment — detecting hallucinations, checking whether a response answers the question, assessing tone — a second model can evaluate outputs automatically. The key is constraining the judge to binary answers and writing narrow, specific evaluation prompts. Broad prompts like “is this good?” produce unreliable results.
How does this apply specifically to Claude Code?
Claude Code skills are discrete capabilities — tools or functions given to Claude for a specific purpose. Binary eval loops apply to any skill that produces consistent, evaluable output: document summarization, data extraction, code generation, classification tasks. The loop helps you tune prompts, test different tool configurations, and validate that skills perform reliably before deploying them in production. You can use prompt engineering best practices alongside eval loops to systematically improve performance.
How often should I run the eval loop?
Run it after every meaningful change to the skill — prompt updates, model changes, tool configuration changes. Once a skill is stable, integrate it into CI/CD so it runs automatically. Some teams also run evals on a schedule (weekly) to catch performance drift from model updates or changes in input distribution over time.
Key Takeaways
- The AutoResearch eval loop — generate, evaluate, score, iterate — turns AI skill quality from a subjective judgment into a measurable, improvable metric.
- Binary yes/no tests are the right unit for AI evaluation: specific, automatable, and free from the ambiguity that makes rubric scoring unreliable.
- Design your test suite before generating outputs, based on real failure modes, not on what you’ve already seen.
- Changing one thing per loop run is what makes the process useful. Multiple simultaneous changes make it impossible to attribute score changes.
- LLM-as-judge evaluators extend binary testing to semantic quality dimensions without requiring human review.
- Automating the loop as part of CI/CD prevents quality from degrading silently as skills evolve.
If you’re building Claude Code agents with external capabilities, MindStudio’s Agent Skills Plugin integrates directly with this eval pattern — and the MindStudio workflow builder can host the eval pipeline itself. Start with a small test suite, run the loop, and let the scores tell you where to spend your time.