
What Is Andrej Karpathy's AutoResearch Pattern Applied to Claude Code Skills?

Learn how to adapt Karpathy's autonomous ML research loop to improve Claude Code skill outputs using eval files, pass rates, and overnight self-improvement.

MindStudio Team

The Loop That Makes AI Get Better at Its Own Job

Andrej Karpathy has been making a case for something that initially sounds almost too simple: if you want an AI system to improve at a task, build a way to measure how well it’s doing, then let it run overnight.

This is the core of what’s been called the AutoResearch pattern — a structured iteration loop where an AI agent tests its own outputs against defined eval criteria, analyzes failures, proposes improvements, and re-tests. It’s the same feedback loop that makes supervised learning work, now applied to making AI skill calls more reliable.

When you apply this to Claude Code workflows — specifically to how Claude calls typed capabilities like those provided by the MindStudio Agent Skills Plugin — you get something practical: a system that can optimize its own prompt behavior while you sleep.

This article explains how that works, from setting up eval files to measuring pass rates to running the overnight improvement loop.


What the AutoResearch Pattern Actually Is

Karpathy’s thinking on autonomous research loops comes from a broader observation about how good AI work gets done: the bottleneck usually isn’t the model’s raw capability — it’s the feedback signal.

In traditional ML, you improve a model by training it on labeled data with a loss function. The loss tells the model how wrong it was, and gradient descent adjusts the weights. Run enough iterations, and the model improves.

The AutoResearch pattern applies the same logic to AI agents. Instead of adjusting weights, you adjust prompts, system instructions, and few-shot examples. Instead of a loss function, you use eval files — structured test cases that tell you whether the agent did the right thing.

The Three Core Components

The pattern has three parts:

  1. Eval files — A collection of input/expected-output pairs. Each represents a task the agent should handle, and each defines what “correct” looks like.
  2. Pass rate tracking — A metric (usually a percentage) that tells you how many eval cases the agent is getting right.
  3. Improvement loop — The cycle of running evals, analyzing failures, proposing prompt changes, re-running, and repeating.

The point Karpathy returns to consistently: if you don’t have evals, you can’t tell whether your agent got better or worse after a change. You’re guessing. Evals turn prompt engineering from intuition into something you can actually measure.

Why Overnight Iteration Matters

The “overnight” framing is practical. Running a full eval suite takes time, and analyzing failure cases takes more time on top of that. If you structure the loop properly, you can fire it off at end of day and come back to improved system prompts in the morning.

This is only possible if the loop is autonomous — meaning the agent can read its own failures, reason about what went wrong, generate candidate improvements, test them, and select the best version without you in the loop for every step.


What Claude Code Skills Are

Before getting into the implementation, it’s worth being precise about what “skills” means in the context of Claude Code.

Claude Code is Anthropic’s agentic coding assistant. It operates inside a terminal, reads and writes files, runs shell commands, and calls external tools. Unlike a chat interface, it takes actions rather than just returning text.

“Skills” in this context refers to typed, callable capabilities that Claude Code can invoke as part of completing a task. These are distinct from raw code execution. A skill might be searchGoogle(), sendEmail(), generateImage(), or runWorkflow() — named, structured actions with defined inputs and outputs.

The MindStudio Agent Skills Plugin

The MindStudio Agent Skills Plugin (@mindstudio-ai/agent) is an npm SDK that exposes 120+ of these typed capabilities to any AI agent, including Claude Code. Instead of writing boilerplate to call an API, retry on failure, handle auth, and parse the response, Claude Code can call a method directly:

await agent.searchGoogle({ query: "latest AI research benchmarks" });
await agent.sendEmail({ to: "team@company.com", subject: "Weekly report", body: content });
await agent.generateImage({ prompt: "product visualization, white background" });

The SDK handles the infrastructure — rate limiting, retries, authentication — so the agent focuses on reasoning about which skill to call and how to parameterize it.

This skill surface is what the AutoResearch pattern optimizes: not the model itself, but how Claude Code decides to call these capabilities in response to a given task.


Why Skill Output Quality Varies

If you’ve watched Claude Code work through real tasks, you’ve probably noticed that skill call quality isn’t consistent. Sometimes it picks exactly the right tool with well-formed parameters. Sometimes it picks the right tool but passes vague parameters. Sometimes it calls the wrong tool entirely.

This variation comes from a few places:

  • Ambiguous task descriptions — The system prompt doesn’t make clear which skill applies to which type of problem.
  • Under-specified parameters — Claude knows to call generateImage() but defaults to generic prompt strings when more specific ones would produce better results.
  • Sequencing errors — Claude calls tools in the wrong order, missing dependencies between skill outputs and subsequent calls.
  • Context length drift — In longer sessions, earlier instructions about skill usage get deprioritized as the context fills.

The good news: all of these are prompt-level problems. They don’t require retraining. They require better system instructions, better few-shot examples, or both.

The AutoResearch pattern is designed to find and fix these issues systematically.


Building Eval Files for Claude Code Skills

An eval file is a structured representation of a test case. For Claude Code skill calls, a test case has three parts:

  1. Input — A task description or user request that Claude Code should respond to by calling one or more skills.
  2. Expected behavior — What the correct skill call looks like: which skill, what parameters, in what order.
  3. Pass criteria — How you determine whether Claude’s actual output matches expected behavior.

What an Eval File Looks Like

A simple format in JSON:

{
  "id": "eval_search_001",
  "input": "Find recent news about GPT-5 benchmarks",
  "expected": {
    "skill": "searchGoogle",
    "params": {
      "query_contains": ["GPT-5", "benchmark"]
    }
  },
  "pass_criteria": "skill_match_and_param_contains"
}

More complex evals can test multi-step sequences:

{
  "id": "eval_report_pipeline_003",
  "input": "Research quantum computing trends and email a summary to the team",
  "expected_sequence": [
    { "skill": "searchGoogle", "params": { "query_contains": ["quantum computing"] } },
    { "skill": "sendEmail", "params": { "to_contains": "@" } }
  ],
  "pass_criteria": "sequence_order_and_param_match"
}
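Before running anything, it helps to sanity-check eval files against this shape. The validator below is a hypothetical helper, not part of any SDK; the field names mirror the JSON examples above, and everything else is an assumption:

```javascript
// Minimal validator for the eval case shape shown above (hypothetical helper).
// Accepts single-skill evals ("expected") and multi-step evals ("expected_sequence").
function validateEvalCase(evalCase) {
  const errors = [];
  if (typeof evalCase.id !== "string" || evalCase.id.length === 0) {
    errors.push("missing id");
  }
  if (typeof evalCase.input !== "string" || evalCase.input.length === 0) {
    errors.push("missing input");
  }
  const hasSingle = typeof evalCase.expected === "object" && evalCase.expected !== null;
  const hasSequence = Array.isArray(evalCase.expected_sequence);
  if (!hasSingle && !hasSequence) {
    errors.push("needs expected or expected_sequence");
  }
  if (typeof evalCase.pass_criteria !== "string") {
    errors.push("missing pass_criteria");
  }
  return { valid: errors.length === 0, errors };
}
```

Rejecting malformed cases up front keeps a failed run attributable to the agent rather than to a typo in the suite.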

Categories to Cover

A solid eval suite for Claude Code skills typically covers:

  • Tool selection accuracy — Does Claude pick the right skill for the task?
  • Parameter quality — Are parameters specific, well-formed, and appropriate?
  • Sequencing — When a task requires multiple skill calls, are they in the correct order?
  • Edge cases — Ambiguous inputs where the correct behavior is to ask for clarification, not guess.
  • Refusals — Tasks where the right answer is not to call any skill.

Aim for 20–50 eval cases when starting out. Enough to surface real failure patterns, but not so many that each iteration cycle becomes slow.

Writing Pass Criteria

Exact-match criteria work for simple cases. For real-world evals, fuzzy criteria are more useful:

  • Skill match — Correct skill was called, regardless of parameters.
  • Param contains — A parameter includes a required keyword or value.
  • Sequence match — Multiple skills were called in the correct order.
  • LLM-as-judge — A secondary model evaluates whether the output was acceptable, useful for open-ended cases where no single answer is correct.

The LLM-as-judge approach is particularly powerful for evaluating skill calls where the pass criteria aren’t easily codified as a rule.
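The rule-based criteria above can be sketched as small matcher functions. The logged-call shape `{ skill, params }` is an assumption of this sketch, not a format defined by Claude Code or the plugin:

```javascript
// "skill match": the correct skill was called, parameters ignored.
function skillMatch(call, expected) {
  return call.skill === expected.skill;
}

// "param contains": every required keyword appears in the named parameter.
function paramContains(call, paramName, requiredKeywords) {
  const value = String(call.params[paramName] ?? "");
  return requiredKeywords.every((kw) =>
    value.toLowerCase().includes(kw.toLowerCase())
  );
}

// "sequence match": the expected skills appear in order; unrelated
// calls in between are tolerated.
function sequenceMatch(calls, expectedSequence) {
  let matched = 0;
  for (const call of calls) {
    if (matched < expectedSequence.length && call.skill === expectedSequence[matched].skill) {
      matched++;
    }
  }
  return matched === expectedSequence.length;
}
```

Case-insensitive keyword matching is a deliberate choice here: it tolerates phrasing variation in Claude's parameters while still enforcing that the essential terms made it into the call.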


Running the Pass Rate Loop

With eval files in place, the loop itself is straightforward.

Step 1: Run the Eval Suite

Execute each eval case. For each one, record what Claude Code actually did — which skill it called, what parameters it passed, in what sequence.

You can do this programmatically using a test harness built around the MindStudio Agent Skills Plugin. The plugin’s typed method signatures make it easy to wrap calls with logging.
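One way to build that harness, as a sketch: wrap the agent object in a generic JavaScript Proxy that records each method call before forwarding it. The recorder and the mock agent below are assumptions for illustration, not part of the MindStudio SDK:

```javascript
// Generic recorder: wraps any agent object and logs each skill call
// (name, params, timestamp) before forwarding it to the real method.
function withCallLog(agent, log) {
  return new Proxy(agent, {
    get(target, prop) {
      const value = target[prop];
      if (typeof value !== "function") return value;
      return async (params) => {
        log.push({ skill: String(prop), params, at: new Date().toISOString() });
        return value.call(target, params);
      };
    },
  });
}

// Hypothetical stand-in for the real plugin while testing the harness itself.
const mockAgent = {
  async searchGoogle({ query }) { return { results: [], query }; },
  async sendEmail({ to, subject, body }) { return { delivered: true, to }; },
};

const log = [];
const agent = withCallLog(mockAgent, log);
```

After an eval run, `log` holds the record of every skill invocation, which is what you compare against each eval case's expected behavior.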

Step 2: Calculate Pass Rate

Apply the pass criteria to each eval case. Sum the passes, divide by total cases, multiply by 100. That’s your pass rate.

Track this over time. A new system prompt is only an improvement if it raises the pass rate across the full eval suite — not just on the specific cases you tried to fix.
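The arithmetic, with a per-category breakdown included, might look like this. The result shape `{ id, category, passed }` is an assumption of this sketch:

```javascript
// Aggregate eval results into an overall pass rate plus per-category rates,
// both as percentages rounded to one decimal place.
function passRates(results) {
  const byCategory = {};
  let passed = 0;
  for (const r of results) {
    const bucket = (byCategory[r.category] ??= { passed: 0, total: 0 });
    bucket.total += 1;
    if (r.passed) {
      bucket.passed += 1;
      passed += 1;
    }
  }
  const toPct = (p, t) => Math.round((p / t) * 1000) / 10;
  return {
    overall: toPct(passed, results.length),
    byCategory: Object.fromEntries(
      Object.entries(byCategory).map(([c, b]) => [c, toPct(b.passed, b.total)])
    ),
  };
}
```

Computing the per-category numbers in the same pass costs nothing extra and is what lets later guardrails catch a regression that the overall number hides.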

Step 3: Analyze Failures

This is where the AutoResearch pattern adds the most value over manual prompt iteration. Instead of reading every failure and guessing what went wrong, you let Claude analyze the failures.

Give it the full list of failing eval cases — the input, the expected behavior, and what Claude actually did — and ask it to identify patterns. What categories of errors are most common? Is Claude consistently choosing the wrong skill for a particular class of input? Are parameters consistently under-specified in one domain?

Claude is quite good at this kind of structured error analysis when given clean failure data.

Step 4: Generate Candidate Prompts

Based on the failure analysis, Claude generates candidate improvements to the system prompt. This might mean:

  • Adding explicit rules (“When a task involves finding information, prefer searchGoogle over runWorkflow”)
  • Adding few-shot examples that demonstrate correct skill selection and parameterization
  • Restructuring existing instructions to reduce ambiguity
  • Adding clarification-seeking behavior for edge cases

Generate 3–5 candidate prompts per iteration. Don’t just take the first suggestion — test them all.

Step 5: Re-Test and Select

Run each candidate prompt against the full eval suite. Calculate the pass rate for each. Keep the version with the highest pass rate, provided it doesn’t drop performance on any specific category by more than a small threshold.

Then repeat from Step 1.
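The selection rule in this step can be sketched as follows, assuming each run is summarized as an overall pass rate plus per-category rates (a hypothetical shape):

```javascript
// Pick the candidate with the highest overall pass rate, but reject any
// candidate whose per-category rate drops more than `threshold` points
// below the current baseline. Returns null when no candidate is safe,
// meaning the current prompt is kept.
function selectCandidate(baseline, candidates, threshold = 5) {
  const safe = candidates.filter((c) =>
    Object.entries(baseline.byCategory).every(
      ([category, rate]) => (c.byCategory[category] ?? 0) >= rate - threshold
    )
  );
  if (safe.length === 0) return null;
  return safe.reduce((best, c) => (c.overall > best.overall ? c : best));
}
```

Filtering for safety before maximizing the overall number encodes the rule from this step: a candidate that wins on aggregate but craters one category never gets applied.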


Automating the Overnight Loop

The manual version is useful for understanding what’s happening. The real value of the AutoResearch pattern is that the loop can run without you.

What Full Automation Looks Like

A fully automated loop:

  1. Loads the current system prompt and eval files
  2. Runs Claude Code through the eval suite with instrumented skill call logging
  3. Calculates pass rates by category
  4. Prompts a reasoning model to analyze failures and generate candidate improvements
  5. Tests each candidate against the eval suite
  6. Selects the best-performing candidate
  7. Logs the result with before/after pass rates
  8. Optionally commits the new system prompt to version control

You can schedule this as a cron job or a background agent. Fire it off at end of day; review the output the next morning.

Guardrails to Include

Automation without guardrails tends to go sideways. A few safeguards:

  • Regression threshold — If pass rate drops below the current baseline on any eval category, don’t apply the change.
  • Change diff limit — Don’t allow the loop to make sweeping changes to the system prompt in a single iteration. Small, targeted changes are easier to reason about and easier to roll back.
  • Human review gate — For production deployments, log candidate prompts for review rather than auto-applying them.
  • Eval suite version pinning — If you update the eval files during an active loop, track which eval version each result was measured against. Otherwise your metrics aren’t comparable.

Practical Tips for Getting Started

A few things that make a real difference in practice.

Start with High-Frequency Skills

Don’t try to optimize every skill at once. Pick the 3–5 skills that Claude Code calls most often in your workflow and build evals for those first. Getting those right has the highest leverage on overall output quality.

Use Real Tasks, Not Synthetic Ones

The best eval cases come from real workflows. Keep a log of times Claude Code called the wrong skill or passed poor parameters. Those real failures become your most valuable test cases — more representative than anything you’d design from scratch.

Version Your Evals

Eval files should be versioned alongside your system prompts. If a new system prompt raises pass rates but you also changed the eval files, you can’t draw valid conclusions about what improved.

Track Pass Rate by Category

An aggregate pass rate can mask regressions. If your overall pass rate goes from 72% to 78% but your email skill accuracy drops from 90% to 60%, you have a problem. Track category-level metrics separately.

Don’t Over-Optimize for the Evals

This is the classic overfitting problem, translated to prompt engineering. If the system prompt becomes so full of specific instructions that it only works on eval cases and fails on novel inputs, you’ve defeated the purpose. Periodically add new eval cases from real-world usage to keep the suite honest.


How MindStudio Fits This Pattern

The AutoResearch pattern is architecture-agnostic — you can implement it with any AI agent framework. But MindStudio has features that make the implementation faster and more reliable.

The Agent Skills Plugin as the Eval Surface

When Claude Code uses the MindStudio Agent Skills Plugin, every skill call is a typed method with a defined signature. This makes instrumentation straightforward: you wrap the plugin’s methods with logging to capture the exact skill name, parameters, and timestamp for every call.

That logged data becomes the ground truth for your eval loop. No ambiguity about what actually happened — you have a precise record of every skill invocation.

Scheduled Background Agents for the Overnight Loop

MindStudio lets you run agents on a schedule without provisioning server infrastructure. You can create a workflow that runs nightly, pulls the latest skill call logs, compares them against eval criteria, and posts a summary report to Slack or email.

This is the “overnight” part of the overnight improvement loop — handled without a custom server, cron jobs on a VPS, or any infrastructure management.

Workflow Orchestration for Multi-Step Evals

Some evals require multi-step logic: run Claude, intercept the skill call, compare to expected, log the result, aggregate across all cases. MindStudio workflows support this kind of chaining natively, and custom JavaScript functions handle the logic-heavy parts.

Pairing MindStudio with Claude Code gives you a loop where Claude reasons about how to improve its own skill usage, and MindStudio handles execution, logging, and reporting. You can try MindStudio free at mindstudio.ai.


Frequently Asked Questions

What exactly is Karpathy’s AutoResearch pattern?

The AutoResearch pattern is a framework for autonomous AI improvement through iterative evaluation. It applies the supervised learning feedback loop to agent development: define a measurable objective, run the agent, measure performance, analyze failures, make adjustments, repeat. Karpathy has described eval-driven development as essential for building reliable AI applications — without evals, you can’t distinguish improvements from regressions.

How is this different from standard prompt engineering?

Standard prompt engineering is largely manual and intuition-driven: write a system prompt, test it informally, adjust based on what felt wrong, repeat. The AutoResearch pattern is systematic: define pass criteria upfront, measure performance quantitatively, and use those measurements to guide improvements. One produces hunches, the other produces data you can act on.

What does a Claude Code eval file look like in practice?

An eval file defines an input task, the expected skill call (which skill and what parameter requirements), and the criteria for passing. It can be as simple as a JSON object with a task description and expected skill name, or as complex as a multi-step sequence with fuzzy parameter matching and an LLM-as-judge scoring component. Consistency matters more than the specific format.

How many evals do I need to get started?

Start with 20–30 eval cases. That’s enough to surface meaningful patterns in Claude’s skill call behavior without making iteration cycles slow. As you identify real-world failures, add them to the suite. A good eval suite grows organically from actual usage rather than being designed entirely top-down.

Can this loop run fully autonomously without human review?

Technically yes, but full automation in production carries real risk. A better default: run the loop autonomously, log proposed improvements with before/after pass rates, and have a human approve changes before they’re deployed. In development and staging environments, full automation is lower risk and a reasonable way to move faster.

What pass rate should I target?

It depends on the application. For non-critical workflows, 80–85% across your eval suite is a reasonable starting target. For anything involving external communications, financial operations, or irreversible actions, you want 95%+ before deploying. Use the pass rate as a directional metric — consistent improvement matters more than hitting a specific number.


Key Takeaways

  • The AutoResearch pattern applies the ML feedback loop to prompt engineering: define evals, measure pass rates, analyze failures, improve, and repeat.
  • Claude Code skills are typed callable capabilities — what you’re optimizing is how Claude decides which skill to call and how to parameterize it.
  • Eval files define test cases: input tasks, expected skill behavior, and pass criteria. Start with 20–30 cases focused on high-frequency skills.
  • The loop can run overnight and autonomously when built with proper guardrails, including regression thresholds and human review gates for production changes.
  • The MindStudio Agent Skills Plugin provides the typed capability surface that makes skill call logging and eval comparison straightforward — and MindStudio’s background agents handle the scheduling and reporting without custom infrastructure.
  • Track pass rates by category, not just in aggregate, to catch regressions that an overall number would hide.

If you want to build this loop without provisioning custom infrastructure, MindStudio’s background agents and workflow tools can handle the scheduling, logging, and reporting. Try MindStudio free to get started.