What Is Andrej Karpathy's AutoResearch Applied to Claude Code Skills?
Learn how to apply Karpathy's AutoResearch loop to Claude Code skills using binary assertions, eval.json, and autonomous overnight improvement cycles.
The Core Idea: Teaching AI to Grade Its Own Work
The most valuable thing about Andrej Karpathy’s approach to automated research isn’t the tooling — it’s the mindset shift. Instead of evaluating AI outputs manually and adjusting prompts by hand, you build a system where the evaluation loop runs itself.
Applied to Claude Code skills, this means treating individual coding behaviors as measurable units, defining what “good” looks like in binary terms, and letting the system discover improvements while you sleep.
This article breaks down how that works: what AutoResearch is, how binary assertions create clean feedback signals, how eval.json structures the process, and how to wire everything into an autonomous overnight improvement cycle.
The three technical pillars are:
- Binary assertions — simple pass/fail checks that define what “working” means
- eval.json — a structured file that tracks performance across iterations
- Overnight improvement cycles — scheduled loops that run evals, analyze failures, propose changes, and re-test
What Karpathy’s AutoResearch Framework Is
Karpathy has consistently argued that the bottleneck in AI development isn’t model capability — it’s evaluation. If you can’t measure whether something got better, you can’t reliably make it better.
AutoResearch is a direct application of this principle. Rather than a human sitting in the loop reviewing outputs and nudging prompts, you define measurable success criteria upfront, automate the evaluation, and let an LLM analyze its own failures and propose corrections.
The analogy is to the scientific method: generate a hypothesis, test it, analyze results, form a revised hypothesis. The difference is that an AI agent can run this cycle dozens of times overnight, in parallel, without losing attention or context.
The core loop looks like this:
- Define a skill (narrow, specific, executable)
- Write binary assertions that define success
- Run the agent against test cases
- Log which assertions fail
- Pass failure data to an LLM that proposes a prompt improvement
- Apply the change and re-test
- Repeat
What makes this “automated” rather than just “assisted” is that a human is only involved at the beginning — defining the skill and criteria — and at the end, reviewing what improved. The inner loop runs without supervision.
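The inner loop above can be sketched as a small orchestration function. This is a hypothetical skeleton, not Anthropic's or Karpathy's actual code: `runEvals` and `proposeFix` are placeholders for whatever your orchestration layer calls (an eval harness and an LLM analysis step, respectively).

```typescript
type EvalResult = { passRate: number; failedAssertions: string[] };

async function improvementCycle(
  runEvals: (prompt: string) => Promise<EvalResult>,
  proposeFix: (prompt: string, failures: string[]) => Promise<string>,
  initialPrompt: string,
  iterations: number
): Promise<{ prompt: string; history: EvalResult[] }> {
  let prompt = initialPrompt;
  const history: EvalResult[] = [];
  let best = await runEvals(prompt); // baseline run with the current prompt
  history.push(best);
  for (let i = 0; i < iterations && best.passRate < 1; i++) {
    const candidate = await proposeFix(prompt, best.failedAssertions); // analyze failures
    const result = await runEvals(candidate); // re-test the proposed prompt
    history.push(result);
    if (result.passRate > best.passRate) {
      prompt = candidate; // commit the improvement
      best = result;
    } // otherwise roll back by keeping the previous prompt
  }
  return { prompt, history };
}
```

Because the two dependencies are injected, the loop itself is trivially testable with mocks before you wire in real model calls.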
Claude Code Skills as Improvable Units
Claude Code is Anthropic’s terminal-based coding assistant. It reads, writes, runs, and modifies code directly in your environment. But “good at coding” isn’t a single skill — it’s a collection of discrete behaviors, each of which can be defined, measured, and improved independently.
Some examples of skills worth isolating:
- Writing unit tests that achieve meaningful coverage
- Refactoring without breaking existing test suites
- Generating accurate docstrings from function signatures
- Identifying and fixing type errors in TypeScript
- Producing commit messages that conform to conventional commit standards
- Summarizing diff output into plain-English descriptions
Each of these is small enough to evaluate cleanly. A test suite either passes or it doesn’t. A commit message either follows the spec or it doesn’t.
Why Granularity Matters
If your skill definition is too broad — “write good code” — evaluation becomes subjective and the improvement signal disappears. But if you define “write a Jest test for this function that covers the happy path, a null input case, and one boundary case,” you can check all three assertions mechanically.
The tighter the skill definition, the more actionable the eval loop becomes. Broad definitions produce noise. Narrow ones produce signal.
This is also consistent with how MindStudio approaches building AI agents — breaking complex behavior into composable, testable units rather than prompting one giant system to “do everything well.”
Binary Assertions: Why Pass/Fail Beats Scoring
The natural instinct when evaluating AI output is to score it. “This response was a 7 out of 10.” The problem: scores are noisy, subjective, and hard to act on automatically. What’s the difference between a 6 and a 7? How does an agent reliably optimize toward a higher number?
Binary assertions eliminate this ambiguity. An assertion is a statement that is either true or false:
- The generated code compiles without errors → true/false
- All generated unit tests pass → true/false
- The output matches the expected JSON schema → true/false
- No functions exceed 50 lines → true/false
- The output includes at least three distinct test cases → true/false
Binary outcomes are:
- Automatable — a script evaluates them without human review
- Composable — you can combine assertions into a suite
- Debuggable — when a run fails, you know exactly which check failed and why
- Stable — the same input consistently produces the same evaluation result
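To make the automatable/composable/debuggable claims concrete, here is a minimal sketch of a binary assertion suite evaluated over the text of a generated test file. The specific checks (regex on `null`, counting `it(` calls, scanning for `TODO`) are illustrative stand-ins, not a prescribed set.

```typescript
type Assertion = { id: string; check: (output: string) => boolean };

// Each check returns strictly true or false: no scores, no ambiguity.
const assertions: Assertion[] = [
  { id: "tests_include_null_case", check: (o) => /null/.test(o) },
  { id: "has_three_test_cases", check: (o) => (o.match(/\bit\(/g) ?? []).length >= 3 },
  { id: "no_todo_left", check: (o) => !o.includes("TODO") },
];

// Composable: the suite runs as a unit, and a failure names the exact check.
function evaluateOutput(output: string): Record<string, boolean> {
  return Object.fromEntries(
    assertions.map((a) => [a.id, a.check(output)] as [string, boolean])
  );
}
```

The same input always produces the same result map, which is what makes runs comparable across prompt versions.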
Writing Effective Assertions
A useful binary assertion has three properties:
- Unambiguous — it can be evaluated by a script or a secondary LLM judge with a clear yes/no answer
- Directly tied to the skill — it measures the actual capability, not a proxy
- Failure-informative — when it fails, the failure tells you something useful about what went wrong
Avoid assertions that require subjective judgment (“the code is readable”). Prefer assertions tied to objective, executable outcomes (“pylint scores this file above 8.0” or “the function has no parameters with single-character names”).
You don’t need many assertions per skill. Three to five well-chosen binary checks outperform a dozen vague ones. Quality over quantity matters here.
Designing Your eval.json Structure
The eval.json file is the backbone of the AutoResearch loop. It tracks everything: the skill definition, test cases, assertion logic, and the full history of results across iterations.
A minimal eval.json structure looks like this:
{
  "skill_id": "generate_unit_tests",
  "skill_description": "Generate Jest unit tests covering the happy path, null input, and one boundary case for a given TypeScript function",
  "version": 3,
  "assertions": [
    "output_compiles",
    "tests_include_happy_path",
    "tests_include_null_case",
    "tests_include_boundary_case",
    "all_generated_tests_pass"
  ],
  "test_cases": [
    {
      "input": "function add(a: number, b: number): number { return a + b; }",
      "expected_assertions_pass": ["output_compiles", "tests_include_happy_path"]
    }
  ],
  "run_history": [
    {
      "run_id": "run_001",
      "timestamp": "2025-01-15T02:00:00Z",
      "prompt_version": "v1",
      "pass_rate": 0.60,
      "failed_assertions": ["tests_include_boundary_case", "tests_include_null_case"],
      "failure_notes": "Model consistently skips edge cases when the function signature appears simple"
    },
    {
      "run_id": "run_002",
      "timestamp": "2025-01-16T02:00:00Z",
      "prompt_version": "v2",
      "pass_rate": 0.85,
      "failed_assertions": ["tests_include_boundary_case"],
      "failure_notes": "Null case now covered; boundary case inconsistent for numeric inputs"
    }
  ]
}
What to Track in Each Run
For each eval run, capture:
- Timestamp — when the run executed
- Prompt version — which version of the skill prompt was used
- Pass rate — percentage of assertions that passed across all test cases
- Failed assertions — the specific checks that failed
- Failure notes — a brief analysis of the failure pattern (this can itself be generated by an LLM)
Over time, run_history becomes a clear record of what changed and whether it helped. You can see which assertions are consistently problematic and whether the skill is still improving or has plateaued.
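Two analyses fall out of run_history almost for free: detecting a plateau and ranking chronically failing assertions. The sketch below assumes the field names from the eval.json example; the window and epsilon values are arbitrary assumptions for illustration.

```typescript
type Run = { run_id: string; pass_rate: number; failed_assertions: string[] };

// A skill has plateaued when the last few runs barely moved the pass rate.
function hasPlateaued(history: Run[], window = 3, eps = 0.02): boolean {
  if (history.length < window + 1) return false;
  const rates = history.slice(-(window + 1)).map((r) => r.pass_rate);
  return rates[rates.length - 1] - rates[0] < eps;
}

// Count how often each assertion fails across runs: the checks with the
// highest counts are where the next analysis pass should focus.
function chronicFailures(history: Run[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const run of history)
    for (const a of run.failed_assertions)
      counts.set(a, (counts.get(a) ?? 0) + 1);
  return counts;
}
```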
Managing Multiple Skills
If you’re running AutoResearch on several skills simultaneously, keep a separate eval.json per skill and maintain a top-level manifest:
{
  "skills": [
    { "id": "generate_unit_tests", "current_pass_rate": 0.85, "status": "improving" },
    { "id": "write_conventional_commits", "current_pass_rate": 0.97, "status": "stable" },
    { "id": "refactor_without_regressions", "current_pass_rate": 0.71, "status": "active" }
  ]
}
This gives you an at-a-glance overview of where active improvement is happening versus where skills have stabilized.
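The manifest can be derived mechanically from the per-skill files rather than maintained by hand. In this sketch the status labels follow the manifest example above, while the 0.95 "stable" threshold is an assumption, not a fixed rule.

```typescript
type SkillEval = { skill_id: string; run_history: { pass_rate: number }[] };

// Derive the top-level manifest from each skill's run history.
function buildManifest(evals: SkillEval[]) {
  return {
    skills: evals.map((e) => {
      const rates = e.run_history.map((r) => r.pass_rate);
      const current = rates[rates.length - 1] ?? 0;
      const previous = rates[rates.length - 2] ?? current;
      const status =
        current >= 0.95 ? "stable" : current > previous ? "improving" : "active";
      return { id: e.skill_id, current_pass_rate: current, status };
    }),
  };
}
```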
Building the Overnight Improvement Cycle
With binary assertions and eval.json in place, the loop itself follows four stages:
Stage 1: Run Evals
The orchestrating agent runs Claude Code against each test case using the current prompt version. For each case, it checks every assertion and logs pass/fail results to eval.json. Nothing changes at this stage — you’re only collecting data.
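Stage 1 can be sketched as a pure data-collection pass. Here `generate` stands in for the actual Claude Code invocation; injecting it keeps the stage testable and keeps the logging logic separate from the model call.

```typescript
type TestCase = { input: string };
type Check = { id: string; check: (output: string) => boolean };

async function runEvalStage(
  generate: (prompt: string, input: string) => Promise<string>,
  prompt: string,
  cases: TestCase[],
  checks: Check[]
) {
  const results: { input: string; failed: string[] }[] = [];
  for (const c of cases) {
    const output = await generate(prompt, c.input);
    const failed = checks.filter((k) => !k.check(output)).map((k) => k.id);
    results.push({ input: c.input, failed }); // log only; change nothing yet
  }
  const total = cases.length * checks.length;
  const failures = results.reduce((n, r) => n + r.failed.length, 0);
  return { pass_rate: total ? (total - failures) / total : 0, results };
}
```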
Stage 2: Analyze Failures
The agent passes the failure data to an LLM with a structured prompt:
“Here are the failing assertions from the last eval run on the skill ‘generate_unit_tests.’ The prompt used was: [prompt]. The most common failures were: [failed assertions with examples]. Analyze the failure pattern and propose one specific change to the prompt that would address these failures.”
The LLM generates a hypothesis. For example: “The prompt doesn’t explicitly require edge cases. Add a line specifying that at least one null input test and one boundary value test must be included.”
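Assembling that analysis prompt is simple string construction. The wording below mirrors the template quoted above; the function name is a hypothetical convenience, not part of any SDK.

```typescript
// Build the Stage 2 analysis prompt from the last run's failure data.
function buildAnalysisPrompt(
  skillId: string,
  currentPrompt: string,
  failedAssertions: string[]
): string {
  return [
    `Here are the failing assertions from the last eval run on the skill '${skillId}'.`,
    `The prompt used was: ${currentPrompt}`,
    `The most common failures were: ${failedAssertions.join(", ")}`,
    "Analyze the failure pattern and propose one specific change to the prompt that would address these failures.",
  ].join("\n");
}
```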
Stage 3: Apply the Change
The agent updates the prompt according to the LLM’s recommendation and increments the prompt version in eval.json.
Some teams also run an A/B comparison here — testing both the old and new prompt against the full test suite and only committing the change if the new version improves pass rate by a meaningful margin (5+ percentage points is a reasonable threshold).
Stage 4: Re-Test and Log
The agent re-runs the eval with the new prompt and records results in run_history. If pass rate improved, the new prompt becomes the active version. If it regressed, the change is rolled back and the failure is logged as context for the next analysis pass.
The cycle then sleeps until the next scheduled run.
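The commit-or-rollback decision at the end of Stages 3 and 4 is a small gate. This sketch uses the 5-percentage-point margin suggested above as the default; the type and field names are assumptions for illustration.

```typescript
type ActivePrompt = { prompt: string; version: number; pass_rate: number };

// Commit only when the candidate clears the margin; otherwise keep the old
// prompt and record a note as context for the next analysis pass.
function commitOrRollback(
  active: ActivePrompt,
  candidatePrompt: string,
  candidateRate: number,
  margin = 0.05
): { active: ActivePrompt; committed: boolean; note: string } {
  if (candidateRate - active.pass_rate >= margin) {
    return {
      active: { prompt: candidatePrompt, version: active.version + 1, pass_rate: candidateRate },
      committed: true,
      note: `committed v${active.version + 1} at ${candidateRate}`,
    };
  }
  return {
    active, // rollback: the previous prompt stays live
    committed: false,
    note: `candidate rejected (${candidateRate} vs ${active.pass_rate}); kept v${active.version}`,
  };
}
```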
Scheduling the Loop
Schedule the full cycle to start after you finish work — 11 PM is a common choice. By morning, you have six to eight iterations of prompt improvement logged in eval.json, with a detailed history of what changed and why.
Most implementations find that three to five meaningful improvements per night are realistic for a well-defined skill. Skills typically plateau around 85–95% pass rate; the remaining failures usually reflect edge cases in the test suite or fundamental model limitations on that specific task.
Practical Skills Worth Running AutoResearch On
Not every Claude Code behavior is worth building an eval loop for. The highest-value candidates are skills you use frequently, that are currently inconsistent, and that are easy to define with binary assertions.
Test generation — Coverage quality varies. Binary assertions on test count, branch coverage percentage, and whether generated tests actually pass make this highly evaluable.
Code review comments — Assertions like “comment identifies at least one potential bug,” “no comments are purely stylistic without explanation,” and “all suggestions include a rationale” can be evaluated reliably by a secondary LLM judge.
Error message rewriting — Given a cryptic error, Claude should produce a human-readable version. Assertions: “new message identifies the cause,” “new message suggests a fix,” “new message is under 50 words.”
SQL query optimization — Given a slow query, Claude proposes an optimization. Assertions: “original and optimized queries return identical results on sample data,” “EXPLAIN shows reduced row scans.”
Dependency update summaries — When updating packages, Claude should summarize breaking changes. Assertions: “summary mentions version numbers,” “summary flags breaking changes,” “summary is accurate against changelog.”
These are all narrow enough to evaluate cleanly and common enough that improving them compounds over time.
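As one concrete example, the error-message-rewriting assertions above reduce to short checks. The fix-keyword regex is a naive stand-in; in practice a secondary LLM judge would evaluate whether the message "suggests a fix."

```typescript
// Binary checks for the error-message-rewriting skill (illustrative).
const underFiftyWords = (msg: string): boolean =>
  msg.trim().split(/\s+/).filter(Boolean).length <= 50;

const suggestsFix = (msg: string): boolean =>
  /\b(try|use|install|update|change|set|add|remove)\b/i.test(msg);
```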
Where MindStudio Fits Into This Workflow
Building the orchestration layer for an AutoResearch loop — scheduling runs, logging to eval.json, triggering LLM calls, managing retries, sending notifications — is genuinely tedious to build from scratch. This is where MindStudio’s Agent Skills Plugin makes a practical difference.
The plugin is an npm SDK (@mindstudio-ai/agent) that gives Claude Code — or any other AI agent — access to over 120 typed capabilities as straightforward method calls. Instead of wiring up infrastructure yourself, the agent calls agent.runWorkflow() to trigger an eval run, agent.searchGoogle() to pull relevant documentation during failure analysis, or agent.sendEmail() to notify you when a skill hits its target pass rate.
For the AutoResearch loop specifically, you can build the eval orchestration as a MindStudio background agent — one that runs on a schedule, reads your eval.json files, calls Claude to analyze failures, applies prompt changes, and logs results without any manual triggering. MindStudio handles scheduling, retries, and logging; Claude Code handles the reasoning and code generation.
This is a clean division of labor. Claude Code is good at reasoning about code and generating improvements. MindStudio is good at running reliable, scheduled multi-step workflows. The combination means you can stand up an AutoResearch loop without building your own scheduler, error handling, or workflow state management.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
What is Karpathy’s AutoResearch, exactly?
AutoResearch refers to using AI agents to run iterative improvement loops autonomously — generating hypotheses, testing them, analyzing results, and feeding that analysis back into the next iteration without human intervention at each step. Karpathy has discussed this concept in the context of LLMs doing automated experimentation: the scientific method, but running continuously and at machine speed. Applied to Claude Code skills, it means the agent evaluates its own outputs, identifies failure patterns, and proposes better approaches automatically.
Why use binary assertions instead of LLM-as-judge scoring?
LLM-as-judge scoring (asking an LLM to rate outputs on a scale of 1–10) is useful for some tasks but introduces noise into an improvement loop. Binary assertions produce cleaner signals: a test either passes or doesn’t. This makes it much easier to detect whether a prompt change actually improved the skill versus just generating a slightly different score due to model temperature or phrasing variation. For technical skills like code generation, assertions based on actual execution results are almost always more reliable than subjective scoring.
How many test cases do I need in eval.json for meaningful results?
A reasonable starting point is 10–20 test cases per skill. Fewer than 10 and you risk overfitting your prompt to a small sample. More than 50 slows down iteration cycles significantly without proportionate signal gain. Prioritize diversity: include simple inputs, complex inputs, edge cases, and examples drawn from your actual use cases. Diversity matters more than raw count.
Can AutoResearch work without overnight scheduling?
Yes. Overnight scheduling is a convenience, not a requirement. You can run the loop interactively during the day or trigger it manually. The overnight framing is useful because it sets expectations: you define the skill and assertions in the afternoon, the loop runs unattended, and you review results in the morning. But the same loop can complete in under 20 minutes if your test suite is small and your compute is fast.
What happens when a skill plateaus at a low pass rate?
A plateau below 80% usually means one of three things: your test cases include examples that are genuinely too hard for the current model; your assertions are ambiguous enough that failures are inconsistent; or your skill definition is too broad. The right fix is to audit failing test cases manually, determine which failures represent real deficiencies versus unrealistic expectations, and narrow either the skill scope or the assertions accordingly. A well-scoped skill on a current model like Claude Sonnet or Opus should reach 85–90% fairly reliably.
Does AutoResearch require Claude Code specifically?
No. The framework applies to any AI coding agent. Claude Code is a natural fit because it operates natively in a code execution environment and follows structured instructions reliably. But the same principles — binary assertions, eval.json tracking, autonomous improvement cycles — work with GPT-based agents, custom agents built on MindStudio, or any other system capable of generating and executing code. The framework is model-agnostic; the implementation just needs to support structured input/output and code execution.
Key Takeaways
- AutoResearch replaces manual iteration with a structured eval loop: define the skill, set binary assertions, run autonomously, review results — not individual outputs.
- Binary assertions are the foundation — they produce clean, automatable signals rather than noisy scores. Three to five well-chosen checks per skill is enough.
- eval.json is your single source of truth — it tracks skill definitions, test cases, assertion results, and improvement history across every run.
- Overnight cycles compound quickly — a skill starting at 60% pass rate can reach 85–90% after a few nights of automated iteration.
- Granularity determines usefulness — narrow skill definitions produce actionable eval results; broad ones produce ambiguous signals that are difficult to improve against.
If you want to run AutoResearch loops without building all the orchestration infrastructure yourself, MindStudio’s background agents and Agent Skills Plugin give you scheduling, workflow management, and integrations out of the box. Try it free at mindstudio.ai.