
How to Build a Self-Improving AI Skill with Eval.json and Claude Code

Set up an eval folder with binary assertions, run a Karpathy-style improvement loop, and let Claude Code refine your skill.md overnight without human input.

MindStudio Team

The Case for Automated Prompt Improvement

Most AI prompt work follows the same slow cycle: run the prompt, see something wrong, edit manually, run it again. That process is bottlenecked entirely on your availability — and it’s subjective, meaning different people editing the same prompt will optimize for different things.

The more rigorous approach, articulated clearly by Andrej Karpathy and the broader ML evaluation community, is to define what “good” looks like upfront as structured test cases, then automate the improvement loop. You write your evals first. You measure your pass rate. You improve against it. You repeat.

This guide shows you exactly how to build a self-improving AI skill using eval.json and Claude Code. You’ll set up a folder structure, write binary assertion test cases, build a simple eval runner, and configure Claude Code to refine your skill.md through an autonomous overnight loop — no human input needed between iterations.

By the time you’re done, you’ll have a measurably improved system prompt with a documented pass rate and a full log of every change made.


Core Concepts: skill.md, eval.json, and Claude Code

Before touching any files, make sure you’re clear on what each piece does and how it fits together.

skill.md

A skill.md file is your AI’s system prompt stored as markdown. The format is intentional: markdown is readable, easily diffable in version control, and straightforward for Claude Code to edit programmatically.

A minimal skill.md for a customer support classifier might look like:

# Support Ticket Classifier

You classify incoming support messages into one of three categories: BILLING, TECHNICAL, or GENERAL.

## Rules
- Reply with ONLY the category label in uppercase.
- Do not include explanations or additional text.
- If a message fits two categories, pick the more specific one.

That’s all you need to start. The loop will refine the details.

eval.json and Binary Assertions

Your eval.json holds test cases that define what the skill should do. Each case includes an input and one or more binary assertions — checks that evaluate to either pass or fail. No partial credit. No judgment calls.

Binary assertions are the right default for automation because they’re deterministic (same result every run), fast (no secondary LLM calls needed), and unambiguous. Either the output contains the expected string or it doesn’t.
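In code, a binary assertion reduces to a single boolean check. Here's a minimal sketch — the `passes` helper and its assertion names are illustrative, not a fixed API:

```python
def passes(output: str, atype: str, value) -> bool:
    """Return True if the model output satisfies one binary assertion."""
    out = output.strip()
    if atype == "exact":
        return out.lower() == str(value).lower()
    if atype == "contains":
        return str(value).lower() in out.lower()
    if atype == "max_length":
        return len(out) <= int(value)
    raise ValueError(f"unknown assertion type: {atype}")

print(passes("BILLING\n", "exact", "billing"))   # True: whitespace and case ignored
print(passes("It's BILLING.", "max_length", 8))  # False: 13 characters
```

The full runner later in this guide expands this into a dispatch over all the assertion types.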

Claude Code

Claude Code is Anthropic’s agentic coding tool, run from the terminal. It can read files, write files, execute scripts, and reason about what to do next — all without human prompting. That’s what makes the overnight loop viable. You give it a task and a stopping condition, and it runs.

How the Loop Works

The logic is simple:

  1. Run skill.md against all cases in eval.json
  2. Calculate the pass rate
  3. If the target is met, stop
  4. Otherwise, analyze the failures and rewrite skill.md
  5. Return to step 1

You automate steps 2–5. Claude Code handles the reasoning in step 4. You show up in the morning to a better prompt.
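The steps above can be sketched as a single function. This is only a skeleton — `run_evals` and `improve` stand in for the real scripts built later:

```python
def improvement_loop(run_evals, improve, target=0.95, max_iters=10):
    """Steps 1-5: evaluate, stop at target, otherwise rewrite and repeat."""
    results = run_evals()                   # steps 1-2: run cases, get pass rate
    for _ in range(max_iters):
        if results["pass_rate"] >= target:  # step 3: stopping condition
            break
        improve(results)                    # step 4: analyze failures, edit skill.md
        results = run_evals()               # step 5: back to step 1
    return results
```

The stopping condition and iteration cap are the two knobs that keep an unattended loop safe.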


Set Up Your Project Structure

Create this folder layout before writing any code:

my-skill/
├── skill.md           # The system prompt being improved
├── CLAUDE.md          # Instructions for Claude Code
├── improve_skill.py   # The improvement loop
└── eval/
    ├── eval.json      # Test cases with binary assertions
    ├── run_evals.py   # The eval runner
    └── results/       # Auto-created output folder

Initialize the project:

mkdir my-skill && cd my-skill
git init
python -m venv venv
source venv/bin/activate
pip install anthropic
export ANTHROPIC_API_KEY="your-key-here"

Commit your initial skill.md before running any loops. That way you can always roll back if an iteration makes things worse.

git add skill.md && git commit -m "initial skill"

Write Your Initial skill.md

Your starting prompt doesn’t need to be perfect — in fact, a rough starting point makes the improvement loop more visible and satisfying. Aim for something that covers the task definition, output format, and the clearest edge cases. Leave the finer points for the loop.

Here’s a starting point for the classifier example:

# Support Ticket Classifier

You are a customer support classifier. Read an incoming support message and classify it.

## Categories
- **BILLING** — Charges, invoices, refunds, subscriptions, pricing questions
- **TECHNICAL** — Bugs, errors, broken features, API issues, integration problems
- **GENERAL** — Anything that doesn't fit the categories above

## Output Format
Reply with only the category name in uppercase. No other text. No punctuation.

Examples:
- "I was charged twice this month" → BILLING
- "The API keeps returning a 500 error" → TECHNICAL
- "What are your business hours?" → GENERAL

Short, clear, unambiguous about output format. This is your baseline. The loop will handle the rest.


Design Your eval.json File

This is the most important step in the whole process. Your evals define what “good” means. If they’re too easy, the loop converges too fast and you don’t learn much. If they’re contradictory, the loop will oscillate without improving. If they don’t reflect real inputs, the loop optimizes for the wrong distribution.

The Structure of Each Test Case

Every test case needs four fields:

  • id — Unique, human-readable identifier
  • description — What behavior you’re testing
  • input — The user message
  • assertions — One or more binary checks
Here's what those fields look like for the classifier:

[
  {
    "id": "billing-charge-error",
    "description": "Clear billing question about a charge",
    "input": "I was charged $99 but I only signed up for the $49 plan.",
    "assertions": [
      { "type": "exact", "value": "BILLING" }
    ]
  },
  {
    "id": "technical-api-error",
    "description": "API authentication failure is TECHNICAL",
    "input": "I keep getting a 403 error when I try to authenticate.",
    "assertions": [
      { "type": "exact", "value": "TECHNICAL" }
    ]
  },
  {
    "id": "no-extra-text",
    "description": "Output should only be the label — nothing else",
    "input": "Where can I find my invoices?",
    "assertions": [
      { "type": "contains", "value": "BILLING" },
      { "type": "max_length", "value": 10 }
    ]
  },
  {
    "id": "ambiguous-prefer-specific",
    "description": "When an issue spans categories, pick more specific one",
    "input": "My account was downgraded and now the integrations aren't working.",
    "assertions": [
      { "type": "exact", "value": "BILLING" }
    ]
  },
  {
    "id": "general-hours",
    "description": "Business hours question is GENERAL",
    "input": "Do you offer weekend support?",
    "assertions": [
      { "type": "exact", "value": "GENERAL" }
    ]
  }
]

Assertion Types to Support

Build your runner to handle at least these:

  • exact — Output exactly equals the value (case-insensitive)
  • contains — Output contains the string
  • not_contains — Output does NOT contain the string
  • max_length — Character count ≤ value
  • min_length — Character count ≥ value
  • regex — Output matches a regular expression

You can add an llm_judge type later for cases that require semantic evaluation — but keep them sparse. They’re slower and add API cost per eval run.
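If you do add an llm_judge type, keep the judge call behind a small seam so the rest of the runner stays deterministic and testable. A sketch — the `ask` callable is an assumption standing in for whatever client wrapper you use (e.g. around anthropic's messages.create):

```python
def llm_judge(output: str, criterion: str, ask) -> bool:
    """Binary-ize a semantic check: ask a judge model a strict YES/NO question."""
    prompt = (
        f"Output to evaluate:\n{output}\n\n"
        f"Criterion: {criterion}\n"
        "Answer with exactly YES or NO."
    )
    return ask(prompt).strip().upper().startswith("YES")
```

Inside the runner's assertion dispatch, an llm_judge branch would call this and return the boolean like any other check.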

How Many Test Cases to Write

Start with 20–50. That range covers core behaviors, edge cases, format constraints, and a few adversarial inputs without making each loop iteration expensive.

Seed your evals with real examples from your use case wherever possible. Synthetic test cases get you started, but production inputs reveal behavior that controlled examples never will.
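Before trusting the suite, it's worth linting it. Here's a small sanity check — the `lint_evals` helper is illustrative — that catches two common failure modes: duplicate ids, and the same input expected to produce two different exact values:

```python
import json
from collections import Counter
from pathlib import Path

def lint_evals(cases: list[dict]) -> list[str]:
    """Flag duplicate ids and identical inputs with conflicting exact assertions."""
    problems = []
    for case_id, n in Counter(c["id"] for c in cases).items():
        if n > 1:
            problems.append(f"duplicate id: {case_id}")
    expected = {}  # input text -> first exact value seen
    for c in cases:
        for a in c["assertions"]:
            if a["type"] == "exact":
                prev = expected.setdefault(c["input"], a["value"])
                if prev != a["value"]:
                    problems.append(f"conflicting expectations for: {c['input']!r}")
    return problems

# print(lint_evals(json.loads(Path("eval/eval.json").read_text())))
```

An empty list means the suite is at least internally consistent; it says nothing about whether the cases are hard enough.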


Build the Eval Runner Script

Create eval/run_evals.py. This script reads skill.md, sends each test case through the API, checks assertions, and outputs a structured JSON result.

import json
import os
import re
import sys
from pathlib import Path
import anthropic

MODEL = "claude-3-5-sonnet-20241022"
SKILL_PATH = Path("skill.md")
EVAL_PATH = Path("eval/eval.json")
RESULTS_DIR = Path("eval/results")


def check_assertion(output: str, assertion: dict) -> tuple[bool, str]:
    atype = assertion["type"]
    value = assertion.get("value", "")

    if atype == "exact":
        passed = output.strip().lower() == str(value).lower()
        return passed, f"Expected exact: '{value}', got: '{output.strip()}'"

    elif atype == "contains":
        passed = str(value).lower() in output.lower()
        return passed, f"Expected to contain: '{value}'"

    elif atype == "not_contains":
        passed = str(value).lower() not in output.lower()
        return passed, f"Expected NOT to contain: '{value}'"

    elif atype == "max_length":
        passed = len(output.strip()) <= int(value)
        return passed, f"Expected length ≤ {value}, got {len(output.strip())}"

    elif atype == "min_length":
        passed = len(output.strip()) >= int(value)
        return passed, f"Expected length ≥ {value}, got {len(output.strip())}"

    elif atype == "regex":
        passed = bool(re.search(str(value), output))
        return passed, f"Expected match for regex: '{value}'"

    return False, f"Unknown assertion type: {atype}"


def run_evals() -> dict:
    client = anthropic.Anthropic()
    system_prompt = SKILL_PATH.read_text()
    evals = json.loads(EVAL_PATH.read_text())

    results = []

    for case in evals:
        response = client.messages.create(
            model=MODEL,
            max_tokens=256,
            system=system_prompt,
            messages=[{"role": "user", "content": case["input"]}]
        )
        output = response.content[0].text

        case_passed = True
        failures = []

        for assertion in case["assertions"]:
            passed, message = check_assertion(output, assertion)
            if not passed:
                case_passed = False
                failures.append(message)

        results.append({
            "id": case["id"],
            "description": case.get("description", ""),
            "input": case["input"],
            "output": output,
            "passed": case_passed,
            "failures": failures
        })

    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    output_data = {
        "pass_rate": round(pass_rate, 4),
        "passed": sum(1 for r in results if r["passed"]),
        "total": len(results),
        "results": results
    }

    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
    (RESULTS_DIR / "latest.json").write_text(json.dumps(output_data, indent=2))

    print(json.dumps(output_data))
    return output_data


if __name__ == "__main__":
    run_evals()

Test it before setting up the loop:

python eval/run_evals.py

You’ll get a JSON object with the pass rate and per-test results. Note the failures — they’ll give you a baseline and often reveal something obvious about your initial prompt.
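To see just the failures at a glance, you can filter the runner's output. A small helper (the `failing` name is illustrative) over the JSON schema the runner writes:

```python
import json
from pathlib import Path

def failing(results: dict) -> list[str]:
    """One line per failing case: id plus the assertion messages."""
    return [
        f"{r['id']}: {'; '.join(r['failures'])}"
        for r in results["results"]
        if not r["passed"]
    ]

# after a run:
# print("\n".join(failing(json.loads(Path("eval/results/latest.json").read_text()))))
```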


Run the Self-Improvement Loop with Claude Code

This is where the overnight automation happens.

Create a CLAUDE.md File

A CLAUDE.md file gives Claude Code persistent context about your project. Claude Code reads it automatically when it starts in your project directory. Keep it focused on what Claude needs to know:

# Skill Improvement Project

## What This Is
We're iteratively improving `skill.md` using evals. The skill classifies customer
support messages as BILLING, TECHNICAL, or GENERAL.

## How to Run Evals
Run: python eval/run_evals.py
Output is JSON. Check `pass_rate`. Failing tests are in the `results` array where `passed` is false.

## Rules
- Only modify `skill.md`. Never modify eval/eval.json or eval/run_evals.py.
- Keep the system prompt under 500 words.
- Preserve behavior for passing tests when fixing failing ones.
- After each iteration, append a one-line summary to improvement_log.md.

## Target
Stop when pass_rate >= 0.95 or after 10 iterations.

The Improvement Script

Create improve_skill.py. This calls the eval runner as a subprocess, analyzes failures with Claude, and writes an updated skill.md each iteration:

import json
import subprocess
import time
from pathlib import Path
import anthropic

client = anthropic.Anthropic()

MAX_ITERATIONS = 10
TARGET_PASS_RATE = 0.95
MODEL = "claude-3-5-sonnet-20241022"


def run_evals() -> dict:
    result = subprocess.run(
        ["python", "eval/run_evals.py"],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"eval runner failed:\n{result.stderr}")
    return json.loads(result.stdout)


def improve_skill(current_skill: str, eval_results: dict) -> str:
    failures = [r for r in eval_results["results"] if not r["passed"]]

    prompt = f"""You are improving an AI skill prompt.

Current skill.md:
---
{current_skill}
---

Pass rate: {eval_results['pass_rate']:.1%} ({eval_results['passed']}/{eval_results['total']} tests passing)

Failing tests:
{json.dumps(failures, indent=2)}

Analyze why these tests are failing. Identify root causes — one fix often resolves
multiple failures. Rewrite the skill prompt to fix the failures without breaking
passing tests.

Return ONLY the improved skill.md content. No preamble."""

    response = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text


def log_iteration(i: int, pass_rate: float, note: str = ""):
    entry = f"- Iteration {i}: pass_rate={pass_rate:.1%} {note}\n"
    with open("improvement_log.md", "a") as f:
        f.write(entry)


skill_path = Path("skill.md")
print("Starting improvement loop...")

for i in range(1, MAX_ITERATIONS + 1):
    print(f"\n--- Iteration {i} ---")
    results = run_evals()
    pass_rate = results["pass_rate"]
    print(f"Pass rate: {pass_rate:.1%} ({results['passed']}/{results['total']})")

    if pass_rate >= TARGET_PASS_RATE:
        print("Target reached. Stopping.")
        log_iteration(i, pass_rate, "TARGET REACHED")
        break

    current_skill = skill_path.read_text()
    improved = improve_skill(current_skill, results)
    skill_path.write_text(improved)
    log_iteration(i, pass_rate)

    print("skill.md updated. Pausing before next iteration...")
    time.sleep(2)
else:
    # for-else: runs only if the loop completed without hitting `break`
    final = run_evals()
    print(f"\nMax iterations reached. Final pass rate: {final['pass_rate']:.1%}")

Running Overnight

Start it in a tmux session so it survives terminal closure:

tmux new -s skill-loop
source venv/bin/activate
python improve_skill.py
# Detach: Ctrl+B then D

Or with nohup:

nohup python improve_skill.py > improve_output.log 2>&1 &
echo $! > loop_pid.txt

Check results in the morning:

cat improvement_log.md
python -m json.tool eval/results/latest.json | head -30
git diff skill.md  # see all changes since initial commit

Letting Claude Code Handle the Whole Thing

If you have Claude Code installed, you can hand it the task directly:

claude "Read CLAUDE.md for context. Run the eval loop: execute python eval/run_evals.py, analyze failures, improve skill.md, and repeat until pass_rate >= 0.95 or 10 iterations. Log each iteration to improvement_log.md. Never edit eval.json."

This gives you more nuanced improvement reasoning. Claude Code reads the failing tests in context, identifies semantic patterns across failures, and makes principled changes — not just mechanical rewrites.


Deploying Your Optimized Skill with MindStudio

Once the eval loop converges, you have something concrete: a system prompt with a documented pass rate, a log of every revision, and version control showing what changed between iterations. That’s a production-ready prompt.

The natural next step is giving it a production runtime. MindStudio lets you paste your optimized skill.md content directly into an AI worker’s system prompt, then connect it to real inputs — a Slack message, an email inbox, a Zendesk webhook — without writing deployment code.

MindStudio handles the infrastructure: rate limiting, retries, authentication with connected tools, and logging. Your optimized prompt gets a full runtime in minutes. For teams running evals against live skills, MindStudio’s workflow scheduling features let you trigger periodic eval runs automatically and surface regressions before they reach users.

If you’re building multiple skills in parallel — each with its own eval loop — MindStudio also gives you a unified workspace to manage all of them, monitor performance, and deploy updates without context-switching between codebases.

You can start free at mindstudio.ai.


Common Pitfalls and How to Avoid Them

A few issues come up repeatedly when people set up these loops for the first time.

Evals that are too easy. If your initial pass rate is above 85%, your test cases aren’t challenging enough. Good evals start with a pass rate around 50–70%. Add genuinely ambiguous inputs, unusual phrasings, and tricky edge cases.

Contradictory test cases. If test A expects BILLING for a message and test B expects TECHNICAL for a nearly identical one, the loop will oscillate. Review your cases for logical consistency before running anything overnight. A contradiction in your evals isn’t a prompt problem — it’s an eval problem.

Skipping version control. Run git commit skill.md before every iteration or at least before starting the loop. Loops occasionally produce regressions, and without version history you have no rollback path.
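One way to automate that safety net is to commit from inside the loop itself. A sketch — the `commit_skill` helper is an assumption, not part of the scripts above — you could call right after each `skill_path.write_text(improved)`:

```python
import subprocess

def commit_skill(iteration: int, pass_rate: float) -> None:
    """Snapshot skill.md so any regression has a one-command rollback."""
    subprocess.run(["git", "add", "skill.md"], check=True)
    subprocess.run(
        ["git", "commit", "--allow-empty", "-m",
         f"iteration {iteration}: pass_rate={pass_rate:.1%}"],
        check=True,
    )

# roll back a bad iteration later with:
#   git checkout <commit-hash> -- skill.md
```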

No iteration or cost limit. With 30 test cases and 10 iterations, you’re looking at roughly 300 eval API calls plus improvement calls — manageable. Without a hard cap, an unexpected loop condition can run much longer. Always set MAX_ITERATIONS.
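That back-of-the-envelope count is worth doing explicitly as a sanity check before you start:

```python
cases, iterations = 30, 10
eval_calls = cases * iterations     # one API call per test case per iteration
improve_calls = iterations          # one rewrite call per iteration
total = eval_calls + improve_calls
print(total)  # 310 calls in the worst case (the loop never hits the target early)
```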

Optimizing for evals, not production. This is the hardest one to guard against. If your test cases are too synthetic, the prompt will get very good at passing them while behaving oddly on real inputs. Seed at least 30–40% of your eval cases with actual examples from your use case.


Frequently Asked Questions

What is a binary assertion in AI evaluation?

A binary assertion is a true/false check applied to an AI output. It either passes or fails — there’s no partial scoring. Examples include checking whether output contains a specific string, matches a regex pattern, stays under a character limit, or matches an expected value exactly. Binary assertions are preferred for automated eval loops because they’re fast, deterministic, and don’t require a secondary model call to score.

How many eval cases do I need?

Twenty to fifty is a practical starting range. That’s enough to cover core behaviors, meaningful edge cases, and format constraints without making each loop iteration slow or expensive. As the skill matures and goes into production, add cases based on real failures you observe. Evals should grow with the skill, not stay frozen at what you imagined when you first wrote it.

Can Claude Code run this loop without any human input?

Yes. In agentic mode, Claude Code can read files, execute scripts, analyze output, and write updated files — all without confirmation steps, especially when run with --dangerously-skip-permissions in a controlled local environment. Using tmux or nohup keeps the process running after you close your terminal. The key is setting a clear stopping condition upfront so the loop terminates cleanly on its own.

What’s the difference between binary assertions and LLM-as-judge scoring?

Binary assertions use deterministic code to check output. LLM-as-judge makes a second API call with a yes/no evaluation prompt, which works better for semantic quality, tone, or reasoning — things that are hard to express as a string match. The tradeoff: LLM-as-judge is slower, more expensive, and introduces some variability between runs. Start with binary assertions for everything you can. Use LLM-as-judge only for cases where code genuinely can’t capture what you care about.

What if the pass rate gets stuck and stops improving?

A stuck pass rate usually points to one of three causes: contradictory test cases (the prompt literally can’t satisfy both at once), a behavior the underlying model handles poorly regardless of how you prompt it, or a prompt that’s already at the ceiling for what binary assertions can measure. Start by reviewing your failing tests for logical contradictions. If there aren’t any, try simplifying the task definition in skill.md. If pass rate still won’t move, the issue may be the model’s capability rather than the prompt.

How long does a typical self-improvement loop take?

With 30 eval cases, each iteration takes roughly 2–5 minutes at typical API speeds, depending on model and output length. Ten iterations runs in 20–50 minutes — well within a single overnight window. Add time.sleep(1) or time.sleep(2) between API calls during eval runs to avoid hitting rate limits on heavier test suites.
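For heavier suites, a retry wrapper with exponential backoff is more robust than a fixed sleep. A minimal sketch — `with_retry` is illustrative — that you could wrap around the messages.create calls:

```python
import time

def with_retry(call, attempts: int = 4, base_delay: float = 2.0):
    """Retry a flaky API call, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
```

In practice you'd narrow the `except` to the rate-limit error your client library raises rather than catching everything.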


Key Takeaways

  • Write evals before optimizing. They’re the definition of success. A skill is only as good as the criteria it’s measured against.
  • Binary assertions keep the loop fast and reproducible. Code-based checks don’t introduce model variability into your scoring.
  • Claude Code does real reasoning in the improvement step. It identifies root causes across multiple failures, not just patches individual test cases one at a time.
  • Always set hard limits on iterations and cost. Know roughly what the loop will cost before you start. Cap it.
  • Seed evals with real inputs. Synthetic cases get you started. Production examples make evals meaningfully harder — and your skill meaningfully better.

Once the loop converges, your optimized skill.md is ready for production. Deploying it in MindStudio gives you the runtime layer — integrations, scheduling, logging, and real user inputs — without additional engineering work. Keep running eval loops as your requirements evolve and you’ll have a continuous improvement process, not just a one-time optimization.