How to Build Self-Improving AI Skills with Binary Evals and Claude Code
Use binary true/false assertions and an autonomous loop to improve your Claude Code skills overnight without manual tweaking. Step-by-step guide.
The Problem With Manually Tuning AI Skills
Every AI builder eventually hits the same wall. You have a skill — a prompt, a chain, a structured output template — that almost works. It handles 80% of cases well, but the edge cases are annoying, inconsistent, or just wrong. So you tweak the prompt. Test it manually. Tweak again. Test again.
That cycle is slow, subjective, and hard to scale. You’re essentially doing quality control by eyeballing outputs, which means the quality of your AI skill is limited by how much time you’re willing to spend manually reviewing it.
Binary evals and Claude Code offer a different approach. Instead of tweaking manually, you write explicit pass/fail assertions about what good output looks like, then let an autonomous agent run the improvement loop for you — including overnight while you sleep.
This guide walks through the exact process: what binary evals are, how to write them, and how to wire Claude Code into a self-improving loop that iterates your AI skills until they pass every test.
What Binary Evals Are (And Why Simple Beats Complex)
An evaluation (eval) in AI development is a test that measures whether an AI system’s output meets a defined standard. You’ve probably seen scored evals: rating outputs 1–10, measuring BLEU scores, or running human preference rankings.
Binary evals are simpler. Each assertion returns exactly one of two values: true or false. Either the output passed the test, or it didn’t.
Why Binary?
The appeal comes down to three things.
Clarity. A scored eval gives you a 7 out of 10, and you’re left asking whether that’s good enough. A binary eval tells you “this output failed to include a date” — unambiguous, actionable.
Automatability. Pass/fail logic is trivial to automate. You can chain dozens of binary assertions, run them programmatically, and get a clear overall status for any given output.
Debuggability. When a binary eval fails, you know exactly which criterion wasn’t met. When a score drops from 8.2 to 7.4, you often don’t know why.
What a Binary Eval Looks Like in Practice
Suppose your AI skill generates a weekly status report. Your binary assertions might be:
- Does the output contain a section labeled “Blockers”? → true/false
- Is the total word count under 400? → true/false
- Does the output include at least one action item? → true/false
- Is the output valid Markdown? → true/false
- Does the output avoid filler phrases like “circle back”? → true/false
Each of these is a function that takes the AI’s output as a string and returns a boolean. Simple, fast, composable.
def has_blockers_section(output: str) -> bool:
    return "## Blockers" in output or "**Blockers**" in output

def under_word_limit(output: str) -> bool:
    return len(output.split()) < 400

def has_action_item(output: str) -> bool:
    return "action item" in output.lower() or "- [ ]" in output
Nothing fancy. No ML models required — just deterministic string checks that work reliably every time.
When You Need an LLM Judge
Some qualities can’t be checked with string matching. Is the tone professional? Is the reasoning coherent? For these, you can use an LLM as a judge — pass the output to a fast, cheap model with a simple yes/no prompt: “Does the following output maintain a professional tone? Answer only YES or NO.”
This is still binary (yes/no), but it handles semantic quality checks that regular code can’t. Use LLM-as-judge evals sparingly — they add latency and cost — but they’re valuable for catching hallucinations, off-topic responses, or tone drift.
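As a sketch of what an LLM-as-judge binary eval can look like, here is one possible shape, assuming the anthropic Python SDK; the function names, the judge model, and the yes/no parsing are illustrative choices, not a canonical API:

```python
# Sketch of a binary LLM-as-judge eval. Assumes the `anthropic` SDK is
# available; the model name and helper names here are assumptions.

def parse_yes_no(text: str) -> bool:
    """Map a YES/NO judge reply onto a boolean, tolerating case and whitespace."""
    return text.strip().upper().startswith("YES")

def professional_tone(output: str, client=None) -> bool:
    """Binary judge eval: ask a fast model a single yes/no question."""
    import anthropic  # imported lazily so string-only evals don't need the SDK
    client = client or anthropic.Anthropic()
    message = client.messages.create(
        model="claude-haiku-4-5",  # a fast, cheap judge model (assumption)
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": "Does the following output maintain a professional tone? "
                       f"Answer only YES or NO.\n\n{output}",
        }],
    )
    return parse_yes_no(message.content[0].text)
```

Keeping the judge prompt to one yes/no question and capping max_tokens makes the reply trivially parseable, which keeps the eval binary.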
How the Self-Improving Loop Works
The core architecture is straightforward:
- You have an AI skill (a prompt, a structured template, a chain of instructions).
- You have a suite of binary assertions defining what “good” looks like.
- Claude Code runs the skill against a set of test inputs and evaluates each output.
- For any failing assertions, Claude Code identifies the pattern and modifies the skill.
- It runs the evals again.
- It keeps looping until all assertions pass — or until it hits a maximum iteration count.
This is test-driven development applied to prompt engineering. You write the tests first, then let an autonomous agent write and rewrite the “code” (the skill/prompt) until the tests pass.
The key insight: Claude Code doesn’t need you in the loop. Once you’ve set up the assertions and pointed it at the skill file, it can run for hours — trying different prompt phrasings, adding explicit formatting instructions, adjusting output constraints — without any human input.
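The control flow of that loop can be sketched in a few lines of ordinary Python. This is an illustration of the pattern, not Claude Code's internals; the callables are placeholders, where evaluate returns a failure report and revise edits the skill:

```python
# Minimal sketch of the improvement loop's control flow (illustrative names).

def improvement_loop(evaluate, revise, max_iters: int = 40):
    """Iterate until the eval suite passes or the iteration cap is hit."""
    for iteration in range(1, max_iters + 1):
        report = evaluate()      # run the skill against test inputs + assertions
        if report["fail_count"] == 0:
            return {"status": "complete", "iterations": iteration}
        revise(report)           # the agent edits the skill based on failures
    return {"status": "capped", "iterations": max_iters}
```

The iteration cap is the important part: without it, a mutually unsatisfiable eval suite would loop forever.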
What “Overnight” Actually Means
The overnight framing is genuine. Claude Code’s agentic mode is designed for extended autonomous runs. You can kick off an improvement loop before you go to sleep, set a reasonable iteration cap (say, 50 loops), and wake up to a version of your skill that has been refined against your actual test cases — repeatedly, without fatigue.
Total improvement time scales with iteration count and assertion complexity. Simple formatting issues often resolve in 2–3 iterations. Harder semantic problems — like ensuring outputs always recommend a concrete next step — might take 10–20.
Prerequisites: What You Need Before You Start
Before setting up the loop, make sure you have these in place:
Claude Code installed. Claude Code runs as a CLI tool. You’ll need an Anthropic API key or Claude subscription, plus the Claude Code package installed in your environment.
A target AI skill. This is the prompt, template, or agent configuration you want to improve. It should already live in a file — a .txt prompt file, a .json config, a Python function, whatever format your stack uses.
A set of test inputs. Binary evals need examples to run against. Create 15–30 representative inputs covering the range of cases your skill handles. Include edge cases.
A testing harness. A simple script that: (a) passes each test input to your skill, (b) collects the outputs, (c) runs each binary assertion, and (d) returns a pass/fail summary. This can be 50 lines of Python.
Version control. Set up git for the skill file. Claude Code will be modifying it repeatedly. You want to be able to roll back to any previous version and see exactly what changed.
Step 1: Define Your AI Skill as a File
The self-improving loop works best when your skill is a discrete, editable artifact. If your skill is a single system prompt, save it as skill_prompt.txt. If it’s a more complex configuration, save it as skill_config.json or skill.py.
Claude Code needs to be able to read the skill, understand it, and write a new version. A single file with clear structure makes this reliable.
For a prompt-based skill, your file might look like this:
# Weekly Status Report Generator
You are a project management assistant. When given a list of tasks
and updates, produce a weekly status report.
Format requirements:
- Use Markdown
- Include sections: Summary, Completed This Week, In Progress, Blockers, Next Steps
- Keep total length under 400 words
- Use plain, professional language
Task list:
{{TASK_LIST}}
The {{TASK_LIST}} placeholder gets filled in by your testing harness when running evals.
Step 2: Write Your Binary Assertions
Open a file called evals.py. Write one function per assertion. Each function takes the AI output as its first argument and returns True or False.
import re

def has_required_sections(output: str) -> bool:
    required = ["## Summary", "## Completed", "## Blockers", "## Next Steps"]
    return all(section in output for section in required)

def within_word_limit(output: str) -> bool:
    return len(output.split()) <= 400

def is_valid_markdown(output: str) -> bool:
    has_heading = output.strip().startswith("#") or "\n#" in output
    no_raw_html = "<div" not in output and "<span" not in output
    return has_heading and no_raw_html

def no_filler_phrases(output: str) -> bool:
    banned = ["synergy", "leverage", "circle back", "touch base"]
    return not any(phrase in output.lower() for phrase in banned)

def has_concrete_next_step(output: str) -> bool:
    pattern = r"(by|before|owner:|assigned to:|\d{1,2}/\d{1,2})"
    section = output.split("## Next Steps")[-1] if "## Next Steps" in output else ""
    return bool(re.search(pattern, section, re.IGNORECASE))
Then write a runner that aggregates all results:
def run_eval_suite(output: str) -> dict:
    evals = [
        has_required_sections,
        within_word_limit,
        is_valid_markdown,
        no_filler_phrases,
        has_concrete_next_step,
    ]
    return {fn.__name__: fn(output) for fn in evals}

def all_pass(output: str) -> bool:
    return all(run_eval_suite(output).values())
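To see the aggregation pattern in isolation, here is a self-contained miniature of the same idea, using two tiny stand-in assertions rather than the full suite:

```python
# Self-contained miniature of the eval-suite aggregation pattern:
# each assertion is a named function, and the suite maps names to booleans.

def non_empty(output: str) -> bool:
    return bool(output.strip())

def mentions_blockers(output: str) -> bool:
    return "blockers" in output.lower()

def run_mini_suite(output: str) -> dict:
    evals = [non_empty, mentions_blockers]
    return {fn.__name__: fn(output) for fn in evals}

report = run_mini_suite("## Blockers\n- waiting on API keys")
# Each key is an assertion's function name; each value is True/False.
```

Because the report keys are function names, a failing assertion is immediately traceable to the exact criterion that was violated.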
Step 3: Build the Testing Harness
Your testing harness ties the skill to the evals. It loads the current skill, runs it against each test input, and prints a clear failure report.
import anthropic
from evals import run_eval_suite

client = anthropic.Anthropic()

def run_skill(skill_prompt: str, task_input: str) -> str:
    filled = skill_prompt.replace("{{TASK_LIST}}", task_input)
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": filled}],
    )
    return message.content[0].text

def evaluate_skill(skill_path: str, test_inputs: list) -> dict:
    with open(skill_path) as f:
        skill = f.read()
    results = {"pass_count": 0, "fail_count": 0, "failures": []}
    for i, test_input in enumerate(test_inputs):
        output = run_skill(skill, test_input)
        eval_results = run_eval_suite(output)
        failed = [name for name, passed in eval_results.items() if not passed]
        if failed:
            results["fail_count"] += 1
            results["failures"].append({
                "test_input_index": i,
                "failed_assertions": failed,
                "output_sample": output[:300],
            })
        else:
            results["pass_count"] += 1
    return results
Save this as harness.py. This gives Claude Code a structured, readable report of what’s failing and where.
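If you want the report to be even easier for the agent to scan in a terminal, a small rendering helper works well. This helper is an optional addition, not part of the harness above, and its name is illustrative:

```python
# Optional helper (illustrative, not part of the harness itself): render the
# results dict as a compact plain-text report for an agent to read.

def format_failure_report(results: dict) -> str:
    lines = [f"PASS: {results['pass_count']}  FAIL: {results['fail_count']}"]
    for failure in results["failures"]:
        lines.append(
            f"- input #{failure['test_input_index']}: "
            + ", ".join(failure["failed_assertions"])
        )
    return "\n".join(lines)
```

Printing this at the end of the harness gives each iteration a one-glance summary of which assertions failed on which inputs.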
Step 4: Configure Claude Code’s Improvement Loop
Now write a CLAUDE.md file — the configuration file Claude Code reads to understand its task. This is where you define the autonomous loop.
# Skill Improvement Task
## Your Goal
Improve the AI skill in `skill_prompt.txt` until it passes all
binary assertions in `evals.py` across all test inputs.
## Workflow
1. Run `python harness.py` to get current eval results
2. Read the failure report carefully
3. Open `skill_prompt.txt` and identify what changes would fix
the failing assertions
4. Edit `skill_prompt.txt` to address those failures
5. Run `python harness.py` again to check progress
6. Repeat until all assertions pass
7. When all pass, print "IMPROVEMENT COMPLETE" and stop
## Rules
- Only modify `skill_prompt.txt`. Never change `evals.py` or
`harness.py`.
- Make one logical change at a time
- After each edit, run the harness before making another change
- Add a brief comment at the top of the skill file noting
what iteration you're on and what changed
- Maximum iterations: 40. If not resolved by then, stop and
write a summary of remaining failures and why they're stuck
Save this in your project root. Then start Claude Code:
claude -p "Follow CLAUDE.md to improve the skill. Run autonomously."
Claude Code will read the configuration, start running the harness, and begin iterating. Watch the first few cycles to confirm it’s working as intended, then leave it.
Step 5: Review Results in the Morning
When you return, you’ll find a modified skill_prompt.txt, a git history showing every iteration, and a final harness run showing all assertions passing.
Review the changes Claude Code made. Common patterns you’ll see:
- Added explicit format instructions — “Always use ## Blockers as the exact section header, not **Blockers** or Blockers:”
- Added negative constraints — “Do not use business jargon including synergy, leverage, or circle back”
- Tightened length guidance — “Aim for 250–350 words total” instead of a vague “be concise”
- Added few-shot examples — When format compliance is persistently failing, Claude Code sometimes adds a worked example directly to the prompt
These are exactly the kinds of specific, targeted changes that take humans hours of manual iteration to discover. The binary evals force precision — Claude Code can’t handwave around a failing assertion the way a human reviewer might.
Common Mistakes to Avoid
Writing assertions that are too easy. If your assertions only check that the output is non-empty or contains at least one word, the loop converges fast but the skill hasn’t meaningfully improved. Write assertions that reflect real quality requirements.
Writing assertions that conflict. If one assertion requires outputs under 200 words and another requires five detailed sections, Claude Code will loop indefinitely. Check that your assertions are mutually satisfiable before running.
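One cheap way to catch conflicts before an overnight run is a pre-flight check: hand-write one known-good "golden" output and confirm it satisfies every assertion. This is a suggested safeguard, not a step from the workflow above, and the names are illustrative:

```python
# Pre-flight sanity check (a suggested safeguard, names illustrative):
# if no hand-written golden output can pass every assertion, the suite
# may be mutually unsatisfiable and the loop will never converge.

def unsatisfied_by(reference_output: str, assertions) -> list:
    """Return the names of the assertions the reference output fails."""
    return [fn.__name__ for fn in assertions if not fn(reference_output)]

# Two deliberately conflicting assertions to illustrate:
def short(output: str) -> bool:
    return len(output.split()) <= 5

def detailed(output: str) -> bool:
    return len(output.split()) >= 50

# No 5-word output can also be 50+ words, so the check flags `detailed`.
conflicts = unsatisfied_by("a short golden output here", [short, detailed])
```

If the list comes back non-empty for your best hand-written example, fix the assertions before burning iterations on them.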
Using too few test inputs. Fewer than 10 inputs and the skill can overfit — it passes your specific tests but fails on production inputs. Include diverse examples: short tasks, long tasks, ambiguous tasks, inputs with unusual formatting.
Forgetting to lock the evals. Your CLAUDE.md must explicitly prohibit changes to evals.py. If Claude Code can modify the eval file, it’ll “fix” failures by making the tests easier, not the skill better.
No iteration cap. Open-ended loops get expensive if something is genuinely unsolvable. Always set a maximum.
Where MindStudio Fits Into Self-Improving Skill Pipelines
The binary eval loop is excellent at refining a skill’s prompt. But it leaves a gap: once your skill is improved, how does it run at scale — integrated with the tools and data sources your team actually uses?
That’s where MindStudio’s Agent Skills Plugin fits. The @mindstudio-ai/agent npm SDK lets Claude Code — and any other autonomous agent — call over 120 typed capabilities as simple method calls: agent.searchGoogle(), agent.sendEmail(), agent.runWorkflow(), agent.generateImage(), and more. The SDK handles rate limiting, retries, and authentication, so your improvement loop stays focused on reasoning about prompt quality rather than infrastructure.
During the improvement loop itself, this opens up richer testing. Instead of only static test inputs, you can wire agent.searchGoogle() into your harness to fetch fresh real-world examples, or use agent.runWorkflow() to test how your refined skill performs inside a live MindStudio automation workflow.
After improvement, your refined skill can be deployed as a MindStudio workflow — running on a schedule, triggered by emails or webhooks, connected to Slack, Notion, HubSpot, or any of the 1,000+ integrations available on the platform. You’re not maintaining a local script indefinitely — you’re promoting a battle-tested skill into an environment that handles production concerns for you.
You can start building for free at mindstudio.ai.
Frequently Asked Questions
What exactly is Claude Code?
Claude Code is Anthropic’s agentic coding tool that runs in your terminal. Unlike a standard chat interface, it has direct access to your file system, can run shell commands, execute code, and operate autonomously over extended periods. It understands full codebase context, can make multi-file edits, run tests, and iterate based on output — without step-by-step human guidance. It’s designed specifically for tasks that require chaining multiple actions together, which makes it well-suited for the improvement loop described in this guide.
How are binary evals different from traditional unit tests?
Traditional unit tests verify that code behaves correctly given specific inputs. Binary evals verify that AI outputs meet qualitative criteria — the right sections are present, word limits are respected, banned phrases are absent, tone is appropriate. The mechanics are similar: both return pass/fail. But the subject matter is different. You can combine them: use unit tests to verify the scaffolding code around your skill, and binary evals to verify the AI outputs themselves.
Can this work for skills beyond text generation?
Yes. The same pattern applies to any AI output you can evaluate programmatically. Code generation skills can be tested against a Python test suite — does the generated code pass pytest? Data extraction skills can be tested against schema validation — is the output valid JSON matching a defined schema? Classification skills can be tested against labeled examples. As long as you can write a function that takes the output and returns true/false, binary evals apply.
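For the data-extraction case, a binary eval can be written with only the standard library. The schema below (a name string and a numeric amount) is an invented example to show the shape:

```python
import json

# Binary eval for a data-extraction skill (illustrative schema, stdlib only):
# the output must be valid JSON with the expected keys and value types.

def is_valid_extraction(output: str) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and isinstance(data.get("name"), str)
        and isinstance(data.get("amount"), (int, float))
    )
```

The try/except matters: a malformed output should register as a clean False, not crash the harness mid-loop.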
How many test inputs do I actually need?
For most skills, 15–30 test inputs is a practical sweet spot. Too few (under 10) and you risk overfitting — the skill passes your tests but fails on real inputs. Too many (over 100) and each improvement iteration becomes slow and expensive. Focus on coverage: include typical cases, clear edge cases, and at least a few examples that specifically target each binary assertion you care about.
Is it safe to let Claude Code run autonomously overnight?
For the improvement loop described here, yes — with the right guardrails. Lock down what files Claude Code can modify (your CLAUDE.md should be explicit about this). Set an iteration cap. Use a separate git branch so you can review all changes before merging. If you’re cautious about unattended runs, watch the first 5–10 iterations to confirm the loop is behaving as expected, then step away.
Can I use other models instead of Claude Code for the improvement loop?
The loop itself is a pattern, not a tool requirement. You could implement it with any sufficiently capable coding agent. That said, Claude Code is well-suited here because of its strong instruction-following in agentic settings, its ability to reason about prompt quality, and its native file system access. If you’re already using another agent in your stack, adapt the CLAUDE.md config to that agent’s equivalent configuration format.
Key Takeaways
- Binary evals — true/false assertions about AI output quality — are more actionable and automatable than scored metrics.
- The self-improving loop pairs these evals with Claude Code’s autonomous mode: write tests, let the agent iterate the skill until everything passes.
- The overnight capability is real: set an iteration cap, start the loop before bed, wake up to a refined skill with a full git history of changes.
- Strong evals require good test inputs — 15–30 diverse examples covering typical cases and edge cases.
- Common mistakes (conflicting assertions, too few inputs, no iteration cap, unlocked evals) are easy to avoid if you think through your eval design before running.
- Once a skill is refined, platforms like MindStudio let you deploy it at scale with scheduling, integrations, and infrastructure already handled.
If you want to deploy the skills you’ve refined — or build the entire eval pipeline visually without writing code — MindStudio is worth a look. The free tier is fully functional and gets a workflow running in well under an hour.