
How to Build a Self-Improving AI Agent That Learns From Its Own Mistakes

Self-healing agent loops use diagnostic feedback to iterate and improve over time. Here's how to build one using Claude Code and a custom benchmark harness.

MindStudio Team

The Problem With Agents That Don’t Learn

Most AI agents have a short memory. They run, produce output, maybe succeed or fail — and then the next run starts completely fresh. Every mistake gets repeated. Every edge case surprises them again. There’s no accumulation of knowledge across sessions.

If you’ve built an AI agent, you’ve probably hit this wall. You can improve the prompt manually, or tune the model, but the agent itself isn’t getting better on its own. It’s doing the same thing over and over with no feedback loop.

That changes when you build a self-healing agent loop — a system where the agent runs, gets scored, diagnoses its own failures, and updates its behavior for next time. This article walks through how to build one using Claude Code and a custom benchmark harness. It’s not theoretical. You can set it up today.


What a Self-Improving Agent Loop Actually Is

A self-improving agent loop has four parts:

  1. A task — something the agent does on every run (write code, summarize text, classify inputs, generate copy)
  2. An evaluation harness — a structured way to score the output
  3. Diagnostic feedback — a mechanism that turns a bad score into a specific reason for failure
  4. A learning store — a persistent file or database the agent reads at the start of each run

The agent doesn’t just run and move on. It runs, gets scored, identifies what went wrong, and writes that lesson somewhere permanent. Next time it runs, it reads those lessons first.

This is the core idea behind the AutoResearch loop — popularized by Andrej Karpathy and increasingly applied to production AI workflows. The loop makes compounding possible. Each iteration builds on the last.

Without this structure, an agent plateaus immediately. With it, the agent keeps improving for as long as you run it.
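
In file terms, the loop this guide builds out looks roughly like this (an illustrative layout; the names match the steps below, but nothing about them is required):

fixtures/      # test cases the agent runs against on every cycle
eval.json      # binary pass/fail checks that score each output
results/       # rolling log of outputs and scores per run
learnings.md   # persistent lessons the agent reads before every run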


Why Most Agents Stay Stuck

Before building the loop, it’s worth understanding why agents fail to improve on their own.

The most common AI agent failure patterns fall into a few categories:

  • Reasoning-action disconnect — the model knows what to do but takes the wrong action anyway
  • Missing context — the agent lacks information it needs but doesn’t surface that gap
  • Prompt drift — the task evolves but the prompt doesn’t
  • No feedback signal — the agent completes a task but never learns whether it was good

The last one is the most fixable. Most agents simply don’t have a way to receive structured feedback. They’re not scored. There’s no harness. Nothing writes “here’s what you got wrong” anywhere the agent can read it.

That’s the gap this build fills.


Step 1: Build Your Benchmark Harness

The harness is the infrastructure that wraps your agent. It doesn’t change what the agent does — it controls how the agent runs, what inputs it receives, and how its outputs get scored.

Think of it like a testing framework. Your agent is the code under test. The harness is the test runner.

Harness engineering is increasingly recognized as a distinct discipline — one that sits above prompt engineering and context engineering. A good harness makes your agent dramatically more reliable because it removes the variables that cause inconsistent results.

A minimal harness has three components:

Input Fixtures

These are predefined test cases your agent runs against on every eval cycle. They should cover normal cases, edge cases, and known failure modes.

Store them in a fixtures/ folder as JSON files:

{
  "id": "test_001",
  "input": "Summarize this product review in one sentence.",
  "context": "...",
  "expected_properties": [
    "contains_sentiment",
    "under_30_words",
    "no_first_person"
  ]
}

The more specific your fixtures, the more useful your evals will be.

A Run Script

This script loops over each fixture, passes it to your agent, captures the output, and stores both together for evaluation.

# Compute the run file once so every output from this run lands in the same file
run_file="results/run_$(date +%s).jsonl"

for fixture in fixtures/*.json; do
  output=$(claude -p "$(cat "$fixture")" --output-format json)
  echo "$output" >> "$run_file"
done

A Results Store

Keep a rolling log of outputs per fixture across runs. This lets you track improvement over time, not just pass/fail on the current run.
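
One way to make that log queryable per fixture (a sketch, assuming jq is installed; the field names are arbitrary) is to replace the bare echo in the run script with a tagged append:

# Set once per run, alongside run_file
run_id=$(date +%s)

# Inside the run loop: tag each output with the run id and the fixture id before appending
jq -n --arg run "$run_id" --arg id "$(basename "$fixture" .json)" --argjson out "$output" \
  '{run: $run, fixture: $id, output: $out}' >> results/history.jsonl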


Step 2: Write Binary Evals

Evals are how you score the agent’s output. The most reliable evals are binary — each one returns either pass or fail, with no subjective middle ground.

Binary assertions are more useful than subjective scoring for a simple reason: you can’t build a feedback loop from a score of “7 out of 10.” You can build one from a list of pass/fail checks that tell you exactly which properties the output did or didn’t have.


Good binary evals check things like:

  • Does the output contain required keywords or phrases?
  • Is the output under a specific word count?
  • Does the output match a required format (JSON, markdown, specific structure)?
  • Are there any disallowed phrases or patterns?
  • Does the output reference the input correctly?

Store these in an eval.json file:

{
  "evals": [
    {
      "id": "length_check",
      "description": "Output is under 50 words",
      "type": "word_count",
      "check": "< 50"
    },
    {
      "id": "no_first_person",
      "description": "No first-person pronouns",
      "type": "not_contains",
      "check": ["I ", "I'm", "my "]
    }
  ]
}
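
The checks themselves can live in whatever runner you like (the run_evals.js script referenced later, for example). As a rough shell sketch of the two checks above, assuming a single output has already been extracted to a plain text file:

# Apply the two checks above to one output (output.txt is a hypothetical extracted output)
output=$(cat output.txt)
word_count=$(echo "$output" | wc -w | tr -d ' ')

[ "$word_count" -lt 50 ] && echo "length_check: pass" || echo "length_check: fail"

if echo "$output" | grep -Eq "(^| )I |I'm|(^| )my "; then
  echo "no_first_person: fail"
else
  echo "no_first_person: pass"
fi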

If you’re new to writing evals, this practical guide for non-engineers covers the patterns clearly. For a deeper look at the eval loop itself, see how the AutoResearch eval loop scores AI skill quality with binary tests.


Step 3: Generate Diagnostic Feedback

Scoring is not the same as feedback. An agent that knows it scored 60% doesn’t know what to do differently. An agent that knows “test_004 failed because the output contained first-person pronouns and exceeded the word limit” has something to act on.

This is the diagnostic layer. After each eval run, you generate a structured summary of failures — not just which tests failed, but why, and what the output looked like when they did.

A simple diagnostic prompt to Claude:

You are reviewing eval results for an AI agent.

Here are the failed tests:
[FAILED_TESTS_JSON]

For each failure:
1. Identify the root cause
2. Describe what the agent did wrong
3. Write one specific instruction the agent should follow to avoid this failure

Output as JSON with fields: test_id, root_cause, instruction

This gives you structured, actionable lessons — not just a score.
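
You can fold this straight into the learnings update (as the Step 5 script does), or keep the diagnostics as their own artifact first. A minimal sketch of the standalone version, assuming the prompt above is saved to a prompts/diagnose.txt file (hypothetical path) and that scores.json exposes a failures array:

# Append the failing evals after the prompt (with the [FAILED_TESTS_JSON] placeholder line removed)
failures=$(jq '.failures' scores.json)

claude -p "$(cat prompts/diagnose.txt)

Failed tests:
$failures" > diagnostics.json

Keeping diagnostics as a separate file also makes deterministic deduplication possible later in the loop.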


Step 4: Write Lessons to a Persistent Learning Store

The agent can’t apply what it learns if that knowledge disappears when the session ends. You need a persistent store — a file the agent reads at the start of every run.

The standard approach is a learnings.md file. It’s a plain markdown file that accumulates lessons over time:

# Agent Learnings

## Output Formatting
- Never use first-person pronouns in summaries (failed test_004 on 2026-04-14)
- Keep all outputs under 50 words unless the task explicitly requires more

## Edge Cases
- When input contains no sentiment, default to neutral tone rather than inferring
- Numeric inputs should be formatted with commas for readability

This is how Claude Code’s auto-memory mechanism works in practice. The agent isn’t just storing outputs — it’s storing reasoning updates that change how it approaches future tasks.

The learnings loop is what turns a one-time improvement into compounding progress. Each run that fails produces a new entry. Each new entry changes agent behavior on the next run. Over dozens of iterations, the agent’s default behavior improves substantially.
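
Wiring the file in is a one-line change to the run script from Step 1: prepend its contents to whatever the agent is asked to do. A minimal sketch:

# Prepend accumulated lessons so every run starts from what previous runs learned
claude -p "$(cat learnings.md)

$(cat "$fixture")" --output-format json >> "$run_file"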

For a complete walkthrough of how to structure this file and hook it into a Claude Code skill, see how to build a self-learning Claude Code skill with a Learnings.md file.


Step 5: Connect the Loop


Now you wire it together. The full loop looks like this:

  1. Load learnings.md into the agent’s context
  2. Run the agent against all fixtures
  3. Score each output with your binary evals
  4. Diagnose failures using Claude as an evaluator
  5. Append new lessons to learnings.md
  6. Commit the updated file and schedule the next run

A Claude Code implementation of steps 4 and 5 might look like:

# Run evals
node run_evals.js --results results/latest.jsonl --evals eval.json > scores.json

# Generate diagnostics and update learnings.
# Write to a temp file first so a failed run can't wipe the existing learnings.md.
claude -p "
You are the evaluator agent. Review these failed evals and update learnings.md.
Failed evals: $(jq '.failures' scores.json)
Current learnings: $(cat learnings.md)

Add only new, specific lessons. Do not repeat existing entries.
Write the full updated learnings.md content.
" > learnings.md.new && mv learnings.md.new learnings.md

This is a minimal implementation. A production version would add deduplication logic, version the learnings file, and track which lessons actually improved scores over subsequent runs.
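
Deduplication, for example, doesn’t have to rely on the model honoring the “do not repeat” instruction. A deterministic sketch, assuming a diagnostics.json structured like the Step 3 output (an array of test_id / root_cause / instruction objects):

# Append only instructions that don't already appear somewhere in learnings.md
jq -r '.[].instruction' diagnostics.json | while read -r lesson; do
  grep -qF "$lesson" learnings.md || echo "- $lesson" >> learnings.md
done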


Step 6: Automate It on a Schedule

A self-improving agent that only runs when you remember to trigger it isn’t truly autonomous. The final step is putting the loop on a schedule.

GitHub Actions is a natural fit for this. You can set up a workflow that runs on a cron schedule, executes the full loop, and commits the updated learnings.md back to the repo automatically.

name: Agent Self-Improvement Loop
on:
  schedule:
    - cron: '0 2 * * *'  # Run at 2am daily

# The default GITHUB_TOKEN needs write access to push the updated learnings file
permissions:
  contents: write

jobs:
  improve:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run eval and improvement loop
        run: bash scripts/run_improvement_loop.sh
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Commit updated learnings
        run: |
          git config user.email "agent@ci.example.com"
          git config user.name "Agent Loop"
          git add learnings.md results/
          # Only commit and push if the loop actually changed something
          git diff --cached --quiet || (git commit -m "Auto: improvement loop $(date +%Y-%m-%d)" && git push)
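
The workflow assumes a scripts/run_improvement_loop.sh that stitches Steps 1 through 5 together. A minimal sketch, reusing the pieces above:

#!/usr/bin/env bash
# scripts/run_improvement_loop.sh: one full pass of the loop (sketch)
set -euo pipefail
mkdir -p results
touch learnings.md
run_file="results/run_$(date +%s).jsonl"

# Run the agent against every fixture with current learnings in context
for fixture in fixtures/*.json; do
  claude -p "$(cat learnings.md)

$(cat "$fixture")" --output-format json >> "$run_file"
done

# Score the outputs with the binary evals
node run_evals.js --results "$run_file" --evals eval.json > scores.json

# Diagnose failures and fold new lessons into learnings.md
claude -p "You are the evaluator agent. Review these failed evals and update learnings.md.
Failed evals: $(jq '.failures' scores.json)
Current learnings: $(cat learnings.md)

Add only new, specific lessons. Do not repeat existing entries.
Write the full updated learnings.md content." > learnings.md.new && mv learnings.md.new learnings.md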

You can also build self-improving AI agents with scheduled tasks using other orchestration platforms if GitHub Actions isn’t your preference.


A Note on Eval Design and Gaming

One risk worth flagging: agents can learn to game their own evals if those evals are too narrow.

If your only eval checks for word count, the agent will optimize for word count — potentially at the expense of quality. This is a real phenomenon. There are documented cases of AI models gaming benchmark tests rather than solving the underlying task.

The defense is eval diversity. Your eval suite should check multiple independent properties, some of which are harder to fake. Include at least one eval that checks a higher-order property (e.g., “does this answer correctly address the input?”) alongside the mechanical checks.
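
One way to keep that higher-order check binary is a second, separate model call that is only allowed to answer PASS or FAIL. A sketch, with input.txt and output.txt as hypothetical captured files:

# A higher-order binary eval: does the output actually address the input?
verdict=$(claude -p "Input:
$(cat input.txt)

Output:
$(cat output.txt)

Does the output correctly address the input? Reply with exactly PASS or FAIL.")

echo "$verdict" | grep -q "PASS" && echo "relevance_check: pass" || echo "relevance_check: fail"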

Regularly add new evals based on failure modes you observe in production. The eval suite should grow as the agent improves — otherwise the agent will hit a ceiling defined by your test coverage, not its actual capability.


How Remy Fits Into This

Remy is a spec-driven development environment that compiles annotated markdown specs into full-stack apps. If you’re building the eval harness, the learnings loop, or the scheduling infrastructure described in this guide, Remy can compile all of that from a spec document rather than requiring you to wire up every component manually.

The relevant piece here: Remy’s architecture treats code as a compiled artifact. Your spec describes what the system does — the inputs, the eval logic, the feedback format, the scheduling behavior. Remy generates the TypeScript backend, the database schema for storing run history, and the deployment config.

That means when you want to evolve the harness — add a new eval type, change how diagnostics are structured, update the scheduling logic — you edit the spec. The code follows.

For teams already building self-improving workflows in Claude Code, Remy adds a structural layer that keeps the harness itself from becoming technical debt. You can get started at mindstudio.ai/remy.


Frequently Asked Questions

What is a self-improving AI agent?

A self-improving AI agent is one that uses structured feedback from previous runs to update its own behavior over time. Rather than staying static, it collects information about what went wrong, stores it in a persistent format, and applies those lessons on subsequent runs. The result is an agent whose performance improves over many iterations without requiring manual tuning.

What’s the difference between a learnings loop and fine-tuning?

Fine-tuning updates a model’s weights through additional training — it’s expensive, slow, and requires significant data. A learnings loop is faster and lighter: it stores behavioral instructions in a file that gets injected into the model’s context on each run. You’re not changing the model; you’re changing what the model knows before it starts. The learnings loop pattern is accessible to anyone who can write a shell script, not just teams with ML infrastructure.

How do I know if my evals are good enough?

Your evals are good enough when they consistently distinguish between good and bad outputs in a way that aligns with what you actually care about. A quick test: if your agent started producing outputs that you’d consider excellent, would all your evals pass? If not, you’re missing evals. If your agent produced outputs you’d consider terrible but all evals still passed, your evals are measuring the wrong things.

Can this approach work for non-coding tasks?

Yes. The feedback loop pattern applies to any repeated task with measurable output properties: content generation, data extraction, classification, customer communication, report writing. The key is defining binary evals that capture what “good” looks like for your specific task. Self-improving systems for marketing and content creation follow exactly the same architecture described here.

How many iterations does it take to see improvement?

It depends on the task complexity and eval coverage, but meaningful improvement is usually visible within 5–10 iterations for narrow tasks. The agent should be passing more evals on each run as long as the learnings are specific and the diagnostics are accurate. If scores plateau, it usually means your evals aren’t granular enough or the failure root causes aren’t being diagnosed correctly.

Do I need a separate evaluator model or can the agent evaluate itself?

Using a separate evaluator (or a separate Claude call in evaluator mode) is more reliable than having the agent evaluate its own output in the same session. Self-evaluation in the same context window tends to be overconfident. The builder-validator chain pattern separates generation from evaluation explicitly — the generator produces output, the validator scores it independently.


Key Takeaways

  • A self-improving AI agent requires four components: a task, a benchmark harness, diagnostic feedback, and a persistent learning store.
  • Binary evals are more actionable than subjective scoring — they tell the agent exactly what to fix.
  • The learnings.md file is the core mechanism: it persists lessons across sessions so every run builds on the last.
  • Diagnostic feedback needs to produce specific, actionable instructions — not just a failure flag.
  • Scheduling the loop with GitHub Actions or another cron mechanism makes improvement continuous rather than manual.
  • Eval diversity prevents the agent from gaming its own tests — keep adding new evals as the agent improves.

If you want to build the harness infrastructure without writing every component from scratch, try Remy — it compiles a full-stack app from a spec, including the backend logic, database, and scheduling layer your feedback loop needs.

Presented by MindStudio
