How to Build Self-Improving AI Skills in Claude Code
Claude Code skills can get better over time with feedback loops and learnings files. Here's how to build skills that refine themselves with every use.
Why Most AI Skills Stay the Same No Matter How Many Times You Use Them
Most AI skills are static. You build them, you deploy them, and they produce roughly the same output in month six as they did on day one. That’s a problem, because your understanding of what “good output” actually looks like gets sharper with every use.
Self-improving AI skills in Claude Code close that gap. Instead of running in isolation and forgetting everything, they capture what worked, what didn’t, and what to do differently next time. The skill compounds knowledge across runs rather than starting fresh every time.
This guide covers how to build that feedback loop from scratch — the files you need, the patterns that hold it together, and the eval structures that tell your skill whether it’s actually improving.
What “Self-Improving” Actually Means in This Context
A self-improving skill isn’t retraining a model. It’s not fine-tuning. It’s something simpler and more practical: a skill that reads its own history before it runs, updates that history after it runs, and uses what it learned to do better the next time.
The mechanism works in three parts:
- A learnings file — a persistent document the skill reads at the start of every run, containing notes from previous executions.
- A feedback mechanism — human ratings, binary evals, or an automated scoring system that determines whether a run went well.
- A write-back step — a process that translates that feedback into updated notes in the learnings file.
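Wired together, the three parts form a single loop. Here is a minimal sketch, not Claude Code's actual runtime; `run_task` and `score_output` are placeholders for your skill's own logic:

```python
from pathlib import Path

def run_skill(task: str, learnings_path: str = "learnings.md") -> str:
    """One cycle of the read -> run -> score -> write-back loop."""
    path = Path(learnings_path)
    # 1. Read accumulated notes from previous runs (empty on the first run).
    learnings = path.read_text() if path.exists() else ""

    # 2. Execute the main task with the learnings as extra context.
    output = run_task(task, context=learnings)

    # 3. Score the run (binary evals, human rating, or both).
    feedback = score_output(output)

    # 4. Append a short observation so the next run starts smarter.
    with path.open("a") as f:
        f.write(f"- {feedback}\n")
    return output

# Placeholder hooks: replace with your skill's real task and evals.
def run_task(task: str, context: str) -> str:
    return f"output for {task!r} (context: {len(context)} chars of learnings)"

def score_output(output: str) -> str:
    return "run completed; output length looked reasonable"
```

The important property is step 4: every run leaves the file slightly richer than it found it.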
If you’ve read about the learnings loop in Claude Code, this is the concrete implementation behind that concept.
The Core Files You Need
Building self-improving skills in Claude Code requires a clear file structure. Here’s what you’re working with.
skill.md
This is your skill definition. It describes the process steps the skill follows — not the context, not the background knowledge, just the sequence of operations. Keeping your skill.md focused on process steps rather than cramming in reference material is what makes the whole system composable.
learnings.md
This is the memory file. It lives alongside your skill.md and gets read at the top of every run. It contains observations from past executions: what patterns produced good results, what edge cases tripped the skill up, what the model tends to get wrong.
A basic learnings.md looks like this:
```markdown
# Learnings

## What works
- Including the target audience explicitly in the prompt improves specificity
- Providing 2-3 examples of desired tone reduces generic outputs

## What to avoid
- Vague briefs produce vague content — always ask for clarification first
- Outputs exceeding 800 words consistently score lower on review

## Open questions
- Still unclear whether bullet-point format or prose performs better for B2B contexts
```
The key thing is that this file isn’t static. It changes after every run where something noteworthy happened.
eval.json
If you want more rigorous feedback than human rating alone, you need an eval file. This defines a set of binary tests the skill runs against its own output before finalizing. Binary evals give you a reliable way to score output quality without depending on subjective review.
A basic eval.json looks like this:
```json
{
  "evals": [
    {
      "name": "has_clear_headline",
      "description": "Output contains a headline in the first line",
      "type": "binary"
    },
    {
      "name": "word_count_within_range",
      "description": "Output is between 400 and 800 words",
      "type": "binary"
    },
    {
      "name": "includes_call_to_action",
      "description": "Output ends with a specific action for the reader",
      "type": "binary"
    }
  ]
}
```
Each assertion returns pass or fail. The skill logs the results, and that log becomes data the learnings system uses to identify patterns.
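As a sketch of what that scoring could look like outside the model, here is a small runner that loads an eval.json like the one above and applies naive placeholder checks (the check logic is assumed, not part of any Claude Code API):

```python
import json

def run_evals(output: str, eval_path: str = "eval.json") -> dict[str, bool]:
    """Score `output` against each binary assertion and return name -> pass."""
    with open(eval_path) as f:
        spec = json.load(f)

    # Naive placeholder checks keyed by eval name. In a real skill these
    # would be implemented (or judged by the model) per assertion.
    words = output.split()
    checks = {
        "has_clear_headline": bool(output.splitlines())
            and output.splitlines()[0].strip() != "",
        "word_count_within_range": 400 <= len(words) <= 800,
        "includes_call_to_action": output.rstrip().lower()
            .endswith(("today.", "now.", "started.")),
    }
    # Unknown eval names fail closed rather than passing silently.
    return {e["name"]: checks.get(e["name"], False) for e in spec["evals"]}
```

Failing closed on unknown names is a deliberate choice: a new assertion that nobody implemented should show up as a failure, not disappear.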
Step-by-Step: Building the Feedback Loop
Here’s how to wire these pieces together into a working self-improvement cycle.
Step 1: Create the Learnings File
Start with an empty (or minimal) learnings.md. You don’t need to prefill it with guesses about what will work. The file will grow from real experience.
Step 2: Update Your skill.md to Read From It
At the top of your skill.md, add a step that reads the learnings file before executing:
```markdown
## Process
1. Read `learnings.md` and internalize any notes relevant to this task
2. Execute the main task using those learnings as context
3. After completion, evaluate the output against `eval.json` criteria
4. If any evals fail or the output was flagged for review, append a note to `learnings.md` describing what happened and what to try next time
```
This is the loop. The skill reads its own history, uses it, then writes back to it.
Step 3: Define Your Evals
Write your eval.json with assertions that are truly binary — questions with a clear yes or no answer. The difference between binary assertions and subjective evals matters a lot here. Subjective criteria (“is the tone good?”) can’t be scored reliably by the model itself. Binary criteria can.
Good binary assertions:
- Does the output contain a specific required element?
- Is the output within the target length range?
- Does it avoid a specific banned phrase or format?
- Does it include a required section?
Bad binary assertions:
- Is this content engaging?
- Is the quality high?
- Does this feel appropriate?
Step 4: Add a Wrap-Up Step
After the skill runs and evals are scored, you need a dedicated step to process the results. This is sometimes called a wrap-up skill — a separate skill or skill step that handles learnings updates so the logic is clean and separated from the main task.
The wrap-up step:
- Reviews the eval results from this run
- Compares them to historical results in learnings.md
- Identifies any new patterns (consistently failing the same eval, for example)
- Writes a concise observation to learnings.md
Keep the observations short and actionable. One to three sentences per run is enough. The file shouldn’t become a novel — it should be a useful reference.
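A minimal sketch of such a wrap-up step, assuming the eval results arrive as a name-to-pass mapping (the note format here is an illustration, not a fixed convention):

```python
from datetime import date

def wrap_up(eval_results: dict[str, bool],
            learnings_path: str = "learnings.md") -> None:
    """Append a short, dated observation when any eval failed this run."""
    failed = [name for name, passed in eval_results.items() if not passed]
    if not failed:
        return  # nothing noteworthy; keep the file lean

    note = (
        f"- {date.today().isoformat()}: failed {', '.join(failed)}; "
        "adjust the approach for these checks next run.\n"
    )
    with open(learnings_path, "a") as f:
        f.write(note)
```

Skipping the write entirely on a clean run is one way to keep the file from becoming a novel.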
Step 5: Run, Review, and Let It Compound
The first few runs won’t look dramatically different. The system needs a few cycles to generate meaningful learnings. After five to ten runs, you’ll start seeing the skill adjust its approach based on what’s accumulated.
This is what the compounding knowledge loop in Claude Code actually looks like in practice — not a dramatic jump in quality, but a steady accumulation of context that shifts how the skill operates over time.
Using Human Feedback as Your Eval Signal
Binary evals work well when you can define clear quality criteria upfront. But sometimes the best signal is simpler: you read the output and decide whether it was good.
You can build this into the loop too. At the end of a run, add a prompt that asks for a rating or a short note. Something like:
```
Rate this output (1-5):
If anything was off, note it briefly:
```
That rating and note get appended to learnings.md. The skill reads it on the next run and adjusts accordingly.
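Capturing that rating programmatically might look like this sketch (the function name and note format are assumptions):

```python
def record_feedback(rating: int, note: str,
                    learnings_path: str = "learnings.md") -> None:
    """Append a human rating (1-5) and optional note to the learnings file."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    line = f"- human rating: {rating}/5"
    if note.strip():
        line += f" -- {note.strip()}"
    with open(learnings_path, "a") as f:
        f.write(line + "\n")
```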
This is how many skills that learn from every run work in practice — feedback doesn't have to be automated to be useful.
The AutoResearch Pattern: Scaling This Up
If you want to go further than a basic feedback loop, look at the AutoResearch pattern — a system where the skill runs multiple variations of an approach, scores each one, and uses the results to update its own strategy.
The idea comes from Andrej Karpathy’s AutoResearch pattern, applied here to Claude Code skills. The basic concept: instead of running once and hoping for the best, the skill generates several candidate outputs, runs evals against all of them, picks the winner, and logs what made it win.
This produces much faster improvement than single-run feedback, but it also costs more per execution. It’s worth it for high-value, high-frequency tasks. For occasional tasks, the simpler single-run loop is usually sufficient.
The AutoResearch eval loop applies the same binary tests across multiple candidates, scoring each one so the comparison stays objective.
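A minimal best-of-n sketch of that loop; `generate` and `score` are placeholders for your skill's drafting step and eval runner:

```python
def best_of_n(task: str, generate, score, n: int = 3):
    """Generate n candidates, score each, and return the winner plus a
    note on why it won. `generate(task, seed)` produces one candidate;
    `score(candidate)` returns a dict of binary eval results."""
    candidates = [generate(task, seed=i) for i in range(n)]

    # Sum of passed evals is the candidate's score (True counts as 1).
    scored = [(sum(score(c).values()), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    best_score, winner = scored[0]
    note = f"winner passed {best_score} of {len(score(winner))} evals"
    return winner, note
```

The returned note is exactly the kind of observation the wrap-up step would append to the learnings file.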
Chaining Self-Improving Skills Into Larger Workflows
Individual self-improving skills are useful. But the real value comes when you chain them together. A content creation workflow might have:
- A research skill that learns which sources produce better raw material
- A drafting skill that learns which structures perform best
- An editing skill that learns which common errors to catch
Each skill improves independently, but they also hand off context to one another. Chaining Claude Code skills into end-to-end workflows is how you turn a set of individual improving components into a system that gets meaningfully better as a whole.
The key to doing this well is being intentional about what each skill outputs and what the next skill expects. Clean handoffs mean the learnings from one skill stay relevant to its specific function rather than bleeding into work it doesn’t own.
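A sketch of what clean handoffs can look like in code, using hypothetical typed contracts between the three skills above (all function names and fields are illustrative):

```python
from typing import TypedDict

# Each skill owns its own learnings file; only the typed payload
# crosses the boundary between skills.
class ResearchHandoff(TypedDict):
    sources: list[str]
    key_points: list[str]

class DraftHandoff(TypedDict):
    draft: str
    structure: str

def research(topic: str) -> ResearchHandoff:
    return {"sources": [f"source about {topic}"],
            "key_points": [f"{topic} matters"]}

def draft(handoff: ResearchHandoff) -> DraftHandoff:
    body = " ".join(handoff["key_points"])
    return {"draft": body, "structure": "intro-body-cta"}

def edit(handoff: DraftHandoff) -> str:
    return handoff["draft"].strip().capitalize()
```

Because each skill only sees the fields in its input contract, its learnings stay scoped to its own function.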
Common Mistakes to Avoid
A few patterns consistently cause problems when building self-improving skills.
Making the learnings file too long. If learnings.md grows to hundreds of entries without any curation, the signal gets buried in noise. Add a cleanup step that periodically consolidates redundant notes or removes observations that are no longer relevant.
Using subjective evals you can’t score reliably. If the skill is asked to judge “quality” in abstract terms, the scoring will be inconsistent. Stick to binary, concrete assertions.
Not separating the learning step from the main task. If your skill tries to do everything in one step — execute, eval, and update learnings — the logic gets tangled. Keep them separate. The most common mistakes in Claude Code skills often come down to exactly this kind of architectural muddiness.
Expecting too much too fast. The first few runs are data collection. Real behavioral change shows up after enough cycles that the skill has meaningful patterns to work from.
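The cleanup step mentioned earlier, for a learnings file that has grown too long, could start as simple as this sketch: drop exact-duplicate notes and cap the bullet count, keeping headings intact.

```python
def consolidate(learnings_path: str = "learnings.md",
                max_notes: int = 50) -> None:
    """Naive cleanup: remove exact-duplicate bullet notes in place and,
    if the file still holds more than max_notes bullets, drop the oldest."""
    with open(learnings_path) as f:
        lines = f.readlines()

    seen = set()
    deduped = []
    for line in lines:
        if line.startswith("- "):
            key = line.strip().lower()
            if key in seen:
                continue  # exact duplicate; skip it
            seen.add(key)
        deduped.append(line)  # headings and blank lines pass through

    # If still over budget, drop the oldest bullets first.
    bullets = [i for i, line in enumerate(deduped) if line.startswith("- ")]
    to_drop = set(bullets[: max(0, len(bullets) - max_notes)])
    result = [line for i, line in enumerate(deduped) if i not in to_drop]

    with open(learnings_path, "w") as f:
        f.writelines(result)
```

A real consolidation pass would also merge near-duplicates and retire stale observations, which is a better job for the model than for string matching.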
How Remy Connects to This
Remy works at a different level than Claude Code skills — it compiles annotated specs into full-stack applications rather than orchestrating agentic workflows. But the underlying principle is the same: the source of truth should be a structured document that both humans and AI can reason about, and that document should be durable across iterations.
If you’re building systems where Claude Code skills handle repeatable workflows and Remy handles the applications those workflows support, you end up with something cohesive: spec-driven apps that run on infrastructure you own, and agent skills that get smarter every time they run.
For anyone building AI-powered applications from scratch, Remy lets you start from a spec rather than a codebase. The spec stays in sync as the project evolves — which is the same discipline that makes a good learnings.md file actually useful.
FAQ
What is a learnings.md file in Claude Code?
A learnings.md file is a persistent markdown document that a Claude Code skill reads at the start of each run. It contains observations from previous executions — what worked, what didn’t, what patterns to apply or avoid. The skill updates the file after each run, creating a feedback loop that compounds across uses.
Do self-improving Claude Code skills require human feedback?
Not necessarily. You can build a fully automated improvement loop using binary evals (defined in eval.json) that score the skill’s output against concrete criteria. The skill runs, scores its own output, and updates its learnings file based on those scores — no human input needed. That said, adding human ratings creates a stronger signal, especially early on when you’re still defining what “good” looks like.
How is this different from fine-tuning a model?
Fine-tuning changes the model’s weights through additional training. Self-improving Claude Code skills don’t touch the model at all. Instead, they maintain an external memory file (learnings.md) that provides accumulated context at runtime. It’s faster to set up, doesn’t require training data or compute, and can be updated incrementally after every run.
How long does it take for a skill to noticeably improve?
Typically five to ten run cycles before you see clear behavioral changes. The early runs are primarily data collection. Once enough observations accumulate in the learnings file, the skill starts adjusting its approach in meaningful ways. For faster improvement, use the AutoResearch pattern to generate and compare multiple candidate outputs per run.
Can I chain multiple self-improving skills together?
Yes, and this is where the architecture gets genuinely powerful. Each skill in a chain can maintain its own learnings file and improve independently, while still passing clean outputs to downstream skills. The overall system improves at every stage rather than just at one point.
What should I put in eval.json?
Define binary tests that can be answered with a clear yes or no. Good candidates include: does the output contain a required element, is it within the target length, does it avoid a prohibited format, does it include a specific section. Avoid subjective criteria like tone quality or engagement — those can’t be scored reliably without human review.
Key Takeaways
- Self-improving AI skills work by maintaining a persistent learnings file that’s read before and written to after every run.
- Binary evals (eval.json) give you a concrete scoring mechanism that doesn’t depend on subjective judgment.
- The wrap-up step — the logic that translates eval results into learnings file updates — is where the loop closes.
- Chaining self-improving skills creates compound improvement across entire workflows, not just individual tasks.
- Keep learnings files concise and actionable; long, unfocused files lose their signal.
- The AutoResearch pattern accelerates improvement by comparing multiple candidate outputs per run.
If you’re building full-stack applications to support these workflows, try Remy — a spec-driven development environment where the app stays in sync with how you think about it, not just how you coded it.