How to Build Self-Improving AI Skills That Get Better With Every Use
Add a learning loop to your Claude Code skills so they refine themselves over time. Here's the pattern: feedback requests, rules files, and iteration.
The Problem With AI Skills That Stay Static
Most Claude Code skills work the same way on day one as they do on day one hundred. You define the process, the agent follows it, and nothing changes. That’s fine for simple, stable tasks. But for anything that involves judgment — tone, structure, depth, format — a static skill is always going to plateau.
The fix is a learning loop: a pattern where each run of a Claude Code skill produces feedback, that feedback updates a persistent file, and the next run starts from a slightly smarter baseline. This is what separates a skill that’s useful from a skill that compounds over time.
This guide walks through how to build that loop from scratch. You’ll need three things: a mechanism to collect feedback after each run, a rules file or learnings file that stores what the skill has figured out, and an iteration step that incorporates those learnings into future runs.
What a Self-Improving AI Skill Actually Looks Like
Before getting into implementation, it helps to understand what “self-improving” means in concrete terms.
A standard Claude Code skill runs a process, produces output, and stops. The agent has no memory of what it did, no record of whether the output was good, and no way to adjust its approach next time.
A self-improving skill adds a layer after the main output. After the task completes, the skill either prompts for feedback or runs an automated evaluation. That evaluation result — pass/fail, score, or qualitative note — gets written to a persistent file. On the next run, the skill reads that file before it starts working.
Over time, the file accumulates specific, actionable knowledge: what formats worked, what language to avoid, what edge cases to handle differently. The skill isn’t just running the same instructions every time. It’s running instructions that have been refined by real usage.
This is what the learnings loop pattern is built around.
Step 1: Set Up Your Skill Structure
Start with a clean skill architecture. Your skill directory should have:
```
/my-skill
  skill.md       # The process steps — what the skill does
  learnings.md   # What the skill has learned from past runs
  rules.md       # Standing constraints that don't change
  eval.json      # Optional: test cases for automated evaluation
```
The skill.md file describes the process. The learnings.md file is where knowledge accumulates. The rules.md file holds constraints that you set manually and don’t want overwritten.
Keep these files separate. Your skill.md should only contain process steps — it’s not the place for learned preferences or standing rules. When you mix those together, the skill becomes hard to maintain and the learning mechanism loses clarity.
Create Your Learnings File
Start with a minimal learnings.md:
```markdown
# Learnings

## What works
(to be populated)

## What to avoid
(to be populated)

## Edge cases
(to be populated)
```
This file will grow on its own. Your job is just to make sure the skill reads it at the start and writes to it after each run.
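If you want to script the setup, here's a minimal Python sketch that seeds the skill directory with that template. The `my-skill` path is an assumption; adjust it to wherever your skills live.

```python
from pathlib import Path

# Hypothetical location for the skill; adjust to your setup.
SKILL_DIR = Path("my-skill")

LEARNINGS_TEMPLATE = """# Learnings

## What works
(to be populated)

## What to avoid
(to be populated)

## Edge cases
(to be populated)
"""

def bootstrap_skill(skill_dir: Path) -> None:
    """Create the skill directory and seed learnings.md if it doesn't exist."""
    skill_dir.mkdir(parents=True, exist_ok=True)
    learnings = skill_dir / "learnings.md"
    if not learnings.exists():  # never clobber accumulated learnings
        learnings.write_text(LEARNINGS_TEMPLATE)

bootstrap_skill(SKILL_DIR)
```

The `exists()` guard matters: the whole point of the file is that it accumulates, so setup must never overwrite it.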
Reference the Learnings File in Your Skill Instructions
In your skill.md, add a step at the beginning:
1. Read learnings.md before starting. Apply any relevant findings to this run.
And a step at the end:
N. After completing the task, review the output and any feedback. Update learnings.md with any new findings.
That’s the core loop. Everything else builds on top of it.
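In code terms, the loop is just a read-before, write-after wrapper around the task. This Python sketch is illustrative only — in a real skill, Claude performs these steps by following the skill.md instructions, and `run_task` and `feedback_for` stand in for whatever the skill actually does.

```python
from datetime import date
from pathlib import Path

LEARNINGS = Path("learnings.md")  # assumed to live next to skill.md

def run_with_learning_loop(run_task, feedback_for):
    """Read learnings first, run the task, then append a dated finding."""
    # Step 1: start from the accumulated baseline.
    prior = LEARNINGS.read_text() if LEARNINGS.exists() else ""

    # The task itself, with prior learnings as context.
    output = run_task(prior)

    # Step N: record what this run taught us.
    finding = feedback_for(output)
    if finding:
        entry = f"\n- {date.today().isoformat()}: {finding}"
        with LEARNINGS.open("a") as f:
            f.write(entry)
    return output
```

Each run appends rather than rewrites, so the file is a log, not a snapshot.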
Step 2: Build the Feedback Collection Mechanism
There are two ways to collect feedback: manually (you tell the skill what worked) or automatically (the skill evaluates itself using defined criteria). Most setups use both.
Manual Feedback Requests
The simplest approach: at the end of each run, the skill asks a specific question.
In your skill.md, include an instruction like:
After outputting the result, ask the user: "Was this output useful? What, if anything, should be done differently next time?"
Then instruct the skill to write whatever the user says into learnings.md under a timestamped entry.
This is low-tech, but it works well for skills where quality is subjective. You’re building a running log of preferences that the skill can reference on future runs. The learnings.md file approach keeps everything in plain text, which means it’s easy to read, edit, and version-control.
Automated Evaluation with Evals
For more rigorous self-improvement, you can add automated evaluation. This means defining a set of pass/fail tests that the skill runs against its own output before it closes.
The eval lives in eval.json:
```json
{
  "evals": [
    {
      "description": "Output contains a clear headline",
      "type": "binary",
      "check": "Does the output contain a headline in the first line?"
    },
    {
      "description": "Word count is between 300 and 500",
      "type": "binary",
      "check": "Is the word count between 300 and 500?"
    }
  ]
}
```
After each run, Claude evaluates the output against each test and records the results. Tests that fail consistently become entries in learnings.md — a signal that the skill needs to adjust its approach.
Binary evals are more reliable than subjective scoring for this kind of automated loop. “Did it include a headline?” is a clear yes or no. “Was this engaging?” requires judgment that can vary. Start with binary tests, then layer in subjective feedback from real usage.
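Natural-language checks like the ones above are judged by Claude, but some binary criteria can also be verified mechanically. Here's a hedged Python sketch of the two example checks — an illustration of the idea, not part of any Claude Code API, and the function names are invented for this example.

```python
def check_headline(output: str) -> bool:
    """Pass if the first line is a markdown headline."""
    lines = output.strip().splitlines()
    return bool(lines) and lines[0].startswith("#")

def check_word_count(output: str) -> bool:
    """Pass if the word count is between 300 and 500 inclusive."""
    return 300 <= len(output.split()) <= 500

def run_evals(output: str) -> dict:
    """Run each binary check and return a name -> pass/fail mapping."""
    checks = {
        "has_headline": check_headline,
        "word_count_300_500": check_word_count,
    }
    return {name: fn(output) for name, fn in checks.items()}
```

Note how each check returns a plain boolean. The moment a check needs a judgment call, it belongs in the natural-language eval for Claude to assess, not in code.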
If you want a full walkthrough of building evals for this pattern, the practical guide to writing evals for AI agents covers it without assuming engineering background.
Step 3: Write the Rules File
The learnings file captures what the skill discovers on its own. The rules file captures what you decide manually — constraints that should never be overridden by the learning loop.
A rules file for AI agents works like standing orders. You write them once and they apply to every run, regardless of what the learnings file says.
A simple rules.md might look like:
```markdown
# Rules

- Always write in second person
- Never use the phrase "it is important to note"
- Output must be in markdown format
- Do not include pricing information
- Keep examples under 3 sentences
```
Reference the rules file in your skill.md alongside the learnings file:
1. Read rules.md. These constraints apply to every run and cannot be overridden.
2. Read learnings.md. Apply relevant findings to this run.
The distinction matters. Learnings can evolve — if the skill repeatedly gets feedback that a certain format isn’t working, it should update. Rules don’t evolve through the learning loop. They’re decisions you’ve made as the person building the skill.
Step 4: Implement the Wrap-Up Step
The wrap-up step is what actually makes the skill self-improving. Without it, feedback gets collected but never used again.
At the end of your skill.md, add a dedicated wrap-up block:
```
WRAP-UP:
1. Review the output from this run
2. Check any feedback provided (user input or eval results)
3. Identify one to three specific findings — things that worked, things that didn't, edge cases encountered
4. Open learnings.md
5. Add the findings as dated entries under the appropriate section
6. Do not remove existing entries unless they directly contradict the new findings
```
The instruction to not delete existing entries is important. Early in a skill’s life, learnings can seem contradictory. You want to preserve the history. As the skill matures, you can periodically review and consolidate the file manually.
This pattern — run, evaluate, write, read next time — is what the compounding knowledge loop is built on. Each run makes the skill slightly more calibrated. Over dozens of runs, the difference is significant.
Step 5: Test the Loop With a Real Run
Once the structure is in place, run the skill and watch what happens.
The first run will produce generic output — the learnings file is empty. After the run, either provide manual feedback or let the eval run. Check that learnings.md was updated. Then run the skill again and see if it incorporates what was written.
A few things to verify:
- Does the skill actually read learnings.md at the start? Add an explicit step like "Confirm you have read learnings.md by summarizing the most relevant finding."
- Does the wrap-up step produce useful entries, or vague ones? Vague entries like "be clearer" don't help. Push for specific ones like "when summarizing technical content, use numbered lists instead of prose."
- Is anything being overwritten that shouldn't be? Review the file after three or four runs.
If the skill is not reading or writing the learnings file correctly, the problem is usually in the skill.md instructions. Make the steps more explicit. Claude follows instructions precisely — if the instruction says “consider updating learnings.md,” it might not do it. If it says “open learnings.md and append the following,” it will.
Step 6: Add Automated Improvement With Evals
Once manual feedback is flowing, you can layer in the AutoResearch-style evaluation loop. This is where the skill starts to improve itself without you in the loop.
The pattern works like this:
- The skill runs and produces output
- The skill evaluates its own output against eval.json
- Tests that fail get flagged in learnings.md with a note on what went wrong
- The next run starts with knowledge of that failure and tries a different approach
- If the new approach passes the eval, that finding gets reinforced in learnings.md
This is a simplified version of Andrej Karpathy’s AutoResearch pattern applied to Claude Code skills. The underlying idea is the same: run, score, iterate, accumulate.
The key constraint is that evals have to be well-defined. If your tests are vague, the feedback they generate is vague, and the learnings are useless. Spend time on the eval design before automating the loop. Building reliable AI skill tests requires more upfront thought than most people expect — but it pays off once the loop is running.
Step 7: Manage the Learnings File Over Time
After twenty or thirty runs, learnings.md will get long. Some entries will be redundant. Some will be outdated. A few will contradict each other.
Build a periodic review into your workflow. Every two weeks or after a significant change in how you use the skill, open learnings.md and:
- Remove entries that are no longer relevant
- Consolidate duplicates into single, cleaner entries
- Move anything that has become a permanent preference into rules.md
- Note the date of the review at the top of the file
This keeps the file useful. A bloated learnings file can slow the skill down and introduce noise — the agent has more to read and more conflicting signals to reconcile.
You can also build a wrap-up skill that handles this consolidation automatically. That’s a separate skill whose only job is to review and clean up the learnings files across your other skills. It’s worth building once you have three or more self-improving skills running.
Common Patterns and Use Cases
The learning loop works across a wide range of Claude Code skill types. Here are a few concrete applications.
Content and Marketing Skills
A skill that writes ad copy, blog posts, or email subject lines benefits enormously from accumulated feedback. After each piece, you note what performed well and what didn’t. Over time, the skill develops a specific voice calibrated to your audience. Self-improving marketing skills built this way outperform static prompts because they’re shaped by real results, not assumptions.
Research and Summarization Skills
Skills that pull and summarize information can track which formats are most useful, which sources tend to be reliable, and which question types need more depth. The learnings file becomes a record of research preferences.
Data Processing and Reporting Skills
For skills that format, clean, or analyze data, evals are especially useful. You can define exact structural requirements as binary tests and let the skill self-correct until it consistently passes. If you want to go further, you can use AutoResearch to optimize business metrics autonomously using the same underlying pattern.
A/B Testing Skills
Skills that generate variations for testing can track which variants performed better and use that data to inform future generations. This is how self-improving A/B testing agents work — they don’t just generate options, they remember what worked.
Where Remy Fits
Remy is built around a different problem — compiling a spec into a full-stack application — but the underlying idea connects directly.
In both cases, the source of truth is a structured document, not the output. For Claude Code skills, the skill.md file is the source of truth. For Remy, the spec is. Both approaches mean that when you want to improve something, you update the document and the output follows.
The difference is scope. Claude Code skills handle individual tasks within a workflow. Remy handles the full application — backend, database, auth, frontend, deployment. If you’re building a tool to run your self-improving skill system — a dashboard for reviewing learnings, a UI for approving feedback, an interface for managing your skill library — Remy is a practical way to build it fast.
You describe the application in a spec, and Remy compiles it into a real full-stack app. No scaffolding, no stitching together infrastructure. You can try it at mindstudio.ai/remy.
FAQ
How is a learnings file different from a rules file?
A rules file contains constraints you set manually that apply to every run. A learnings file contains findings the skill discovers through use — what worked, what didn’t, what edge cases came up. Rules are stable. Learnings evolve. Both are read at the start of each run, but only the learnings file gets written to during the wrap-up step.
How many runs does it take before the skill noticeably improves?
Most skills show meaningful improvement after five to ten runs, assuming feedback is specific. Vague feedback like “make it better” won’t help. Feedback like “the second paragraph was too long and the conclusion was missing a clear next step” will. The more precise the feedback, the faster the skill calibrates.
Can you build this without an eval.json file?
Yes. Manual feedback alone is enough to run the learning loop. The eval.json file adds automation — it lets the skill evaluate itself without waiting for a human. But it’s optional, especially early on. Start with manual feedback, get the loop working, then add automated evals once you know what good output looks like for your use case.
What happens if the skill writes conflicting learnings?
This is normal, especially early in a skill’s life. The skill will try to reconcile conflicts when it reads the file, but you’ll get better results if you review and clean up learnings.md periodically. When entries contradict each other, either remove the older one or consolidate them into a single, clearer finding. Moving settled preferences to rules.md also helps reduce noise over time.
Can you chain multiple self-improving skills together?
Yes, and it’s a common pattern. Each skill in the chain has its own learnings file. Outputs from one skill become inputs to the next. The learnings accumulate independently, but they can reference each other. If you want to understand how chaining Claude Code skills into full workflows works in practice, that’s a good place to start.
Does this work with scheduled or automated runs?
It does. Skills that run on a schedule — daily reports, weekly summaries, automated research tasks — benefit especially from the learning loop because they run without human oversight. The skill collects its own feedback via evals, updates the learnings file, and applies those learnings on the next scheduled run. Self-improving agents with scheduled tasks use exactly this pattern.
Key Takeaways
- A self-improving Claude Code skill needs three components: a feedback mechanism, a persistent learnings file, and a wrap-up step that writes findings after each run.
- Keep rules.md and learnings.md separate. Rules are constraints you set. Learnings are what the skill discovers through use.
- Binary evals are more reliable than subjective scoring for automated improvement loops. Define clear pass/fail criteria before automating.
- The skill.md file should only contain process steps. Reference the learnings and rules files from there — don't embed them in the main skill instructions.
- Periodic cleanup of learnings.md is necessary. Bloated or contradictory learnings introduce noise. Review the file regularly and move stable preferences into rules.md.
- The loop compounds. Improvement is slow at first and accelerates as the learnings file becomes more specific and accurate.
If you’re building tools to support this kind of workflow — a dashboard, a management interface, a review system — try Remy to build them without starting from scratch.