How to Build a Claude Code Skill That Learns From Every Run
Add a learnings loop to your Claude Code skills so they automatically capture what works and improve over time without manual eval runs after every session.
The Problem: Skills That Don’t Remember
Every Claude Code workflow has the same blind spot. It can produce excellent output in a single session, but the moment that session ends, all the context disappears. The next time someone runs the same Claude skill — same repo, similar task — it starts completely fresh.
That means:
- Mistakes from last week get repeated
- Edge cases discovered during testing have to be re-documented manually
- The only way to “teach” the skill something new is to update the prompt by hand or add explicit instructions to a config file
Most teams handle this by running evaluation cycles after major changes — checking outputs, refining prompts, then repeating. It works, but it’s slow, manual, and doesn’t capture the small stuff: the weird edge case in a specific file, the pattern that worked well for one type of task but not another.
A learnings loop is a more practical alternative. It instruments your Claude Code skill to automatically capture what happened after each run, stores those observations, and retrieves relevant ones at the start of future runs. The skill gets smarter over time — without any manual eval work between sessions.
This guide walks through building that loop from scratch: schema design, capture, storage, retrieval, injection, and noise control.
What a Learnings Loop Actually Does
Before building anything, it helps to be precise about what this system does and doesn’t do.
A learnings loop is not a general memory system. It doesn’t store conversation history or try to build a knowledge graph of everything the skill has ever done. That approach tends to get noisy fast and can actually hurt performance by polluting context with irrelevant information.
Instead, a learnings loop captures structured observations — specific, typed data points about what happened during a run. Each observation answers a small set of concrete questions:
- What type of task was this?
- What approach did the skill take?
- Did it succeed or fail?
- What worked well, and what didn’t?
- Were there any edge cases worth flagging?
These observations are stored in a format that makes them retrievable by task type or semantic similarity. At the start of each new run, the skill fetches the most relevant past observations and injects them into the system context before Claude begins reasoning about the current task.
How This Differs From Standard Few-Shot Prompting
Few-shot prompting gives Claude examples manually. A learnings loop generates examples automatically from real production runs.
This matters because manually curated examples tend to cover the happy path. Real learnings capture edge cases, failures, and approaches that worked in non-obvious situations — exactly the information that’s hardest to maintain by hand.
Manually curated examples also go stale. Learnings tied to specific runs have timestamps, version tags, and confidence scores that let you decay or supersede them as conditions change.
Design the Learnings Schema
The schema is the most consequential design decision in this whole system. Too few fields and the observations aren’t actionable. Too many and the injection block becomes noisy.
A good learnings record answers these questions in structured form:
Core fields
- `task_type` — a short category label (e.g., `"file_refactor"`, `"api_integration"`, `"test_generation"`)
- `input_summary` — one or two sentences describing what the task asked for
- `approach` — a brief description of the strategy the skill used
- `outcome` — `"success"` | `"partial"` | `"failure"`
- `confidence` — a float between 0 and 1 (used for retrieval ranking)
Observation fields
- `what_worked` — free text, can be null
- `what_failed` — free text, can be null
- `edge_cases` — array of strings, each describing a specific gotcha
Metadata fields
- `id` — unique identifier
- `timestamp` — ISO 8601
- `tags` — array of strings for keyword retrieval
- `run_duration_ms` — optional, useful for spotting performance regressions
Here’s a complete success record:
```json
{
  "id": "learn_20240115_001",
  "timestamp": "2024-01-15T14:32:00Z",
  "task_type": "file_refactor",
  "input_summary": "Refactor authentication module to use JWT",
  "approach": "Decomposed into three atomic commits: extract token logic, update middleware, update tests",
  "outcome": "success",
  "confidence": 0.9,
  "what_worked": "Atomic commit structure prevented merge conflicts with parallel feature work",
  "what_failed": null,
  "edge_cases": ["Existing active sessions required a migration path not in the original spec"],
  "tags": ["refactor", "auth", "jwt"],
  "run_duration_ms": 45000
}
```
And a failure:
```json
{
  "id": "learn_20240115_002",
  "timestamp": "2024-01-15T15:10:00Z",
  "task_type": "api_integration",
  "input_summary": "Add Stripe webhook endpoint",
  "approach": "Direct endpoint creation with payload parsing",
  "outcome": "failure",
  "confidence": 0.1,
  "what_worked": null,
  "what_failed": "Missed STRIPE_WEBHOOK_SECRET env check — runtime errors in staging",
  "edge_cases": ["Test and production webhooks use different signing keys"],
  "tags": ["api", "stripe", "webhooks", "env"],
  "run_duration_ms": 12000
}
```
Keep free-text fields under 200 characters each. Long observations don’t inject well into context and force Claude to read more than it needs.
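One way to enforce that cap is a small guard that runs before each record is saved. This is a minimal sketch; `clampLearning` and the field list are illustrative names, not part of any existing API:

```javascript
// Free-text fields subject to the 200-character cap.
const FREE_TEXT_FIELDS = ['input_summary', 'approach', 'what_worked', 'what_failed'];

// Truncate over-long text fields so injected learnings stay compact.
function clampLearning(learning, maxLen = 200) {
  const clamped = { ...learning };
  for (const field of FREE_TEXT_FIELDS) {
    const value = clamped[field];
    if (typeof value === 'string' && value.length > maxLen) {
      // Keep maxLen total characters, ending with an ellipsis marker.
      clamped[field] = value.slice(0, maxLen - 1) + '…';
    }
  }
  return clamped;
}
```

Null fields pass through untouched, so `what_worked: null` in a failure record stays null.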
Build the Capture Step
The capture step runs after each skill execution. Its job is to extract a structured learning from the run and hand it to the storage layer.
Where to Run It
The natural place is a post-execution hook — a function that fires when the main task completes, regardless of outcome. If you’re using a custom MCP tool, trigger capture from a cleanup function. If you’re orchestrating via custom slash commands, add a dedicated “capture learning” step at the end of your command sequence.
The key principle: capture must be automatic, not optional. If it requires a human to trigger, it won’t happen consistently enough to be useful.
How to Extract the Learning
The simplest approach is to ask Claude itself to extract the learning at the end of a run. Add a final step to your skill’s execution flow with a structured extraction prompt:
```
You just completed the following task:
[task description]

Here is a summary of what happened:
[execution log or final state]

Extract a learning record using the following JSON schema:
[schema]

Be concise. Keep each text field under 200 characters.
Classify outcome as "success", "partial", or "failure".
```
This works because Claude can assess its own reasoning — it knows what approach it took and can identify what was novel or problematic.
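The reply still needs parsing before it reaches storage. Here is one hedged sketch of that step, assuming the model returns the record as a JSON object, possibly wrapped in a code fence; `parseLearningReply` is a hypothetical helper, not an existing API:

```javascript
// Outcomes permitted by the schema.
const VALID_OUTCOMES = new Set(['success', 'partial', 'failure']);

// Pull the JSON record out of Claude's extraction reply and validate it.
function parseLearningReply(replyText) {
  // Grab the outermost {...} span, ignoring any surrounding fence or prose.
  const match = replyText.match(/\{[\s\S]*\}/);
  if (!match) throw new Error('No JSON object found in extraction reply');
  const record = JSON.parse(match[0]);
  if (!VALID_OUTCOMES.has(record.outcome)) {
    throw new Error(`Invalid outcome: ${record.outcome}`);
  }
  return record;
}
```

Failing loudly here is deliberate: a malformed learning that slips into the store silently pollutes every future retrieval.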
Handling Partial Runs
Not every run will complete cleanly. Some will time out, hit rate limits, or produce output that’s technically correct but practically wrong. The "partial" outcome captures this gap.
Partial runs are often the most valuable learnings. They record the difference between what the skill thought it accomplished and what actually got used — which is exactly the feedback that improves future runs.
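A sketch of how the three-way classification might be derived from a run's final state. The `result` shape here (`completed`, `error`, `outputUsedVerbatim`) is an assumption for illustration — adapt it to whatever your execution hook actually observes:

```javascript
// Map a run's final state to the schema's outcome field.
function classifyOutcome(result) {
  // Errored or never finished: a clear failure.
  if (result.error || !result.completed) return 'failure';
  // Completed, but the output needed modification before use: partial.
  if (result.outputUsedVerbatim === false) return 'partial';
  return 'success';
}
```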
Build the Storage Layer
The right storage option depends on your run volume and whether you need semantic retrieval.
Option 1: Local JSONL File
For light usage (fewer than 500 learnings), a local .jsonl file is the simplest option. Append each learning as a new line. At retrieval time, read the file and filter by task_type or tags.
```javascript
import { appendFileSync, existsSync, readFileSync } from 'fs';

function saveLearning(learning) {
  appendFileSync('./learnings.jsonl', JSON.stringify(learning) + '\n');
}

function getLearnings() {
  // Before the first run, the file won't exist yet.
  if (!existsSync('./learnings.jsonl')) return [];
  return readFileSync('./learnings.jsonl', 'utf8')
    .split('\n')
    .filter(Boolean)
    .map(line => JSON.parse(line));
}
```
Zero infrastructure. Easy to inspect and edit manually. Gets slow above a few hundred records.
Option 2: SQLite
For medium usage (500–10,000 learnings), SQLite gives you fast tag-based retrieval without needing a server. Use the better-sqlite3 package for synchronous reads. SQLite’s FTS5 extension adds full-text search across input_summary, what_worked, and what_failed — no extra dependencies.
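One possible table layout, sketched as DDL: a main table for filtering and ranking, plus an FTS5 virtual table indexing the free-text fields. The exact columns are an assumption, not a prescribed schema:

```sql
-- Main store: structured fields for filtering, full record as JSON text.
CREATE TABLE learnings (
  id         TEXT PRIMARY KEY,
  timestamp  TEXT NOT NULL,   -- ISO 8601
  task_type  TEXT NOT NULL,
  outcome    TEXT NOT NULL,   -- 'success' | 'partial' | 'failure'
  confidence REAL NOT NULL,
  record     TEXT NOT NULL    -- full learning record, serialized JSON
);

-- Full-text index over the free-text fields; id is stored but not indexed.
CREATE VIRTUAL TABLE learnings_fts USING fts5(
  id UNINDEXED,
  input_summary,
  what_worked,
  what_failed
);
```

A keyword lookup then becomes `SELECT id FROM learnings_fts WHERE learnings_fts MATCH 'webhook'`, joined back to `learnings` for ranking by confidence.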
Option 3: Vector Store
For large-scale usage or when task descriptions vary in wording, a vector store like Chroma lets you find relevant learnings by semantic similarity rather than exact tag matching.
Store the embedding of input_summary as the primary index. At retrieval time, embed the current task description and run a nearest-neighbor search. A cosine similarity threshold of 0.75 tends to work well in practice — low enough to catch paraphrasing, high enough to avoid irrelevant matches.
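If you roll the nearest-neighbor step yourself rather than using a vector store's query API, the similarity gate is a few lines. This sketch assumes embeddings are plain number arrays (from whatever embedding model you use) stored on each record as an `embedding` field:

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep learnings above the similarity threshold, best matches first.
function relevantBySimilarity(queryEmbedding, learnings, threshold = 0.75) {
  return learnings
    .map(l => ({ learning: l, score: cosineSimilarity(queryEmbedding, l.embedding) }))
    .filter(x => x.score >= threshold)
    .sort((x, y) => y.score - x.score)
    .map(x => x.learning);
}
```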
Option 4: Cloud-Backed Storage
If your skills run across multiple developers or machines, you’ll need cloud-hosted storage with a shared schema. The fastest path is building a retrieval workflow in a platform like MindStudio and calling it from your skill — more on that in the next section.
Build the Retrieval and Injection Step
Retrieval runs before each skill execution. It fetches the most relevant past learnings and formats them for injection into the current run’s context.
Retrieval Logic
For JSONL or SQLite:
- Filter by `task_type` matching the current task
- Sort by `confidence` descending
- Take the top 5 records
- Include at least one failure record if available (failures teach more than successes)
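The JSONL/SQLite steps above can be sketched over an in-memory array. The signature here is illustrative (the article's later snippet passes a task description as well, for the vector-store case):

```javascript
// Rank past learnings for a task: filter, sort by confidence, take the
// top N, and guarantee at least one failure record when any exist.
function getLearningsForTask(learnings, taskType, limit = 5) {
  const matching = learnings
    .filter(l => l.task_type === taskType)
    .sort((a, b) => b.confidence - a.confidence);

  const top = matching.slice(0, limit);

  // Failures teach more than successes: swap one in if the top slice
  // would otherwise contain only successes.
  if (!top.some(l => l.outcome === 'failure')) {
    const failure = matching.find(l => l.outcome === 'failure');
    if (failure) top[top.length - 1] = failure;
  }
  return top;
}
```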
For vector stores:
- Embed the current task description
- Run a nearest-neighbor search with a similarity threshold
- Return the top 5 results, deduplicated by `approach`
Formatting Learnings for Context Injection
How you format the retrieved learnings matters as much as what you retrieve. Learnings should be injected into the system prompt, not the user turn. Here’s a template that works well:
## Past Run Learnings (auto-generated)
The following observations were captured from previous runs of this skill.
Treat them as operational guidance, not hard rules.
**Relevant successes:**
- [2024-01-15] JWT auth refactor: Atomic commit structure prevented merge conflicts.
Edge case: Existing sessions needed a migration path not in the original spec.
**Relevant failures:**
- [2024-01-15] Stripe webhook integration: Missed env secret check caused staging errors.
Edge case: Test and production webhooks use different signing keys.
Keep the injected block under 500 tokens. If you have more than 5 relevant learnings, summarize the oldest ones into a single combined bullet rather than appending them all.
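A possible implementation of the formatter, mirroring the template above. The record fields follow the schema from earlier; the grouping into successes and failures is one reasonable choice, not the only one:

```javascript
// Render retrieved learnings as the markdown block injected into context.
function formatLearningsBlock(learnings) {
  const successes = learnings.filter(l => l.outcome === 'success');
  const failures = learnings.filter(l => l.outcome !== 'success');

  const bullet = l => {
    const date = l.timestamp.slice(0, 10); // ISO date portion
    const note = l.what_worked || l.what_failed || l.approach;
    const edges = (l.edge_cases || []).map(e => `\n  Edge case: ${e}`).join('');
    return `- [${date}] ${l.input_summary}: ${note}${edges}`;
  };

  return [
    '## Past Run Learnings (auto-generated)',
    'The following observations were captured from previous runs of this skill.',
    'Treat them as operational guidance, not hard rules.',
    '',
    '**Relevant successes:**',
    ...successes.map(bullet),
    '',
    '**Relevant failures:**',
    ...failures.map(bullet)
  ].join('\n');
}
```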
Where to Inject
In Claude Code, persistent project-level context lives in CLAUDE.md. For dynamic learnings that change run-to-run, inject at the API call level — prepend the formatted learnings block to the system prompt before each execution:
```javascript
const relevantLearnings = await getLearningsForTask(taskType, taskDescription);

const systemPrompt = [
  BASE_PROMPT,
  formatLearningsBlock(relevantLearnings)
].join('\n\n');
```
Handle Edge Cases and Noise
Left unchecked, a learnings loop will eventually accumulate noise. Here are the main failure modes and how to address them.
Contradictory Learnings
Over time, you may accumulate learnings that conflict — “don’t use approach X” and “approach X worked well” both in your store. Handle this with a supersedes_id field. When a new learning contradicts an older one, set supersedes_id to the older record’s ID. At retrieval time, filter out superseded records.
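The retrieval-time filter is a two-pass sweep: collect every superseded ID, then drop those records. A minimal sketch:

```javascript
// Remove any record whose id appears in another record's supersedes_id.
function filterSuperseded(learnings) {
  const superseded = new Set(
    learnings.map(l => l.supersedes_id).filter(Boolean)
  );
  return learnings.filter(l => !superseded.has(l.id));
}
```

Superseded records stay in the store for auditability; they just never reach the injection block.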
Stale Learnings
A learning from six months ago about a library version that’s since changed may be actively harmful. Add a stack_tags field (e.g., ["react-18", "node-20"]) and filter by matching tags at retrieval. For age-based decay, multiply confidence by 0.9 for every 30 days of age. Sort retrieval by effective confidence — confidence × age_decay_factor — rather than raw confidence.
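The decay rule above translates directly into an effective-confidence function:

```javascript
// Effective confidence = confidence × 0.9^(ageInDays / 30).
function effectiveConfidence(learning, now = Date.now()) {
  const ageDays = (now - Date.parse(learning.timestamp)) / 86_400_000;
  return learning.confidence * Math.pow(0.9, ageDays / 30);
}
```

Sorting retrieval by `effectiveConfidence` instead of raw `confidence` is then a one-line change in the comparator.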
Learning Quality Degradation
If the skill performs poorly, the learnings it captures will also be poor. Monitor the outcome distribution in your store periodically. If more than 40% of recent learnings have outcome: "failure", it’s a signal that the base skill needs attention — not just more learnings.
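That health check is cheap to automate. A sketch, computing the failure share among the most recent records (the window size of 50 is an arbitrary choice):

```javascript
// Failure rate among the N most recent learnings, by timestamp.
function recentFailureRate(learnings, windowSize = 50) {
  const recent = [...learnings]
    .sort((a, b) => b.timestamp.localeCompare(a.timestamp))
    .slice(0, windowSize);
  if (recent.length === 0) return 0;
  const failures = recent.filter(l => l.outcome === 'failure').length;
  return failures / recent.length;
}
```

Run it on a schedule or at capture time, and alert when the rate crosses the 40% threshold.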
Extending the Loop With Explicit Feedback
Automatic capture gets you most of the way there, but a simple feedback channel significantly improves learning quality.
Post-Run Confirmation
After each successful run, surface a single question to the user or calling system: “Did this output need modification before use?”
Store the answer as a human_confirmed boolean. In retrieval, prefer learnings where human_confirmed: true when building the injection block. This one field does a lot of work — it separates runs where Claude technically succeeded from runs where the output was actually useful.
Manual Learning Injection
Some knowledge doesn’t come from a run. It comes from a code review comment, a spec change, or a post-mortem note. Support manual injection by exposing an addLearning() method that accepts the same schema. Set confidence: 1.0 and human_confirmed: true for manually added records. These get prioritized in retrieval automatically.
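A sketch of that entry point, writing to an in-memory store for illustration; the ID scheme and defaults are assumptions:

```javascript
// Add a human-authored learning: same schema, pinned confidence,
// marked human_confirmed so retrieval prioritizes it.
function addManualLearning(store, partial) {
  const record = {
    id: `learn_manual_${Date.now()}`,
    timestamp: new Date().toISOString(),
    edge_cases: [],
    tags: [],
    ...partial,
    confidence: 1.0,       // pinned: human knowledge outranks inferred
    human_confirmed: true
  };
  store.push(record);
  return record;
}
```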
How MindStudio Handles the Infrastructure Layer
Building the storage and retrieval layer from scratch is the heaviest part of this system. You need to manage a database, write retrieval logic, handle auth, and deal with rate limits if you’re calling external services.
The MindStudio Agent Skills Plugin offers a faster path. It’s an npm SDK (@mindstudio-ai/agent) that lets Claude Code call MindStudio capabilities as typed method calls — including agent.runWorkflow(), which invokes any workflow you’ve built in MindStudio’s visual editor.
This means you can build the learnings storage and retrieval logic as a MindStudio workflow — using Airtable, Notion, or any of its 1,000+ integrations as the backing store — and call it from your Claude Code skill in just a few lines:
```javascript
import { MindStudioAgent } from '@mindstudio-ai/agent';

const agent = new MindStudioAgent();

// Save a learning after each run
await agent.runWorkflow('save-learning', { learning: learningRecord });

// Fetch relevant learnings before the next run
const learnings = await agent.runWorkflow('fetch-learnings', {
  taskType,
  taskDescription
});
```
The workflow handles storage logic, schema validation, and retrieval ranking. Your skill code stays clean. MindStudio manages the infrastructure layer — rate limiting, retries, auth — so the skill can focus on reasoning.
A complete learnings workflow in MindStudio can be set up in under an hour, connecting to Airtable for persistent storage and using a built-in embedding model for semantic retrieval. If you want to share a learnings store across a team, Airtable’s multi-user structure means every developer’s runs contribute to the same pool automatically.
This also pairs well with building Claude automation workflows in MindStudio more broadly — if your skills call other agents or trigger downstream processes, MindStudio’s workflow engine handles those connections too.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
How many runs does it take before the learnings loop has a noticeable effect?
Most skills start showing measurable improvement after 20–30 runs. The first batch of learnings establishes baseline patterns; edge case coverage grows quickly after that. For narrow, task-specific skills — like refactoring tests in a specific repo — even 10–15 learnings can capture most of the relevant operational knowledge.
Does injecting learnings significantly increase token usage?
Modestly. A well-formatted learnings block adds roughly 200–400 tokens per run, depending on how many records you inject. At typical API pricing this is negligible for most workloads. The more important constraint is context window pressure: keep the injected block under 500 tokens to leave room for complex tasks with long inputs.
What happens when the underlying codebase changes significantly?
Major changes — framework upgrades, architectural rewrites — can invalidate portions of your learnings store. The best practice is to version-tag learnings against the relevant stack (e.g., tags: ["nextjs-14", "react-18"]) and filter by matching tags at retrieval time. After a major upgrade, older version-tagged learnings are naturally deprioritized without needing to delete them.
Can this loop work with agents other than Claude Code?
Yes. The capture-store-retrieve-inject pattern is architecture-agnostic. It works with LangChain agents, CrewAI workflows, or any other agentic system that accepts a system prompt. The MindStudio Agent Skills Plugin supports Claude Code, LangChain, and custom agents as first-class targets.
How do I prevent learnings from drifting toward overconfidence?
Overconfidence happens when successful runs keep reinforcing the same approach without capturing the conditions that made it work. Guard against this by including input_summary in every learning — so retrieval can distinguish “this worked for a 200-line module” from “this worked for a 5,000-line module.” Keep confidence scores tied to specific task conditions rather than broad task categories.
Should I share a learnings store across multiple developers?
For project-specific skills, a shared store works well — everyone’s runs contribute to a richer operational picture. For general-purpose skills deployed across different projects, use per-project stores to avoid cross-contamination. If you’re using MindStudio’s workflow-backed storage, scoping a store by project is a single filter condition in the workflow logic.
Key Takeaways
- A learnings loop instruments your Claude Code skill to automatically capture structured observations after each run, then inject relevant ones at the start of future runs — no manual eval cycles required.
- The four components are: capture (post-run extraction via Claude), store (JSONL, SQLite, or vector DB), retrieve (filtered or semantic search), and inject (formatted system context).
- Keep learning records concise — under 200 characters per text field — and limit injection to 5 records and 500 tokens per run.
- Handle noise proactively: supersede contradictory learnings, apply age-based confidence decay, and monitor your outcome distribution.
- A `human_confirmed` boolean on each learning, collected via a simple post-run question, significantly improves retrieval quality over time.
- MindStudio’s Agent Skills Plugin lets you offload storage and retrieval infrastructure to a visual workflow, keeping your skill code minimal while connecting to Airtable, Notion, or any other backing store.
Building this loop once means every Claude workflow run contributes to a growing body of operational knowledge. The skill gets better without anyone having to maintain it manually — which is the whole point of automation in the first place.