Claude Code Skills: How to Build Standard Operating Procedures for Your AI Agent
Claude Code skills are reusable process documents that load context at the right time. Learn how to build skills that produce expert-level outputs.
Why Your AI Agent Keeps Producing Inconsistent Results
You give Claude the same type of task twice. You get two completely different outputs. One is excellent. The other misses the mark. You can’t tell which one you’ll get next time.
This is the core problem with ad-hoc prompting. Every session starts from scratch. Claude doesn’t know your process, your standards, or your preferences unless you explain them again. And even then, the output depends on how well you explained things in the moment.
Claude Code skills solve this. A skill is a reusable process document — essentially a standard operating procedure (SOP) for your AI agent. Instead of re-explaining the task every time, you write the process once, store it as a file, and Claude loads it automatically when needed.
This guide explains how to build skills that actually work: what goes in them, how to structure them, and how to avoid the mistakes that make skills brittle or unreliable.
What Makes a Skill Different From a Prompt
Most people treat Claude like a smart search engine. They type what they want, read the result, and move on. That works fine for one-off questions.
But agents need something different. They need repeatable behavior across many runs, not a fresh interpretation every time.
Claude Code skills are Markdown files that live in your project directory. When Claude runs a task, it reads the relevant skill file and follows the steps it contains. The skill defines how the work gets done — the sequence, the standards, the output format.
Think of it this way:
- A prompt tells Claude what you want right now.
- A skill tells Claude how to approach a category of work, every time.
The difference is procedural permanence. A prompt is a request. A skill is a trained process.
The Anatomy of a Claude Code Skill
Every skill has a predictable structure. Understanding the pieces makes it much easier to write good ones.
The skill.md File
This is the core of the skill. It contains the process steps and nothing else: no background context, no examples, no general guidance. Just the ordered sequence of what to do.
A minimal skill.md looks like this:
```markdown
# Blog Post Writing Skill

## Steps

1. Read the brief from `brief.md` and identify: target audience, primary keyword, article goal.
2. Check `brand-voice.md` for tone requirements and banned phrases.
3. Outline the article: intro, 5–7 H2 sections, FAQ, conclusion.
4. Write each section sequentially. Keep paragraphs under 4 sentences.
5. Confirm the primary keyword appears in the first 100 words and at least one H2.
6. Output the complete article in Markdown.
```
Short. Sequential. Actionable. That’s the goal.
Reference Files
Reference files hold the context that would otherwise bloat the skill.md. Your brand guidelines, style rules, example outputs, client preferences — all of this lives in separate files that the skill.md points to.
The skill loads these files when it needs them, rather than keeping everything in memory at once. This matters because context window bloat degrades performance. Context rot is a real problem: when skill files get stuffed with too much information, the agent’s outputs get fuzzier, not sharper.
The Directory Structure
A typical skill setup looks like this:
```
.claude/
  skills/
    blog-writing/
      skill.md
      references/
        brand-voice.md
        seo-checklist.md
        example-output.md
```
Keep it organized. One skill per directory. References in a subfolder.
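If you create skills often, the layout above is easy to scaffold with a short script. A minimal sketch using Python's `pathlib` — the skill name and reference filenames are just the examples from the tree above, not anything Claude Code requires:

```python
from pathlib import Path

def scaffold_skill(root: Path, name: str, references: list[str]) -> None:
    """Create the standard skill layout: skill.md plus a references/ subfolder."""
    skill_dir = root / ".claude" / "skills" / name
    (skill_dir / "references").mkdir(parents=True, exist_ok=True)
    # Start skill.md with a heading and an empty Steps section to fill in.
    (skill_dir / "skill.md").write_text(f"# {name} Skill\n\n## Steps\n")
    for ref in references:
        (skill_dir / "references" / ref).touch()

scaffold_skill(Path("."), "blog-writing",
               ["brand-voice.md", "seo-checklist.md", "example-output.md"])
```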
How to Build a Claude Code Skill: Step-by-Step
Here’s the process for building a skill from scratch. These steps apply whether you’re automating content creation, code review, research, or any other repeatable task.
Step 1: Pick One Specific Task
Resist the urge to build a “general writing skill” or a “research skill.” Skills work best when they’re narrow.
Ask yourself: what is the exact output this skill should produce? If you can’t answer that in one sentence, the scope is too broad.
Good: “Write a LinkedIn post from a blog article summary.” Bad: “Help with content.”
A narrow scope means you can write precise process steps. Vague scope leads to vague steps, which leads to variable outputs.
Step 2: Map the Process Before Writing the Skill
Before opening a text editor, write down the process as you would explain it to a new team member on their first day. What do they need to read first? What do they do next? What does the output look like?
This exercise forces clarity. If you can’t describe the process in plain English, you’re not ready to write the skill yet.
Step 3: Write the Process Steps
Now translate that process into your skill.md file. Follow these rules:
- Use a numbered list. Order matters.
- Each step should be a single, discrete action.
- Reference external files for context (don’t paste content inline).
- Specify the output format in the final step.
Avoid vague language. “Check the content” is not a step. “Check that all headings are in title case and no step contains more than two sentences” is a step.
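A useful litmus test: a specific step can be verified mechanically. As a hypothetical illustration, the heading check above could be expressed as a small script (the title-case rule here is deliberately simplified to "every word starts with a capital letter"):

```python
def headings_in_title_case(markdown: str) -> bool:
    """Return True if every Markdown heading line is in (simplified) title case."""
    for line in markdown.splitlines():
        if line.startswith("#"):
            heading = line.lstrip("#").strip()
            words = [w for w in heading.split() if w[0].isalpha()]
            if not all(w[0].isupper() for w in words):
                return False
    return True

print(headings_in_title_case("# Good Heading\n## Also Good"))  # True
print(headings_in_title_case("# bad heading"))                 # False
```

If you can't imagine writing a check like this for a step, the step is probably too vague to produce consistent behavior.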
Step 4: Create Your Reference Files
Go through your steps and identify every piece of context Claude needs to execute them. Each distinct type of context becomes its own reference file.
Common reference files include:
- `brand-voice.md` — tone, style, banned phrases
- `output-format.md` — exact structure of the deliverable
- `quality-checklist.md` — criteria for a good output
- `examples/` — sample inputs and outputs the agent can pattern-match against
Keep reference files focused. A brand voice file shouldn’t also contain SEO rules. Separate files stay easier to update and debug.
Step 5: Test With Real Inputs
Run the skill with three or four real examples. Look at the outputs critically:
- Did Claude follow every step?
- Is the output format consistent across runs?
- Are there steps that Claude interprets differently each time?
Steps that produce inconsistent behavior are usually too vague. Rewrite them until you’d be satisfied if a stranger followed them.
Step 6: Refine the Reference Files
After your first round of testing, you’ll almost always find that your reference files are either too sparse (missing critical context) or too dense (containing irrelevant information that confuses the agent).
Trim what isn’t needed. Add specificity where the output missed the mark. The goal is for each reference file to answer exactly the questions the corresponding step will ask — and nothing more.
What Goes in skill.md (And What Doesn’t)
This is where most people get it wrong. The instinct is to put everything important in the skill.md file: the process, the context, the rules, the examples, the output format.
Don’t do this.
The skill.md file is for process steps only. Everything else is a reference file. When you mix process and context together, two things happen:
- The file gets long and hard to maintain.
- The agent loses track of the procedural thread.
If you’ve ever given Claude a very long prompt and noticed the output drift from what you asked for near the end, you’ve seen this effect. Context crowding pushes procedural instructions out of attention.
This is also why code scripts often outperform markdown instructions for complex agent tasks — scripts enforce step execution in ways that free-form prose can’t. For highly structured workflows, consider whether a script-based approach fits better than a pure Markdown skill.
The clean separation is:
- skill.md: what to do, in what order
- reference files: what to know while doing it
Writing Process Steps That Actually Work
Good process steps are specific, testable, and sequential. Here’s a framework for evaluating each step you write:
Specific: Could a human follow this step without asking a clarifying question? If not, add more detail.
Testable: Could you look at the output and verify this step was completed? If not, the step is probably too vague.
Sequential: Does this step have to happen at this point in the process? If not, reorder or combine.
Some additional tips:
- Start each step with a verb. “Read,” “Write,” “Check,” “Output.” This makes the procedural nature explicit.
- Include conditional logic where needed. “If the brief includes a target word count, aim for that range. If not, default to 800–1,200 words.”
- Specify what to do when something is missing. “If no examples are provided in `examples/`, use a neutral, professional tone.”
Effective prompt engineering for AI agents follows the same principles as skill writing: precision over length, specificity over generality.
Common Mistakes That Make Skills Brittle
Even well-intentioned skills can fail in predictable ways. There are a few common mistakes that consistently degrade skill performance.
Mixing Context and Process
Putting explanations, background, and rules directly in the skill.md file. The fix: move all context to reference files and point to them from the step that needs them.
Steps That Depend on Implicit Knowledge
Writing a step like “make sure it sounds like us” assumes Claude knows what “us” sounds like. Reference files solve this, but only if the steps actually reference them. Write: “Check tone against the criteria in brand-voice.md.”
No Output Format Specification
If your skill doesn’t specify exactly how the output should be structured, Claude will make it up every time. Add a final step that defines the output format precisely — or point to an output-format.md reference file.
Over-Engineering on the First Pass
Trying to handle every edge case before you’ve even run the skill once. Write the simplest version that handles the common case, test it, then add edge case handling based on what you actually observe.
Skills That Are Too Broad
A skill that tries to do five things will do none of them reliably. One task, one skill. If you need to do five things, build five skills and chain them into a workflow.
How to Structure Reference Files for Maximum Clarity
Reference files are only useful if they’re well-organized. A reference file that’s 2,000 words of undifferentiated prose is almost as bad as no reference file at all.
Follow these conventions:
Use headers. Break reference files into clearly labeled sections so Claude can locate relevant information quickly.
Keep each file focused on one type of context. Brand voice is brand voice. SEO rules are SEO rules. Don’t combine them.
Include negative examples where helpful. “Don’t use phrases like…” is often clearer than “Use phrases like…” alone.
Use a consistent format. If you use bullet lists in one reference file, use bullet lists in all of them. Consistency makes files easier to scan.
Update files when standards change. The advantage of reference files is that you update the file once and every future run of the skill reflects the change. If your brand voice evolves, update brand-voice.md — not every individual skill.
Scaling: How Skills Chain Into Larger Workflows
Individual skills are useful. But the real payoff comes when skills work together.
Chaining skills into end-to-end workflows lets you build processes where the output of one skill becomes the input of the next. A content workflow might look like:
- Research skill → produces a brief with key points and sources
- Outline skill → turns the brief into a structured article plan
- Writing skill → expands the outline into a full draft
- Editing skill → applies brand voice and quality checks
- Distribution skill → formats the article for each target platform
Each skill does one thing well. The chain does something much more powerful than any individual skill could.
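One way to picture the chain, assuming each skill run can be treated as a function from text to text (a simplification of how Claude Code actually invokes skills, with placeholder stage functions standing in for real runs):

```python
# Sketch of a skill chain: each stage's output is the next stage's input.
def research(topic: str) -> str:
    return f"brief: key points about {topic}"

def outline(brief: str) -> str:
    return f"outline built from ({brief})"

def write(plan: str) -> str:
    return f"draft expanded from ({plan})"

def run_pipeline(topic: str) -> str:
    artifact = topic
    for stage in [research, outline, write]:
        artifact = stage(artifact)  # output of one stage feeds the next
    return artifact

print(run_pipeline("claude code skills"))
```

The point of the shape is the contract between stages: because each stage's output format matches the next stage's expected input, you can replace or fix one stage without touching the others.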
This is the architecture described in Claude’s agentic operating system — a set of specialized skills that collaborate to handle complex business workflows end-to-end.
Building skills with chaining in mind from the start means:
- Each skill’s output format matches the next skill’s expected input format.
- Shared context (like brand guidelines) lives in a single reference file that all skills point to.
- Failures are isolated — if one skill in the chain underperforms, you fix that skill without touching the others.
For more on shared brand context across skills, the business brain pattern is worth reading.
Making Skills That Improve Over Time
A skill that produces the same output quality on run 1,000 as it did on run 1 hasn’t learned anything. Skills can be designed to improve.
The mechanism is a learnings loop: after each run, the agent records what worked and what didn’t, and those observations feed back into the skill’s reference files or process steps. Building a learnings loop doesn’t require complex infrastructure — it’s mainly a matter of where you store feedback and how the skill accesses it.
At its simplest, a learnings loop involves:
- A `learnings.md` file that logs successful patterns and failure modes.
- A step in the skill that reads recent learnings before executing.
- A step at the end that appends observations to `learnings.md`.
Over time, the skill accumulates institutional knowledge. Good patterns get reinforced. Bad patterns get documented and avoided.
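The file side of that loop is genuinely simple. A minimal sketch, assuming a plain `learnings.md` log of dated bullet entries (the read-before and append-after steps themselves would live in the skill's step list):

```python
from pathlib import Path
from datetime import date

LEARNINGS = Path("learnings.md")  # illustrative location; put it where the skill can read it

def read_recent_learnings(limit: int = 5) -> list[str]:
    """Return the most recent logged observations, oldest first."""
    if not LEARNINGS.exists():
        return []
    entries = [line for line in LEARNINGS.read_text().splitlines()
               if line.startswith("- ")]
    return entries[-limit:]

def append_learning(observation: str) -> None:
    """Append one dated observation to the log."""
    with LEARNINGS.open("a") as f:
        f.write(f"- {date.today().isoformat()}: {observation}\n")

append_learning("Intros under 80 words tested better")
print(read_recent_learnings())
```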
You can also measure skill quality more formally using binary evaluations — pass/fail tests that check specific, verifiable criteria in the output. Binary evals are more reliable than subjective scoring because they remove interpretation from the assessment. Either the output contains the required keyword or it doesn’t. Either the output is under the word limit or it isn’t.
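Binary evals translate directly into code because each criterion is a pass/fail function. A sketch, assuming the output is plain Markdown text and the keyword and word limit come from the brief:

```python
# Each eval returns True or False -- no scoring, no interpretation.
def contains_keyword(output: str, keyword: str) -> bool:
    return keyword.lower() in output.lower()

def under_word_limit(output: str, limit: int) -> bool:
    return len(output.split()) <= limit

def run_evals(output: str, keyword: str, word_limit: int) -> dict:
    return {
        "keyword_present": contains_keyword(output, keyword),
        "under_word_limit": under_word_limit(output, word_limit),
    }

draft = "Claude Code skills turn ad-hoc prompts into repeatable processes."
print(run_evals(draft, keyword="Claude Code skills", word_limit=1200))
# {'keyword_present': True, 'under_word_limit': True}
```

A suite of checks like this, run after every skill execution, tells you exactly which criterion regressed when an output misses the mark.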
Where Remy Fits in a Skill-Based Workflow
Skills are fundamentally about encoding process into files that agents can read and execute reliably. Remy operates on the same principle at a different level of abstraction.
Where a Claude Code skill encodes a task process in a Markdown file, Remy encodes an entire application in a spec document. The spec describes what the app does — its data model, its logic, its interface — and Remy compiles that into a full-stack application with a real backend, a SQL database, and working auth.
If you’re building tools that your skill-based workflows need to interact with — a client portal, an internal dashboard, a submission form, a reporting system — Remy lets you build those full-stack apps from a structured spec rather than wiring up infrastructure manually.
The connection is real: skills need something to act on. Remy builds the things skills act on. A content team running a 5-skill content pipeline still needs somewhere to store briefs, drafts, and published pieces. That’s an app. And building that app from a spec is faster and more maintainable than building it from code.
You can try Remy at mindstudio.ai/remy.
Frequently Asked Questions
What is a Claude Code skill?
A Claude Code skill is a Markdown file that defines a reusable process for an AI agent. It tells Claude how to approach a specific type of task — the steps to follow, the files to reference, and the format of the expected output. Skills live in your project directory and are loaded when Claude executes the relevant task.
How is a skill different from a system prompt?
A system prompt sets general behavior for a session — tone, persona, constraints. A skill is a specific process document for a specific task. Skills are more granular and more reusable. You can have many skills in one project, each handling a different type of work, while the system prompt (often stored in a claude.md file) sets the overall operating context.
How long should a skill.md file be?
As short as possible while remaining complete. A well-written skill.md is usually 10–20 numbered steps. If it’s getting much longer than that, you’ve likely mixed process with context. Move the context to reference files and keep the skill.md focused on the procedural sequence.
Can I use someone else’s skill as a starting point?
Yes. The Claude Code Skills Marketplace offers pre-built skills for common tasks. You can install one and modify it to match your specific requirements — adjusting steps, swapping in your own reference files, or adding steps that reflect your team’s standards.
What’s the right way to handle edge cases in a skill?
Add conditional logic directly in the relevant step. Something like: “If the input is under 200 words, skip the outline step and proceed directly to the draft.” Don’t create a separate skill for every edge case — that leads to a proliferation of nearly-identical skills that are hard to maintain. Handle edge cases inline, in plain English.
How many skills should a project have?
There’s no fixed number, but a good rule of thumb is: one skill per distinct type of output. If two tasks produce the same kind of output, they can probably share a skill. If two tasks produce different kinds of output, they should be separate skills. Most production workflows end up with 5–15 skills covering the full range of tasks the agent handles.
Key Takeaways
- Claude Code skills are reusable process documents — SOPs for your AI agent — stored as Markdown files in your project directory.
- The `skill.md` file should contain only process steps. All context belongs in separate reference files.
- Good process steps are specific, testable, and sequential. Each step should start with a verb and reference the files Claude needs to complete it.
- Context rot degrades skill performance. Keep files focused and lean.
- Skills chain together into larger workflows, with each skill’s output serving as the next skill’s input.
- Skills can improve over time through learnings loops and binary evaluations.
- If your skill-based workflows need full-stack tools to act on, try Remy — it compiles a structured spec into a real backend, database, and frontend, without manual infrastructure setup.