
How to Build a Self-Improving Marketing Skill with Claude Code and Eval.json

Create an eval.json with binary assertions, set up an autonomous improvement loop, and let Claude Code refine your marketing copywriting skill overnight.

MindStudio Team

Why Most AI Marketing Skills Stay Mediocre

Marketing teams are shipping AI-generated copy faster than ever, but there’s a catch: output is only as good as the prompt, and most prompts never get systematically improved after the first draft. Someone eyeballs a few outputs, tweaks a line or two, and calls it done.

That’s not optimization — it’s guessing.

Using Claude Code and a structured eval.json file with binary assertions, you can set up an autonomous improvement loop that tests your marketing copywriting skill against real criteria, identifies exactly what’s failing, modifies the skill’s prompt, and repeats — all without you in the loop. You define success once; Claude Code does the iteration work overnight.

This guide walks through building that system from scratch: creating the marketing skill, writing the eval file, configuring Claude Code for autonomous operation, and reading the results in the morning.


What eval.json Is and Why Binary Assertions Work

An eval file is a structured set of test cases for your AI skill. Each case includes an input and a list of assertions — conditions the output must satisfy to pass.

The key word is binary. Each assertion either passes or fails. There’s no “pretty good” or “7 out of 10.” This matters because:

  • Claude Code can act on binary results. If an assertion fails, it knows exactly what to fix. A vague score leaves it guessing.
  • Progress is trackable. You can count failures across iterations and watch the skill improve.
  • Results are reproducible. The same output run against the same assertion always returns the same verdict.

The eval.json format isn’t a proprietary spec — it’s a pattern you define. Your eval runner and Claude Code both read the same file. The structure is flexible; the principle isn’t.

Binary Assertions vs. Rubrics

Rubric-based evals ask “is this copy engaging, on a scale of 1–5?” Binary assertions ask “does this copy contain a call-to-action?” The first requires judgment; the second is checkable.

For marketing copy, binary assertions map cleanly to real quality criteria:

  • Is the copy under 150 words?
  • Does it include a subject line?
  • Does it contain a CTA?
  • Does it mention the product name?
  • Does it avoid buzzwords like “synergy” or “seamlessly”?

Some criteria are inherently subjective — “does this feel natural?” — but you can still make them binary by using a secondary LLM judge. The judge prompt returns only “pass” or “fail,” keeping the assertion structure clean. Anthropic’s guidance on building evals for Claude covers this pattern in more detail.
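A minimal sketch of that judge pattern, in Python. The function names (`judge_verdict`, `run_llm_judge`) and the injected `call_model` callable are illustrative, not part of any official SDK; the point is that anything other than a literal "pass" or "fail" from the judge is treated as an error rather than interpreted.

```python
def judge_verdict(raw_reply: str) -> bool:
    """Normalize an LLM judge's reply to a strict binary verdict."""
    text = raw_reply.strip().lower().rstrip(".")
    if text not in ("pass", "fail"):
        # A non-binary reply is a failure of the judge, not of the copy.
        raise ValueError(f"non-binary judge reply: {raw_reply!r}")
    return text == "pass"

def run_llm_judge(call_model, judge_prompt: str, copy_output: str) -> bool:
    """call_model is any function that sends a prompt to a model and returns its text."""
    reply = call_model(f"{judge_prompt}\n\nCopy to evaluate:\n{copy_output}")
    return judge_verdict(reply)
```

Injecting `call_model` keeps the assertion logic testable without a network call; in a real runner it would wrap your model provider's API.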


Build the Marketing Copywriting Skill First

Before you can evaluate a skill, you need one. A “skill” here means a system prompt plus a defined input/output contract — something that takes structured inputs and produces marketing copy.

Define Your Inputs and Outputs

Start by deciding what the skill takes in and what it produces. For a cold email generator, a reasonable contract looks like this:

Inputs:

  • product_name — name of the product
  • product_description — one-sentence description
  • target_audience — who the email is for
  • pain_point — the problem the product solves
  • copy_type — email, social post, ad headline, etc.
  • max_words — word limit for the output

Output:

  • Cold email: subject line + email body
  • Social post: platform-specific copy with hashtags
  • Ad: headline + description

Keeping inputs structured makes your evals easier to write. Deterministic inputs mean deterministic test cases.
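One way to pin that contract down is a small typed structure. This is a sketch, not required by any tool; the field names simply mirror the bullet list above, and `CopyRequest` is a hypothetical name.

```python
from dataclasses import dataclass, asdict

@dataclass
class CopyRequest:
    """Structured input contract for the marketing copywriting skill."""
    product_name: str
    product_description: str
    target_audience: str
    pain_point: str
    copy_type: str   # "cold_email", "linkedin_post", "ad_headline", ...
    max_words: int

req = CopyRequest(
    product_name="TaskFlow",
    product_description="Project management tool built for remote engineering teams",
    target_audience="engineering managers at 50-200 person tech companies",
    pain_point="sprint reviews that run long",
    copy_type="cold_email",
    max_words=150,
)
# asdict(req) yields the same dict shape used in eval.json test cases.
```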

Write the Initial System Prompt

Your starting prompt doesn’t need to be perfect — that’s what the loop is for. Write something functional:

You are a B2B marketing copywriter. Write concise, persuasive copy that speaks directly to the target audience's pain points.

When writing cold emails:
- Keep the subject line under 10 words
- Open with the prospect's pain point, not a compliment or self-introduction
- Mention the product by name exactly once
- Include a single, clear CTA in the final sentence
- Stay under {max_words} words total

Avoid: buzzwords, passive voice, vague claims, and filler phrases.

Input:
Product: {product_name}
Description: {product_description}
Audience: {target_audience}
Pain point: {pain_point}

Save this as skill_prompt.txt. Claude Code will read and modify this file during the improvement loop.
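The `{placeholder}` syntax above works directly with Python's `str.format`. A minimal sketch of filling the template, using an inline excerpt of the prompt (in practice you would read the full file with `open("skill_prompt.txt").read()`):

```python
template = (
    "Stay under {max_words} words total.\n\n"
    "Input:\n"
    "Product: {product_name}\n"
    "Description: {product_description}\n"
    "Audience: {target_audience}\n"
    "Pain point: {pain_point}\n"
)

inputs = {
    "product_name": "TaskFlow",
    "product_description": "Project management tool built for remote engineering teams",
    "target_audience": "engineering managers",
    "pain_point": "sprint reviews that run long",
    "max_words": 150,
}

filled = template.format(**inputs)
```

One caveat: if the prompt file ever contains literal braces (an example JSON snippet, say), `str.format` will choke on them; escape them as `{{ }}` or switch to `string.Template`.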


Write Your eval.json with Binary Assertions

Now write the evaluation file. Each test case exercises a specific scenario. Each assertion checks a specific condition.

Here’s a complete example:

{
  "version": "1.0",
  "skill": "marketing_copywriter",
  "skill_prompt_file": "skill_prompt.txt",
  "test_cases": [
    {
      "id": "tc_001",
      "name": "SaaS cold email for engineering managers",
      "input": {
        "product_name": "TaskFlow",
        "product_description": "Project management tool built for remote engineering teams",
        "target_audience": "engineering managers at 50–200 person tech companies",
        "pain_point": "sprint reviews that run long because tasks aren't tracked in one place",
        "copy_type": "cold_email",
        "max_words": 150
      },
      "assertions": [
        {
          "id": "a1",
          "type": "word_count_max",
          "value": 150,
          "description": "Total copy must be 150 words or fewer"
        },
        {
          "id": "a2",
          "type": "contains_any",
          "patterns": ["schedule a demo", "book a call", "try free", "get started", "sign up"],
          "case_insensitive": true,
          "description": "Copy must include a recognizable CTA"
        },
        {
          "id": "a3",
          "type": "contains_string",
          "value": "TaskFlow",
          "case_insensitive": false,
          "description": "Copy must mention the product name"
        },
        {
          "id": "a4",
          "type": "not_contains_any",
          "patterns": ["synergy", "leverage", "unlock", "revolutionize", "game-changing", "seamlessly"],
          "case_insensitive": true,
          "description": "Copy must not use buzzwords"
        },
        {
          "id": "a5",
          "type": "has_field",
          "field": "subject_line",
          "description": "Output must include a subject line"
        },
        {
          "id": "a6",
          "type": "llm_judge",
          "prompt": "Does this cold email open with the prospect's pain point rather than a compliment or self-introduction? Answer only 'pass' or 'fail'.",
          "description": "Email must lead with pain point, not self-introduction"
        }
      ]
    },
    {
      "id": "tc_002",
      "name": "LinkedIn post for ecommerce founders",
      "input": {
        "product_name": "TaskFlow",
        "product_description": "Project management tool built for remote engineering teams",
        "target_audience": "ecommerce founders with in-house dev teams",
        "pain_point": "dev projects slipping past launch dates",
        "copy_type": "linkedin_post",
        "max_words": 200
      },
      "assertions": [
        {
          "id": "b1",
          "type": "word_count_max",
          "value": 200,
          "description": "LinkedIn post must be 200 words or fewer"
        },
        {
          "id": "b2",
          "type": "word_count_min",
          "value": 80,
          "description": "LinkedIn post must be at least 80 words"
        },
        {
          "id": "b3",
          "type": "contains_string",
          "value": "TaskFlow",
          "case_insensitive": false,
          "description": "Post must mention product name"
        },
        {
          "id": "b4",
          "type": "not_contains_any",
          "patterns": ["synergy", "leverage", "unlock", "revolutionize"],
          "case_insensitive": true,
          "description": "Post must avoid buzzwords"
        },
        {
          "id": "b5",
          "type": "llm_judge",
          "prompt": "Does this LinkedIn post end with a question or a clear CTA that invites engagement? Answer only 'pass' or 'fail'.",
          "description": "Post must close with an engagement hook"
        }
      ]
    }
  ]
}

Assertion Types That Work Well for Marketing Copy

A practical set of assertion types covers most marketing quality requirements:

  • word_count_max: output is at or under a word limit
  • word_count_min: output meets a minimum length
  • contains_string: a specific string appears in the output
  • contains_any: at least one item from a list appears
  • not_contains_any: none of a list of strings appear
  • has_field: the output JSON includes a required field
  • regex_match: the output matches a regex pattern
  • llm_judge: a secondary LLM returns “pass” or “fail”

Start with the seven deterministic types. Add llm_judge assertions only for criteria that can’t be captured with string matching — each LLM judge call adds latency and cost to your eval loop.
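The deterministic types each reduce to a few lines of Python. A sketch of a single dispatch function, assuming the skill's output is a dict of text fields (like the `subject_line` + body shape used in the eval file above); the function name is illustrative.

```python
import re

def check_assertion(a: dict, output: dict) -> bool:
    """Return True if `output` satisfies one assertion from eval.json."""
    # Concatenate all text fields into one searchable string.
    text = " ".join(str(v) for v in output.values())
    haystack = text.lower() if a.get("case_insensitive") else text
    kind = a["type"]
    if kind == "word_count_max":
        return len(text.split()) <= a["value"]
    if kind == "word_count_min":
        return len(text.split()) >= a["value"]
    if kind == "contains_string":
        needle = a["value"].lower() if a.get("case_insensitive") else a["value"]
        return needle in haystack
    if kind == "contains_any":
        return any((p.lower() if a.get("case_insensitive") else p) in haystack
                   for p in a["patterns"])
    if kind == "not_contains_any":
        return not any((p.lower() if a.get("case_insensitive") else p) in haystack
                       for p in a["patterns"])
    if kind == "has_field":
        return a["field"] in output
    if kind == "regex_match":
        return re.search(a["pattern"], text) is not None
    raise ValueError(f"unknown assertion type: {kind}")
```

Because every branch returns a plain boolean, the same function serves both the eval runner and any spot checks you run by hand.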

How Many Test Cases Do You Need?

For an overnight run, 5–10 test cases is a good starting range. Too few and the loop over-optimizes for narrow scenarios. Too many and iterations take too long.

Cover at minimum:

  • Different copy formats (email, social, ad headline)
  • Different audience types
  • At least one edge case (very tight word limit, complex pain point)

Set Up Claude Code for Autonomous Operation

Claude Code is Anthropic’s agentic coding tool. It runs in your terminal, reads and writes files, executes scripts, and, with the right flags, works through multi-step tasks without needing confirmation at each step. For this workflow, it’s the engine that runs the improvement loop.

Prerequisites

You’ll need:

  • Claude Code installed (npm install -g @anthropic-ai/claude-code)
  • An Anthropic API key with enough credits for an overnight run
  • A working directory containing skill_prompt.txt, eval.json, and an eval runner script

The eval runner is a small script — Python or JavaScript — that takes skill_prompt.txt and eval.json, calls your AI model with each test case, checks each assertion, and writes results to a JSON file. Claude Code can write this script for you if you ask it to.

A simple runner outputs results like this:

{
  "run_id": "run_003",
  "timestamp": "2025-01-15T02:14:22Z",
  "total_assertions": 11,
  "passed": 8,
  "failed": 3,
  "failures": [
    {
      "test_case": "tc_001",
      "assertion_id": "a2",
      "description": "Copy must include a recognizable CTA",
      "actual_output_snippet": "...reach out if you want to learn more."
    },
    {
      "test_case": "tc_001",
      "assertion_id": "a6",
      "description": "Email must lead with pain point",
      "judge_reasoning": "Email opens with 'I wanted to introduce TaskFlow...' — a self-introduction, not a pain point."
    }
  ]
}
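The aggregation step of such a runner can be sketched in a few lines. This version takes the parsed eval spec as a dict and injects two callables — `generate` (calls your model with a test case's input) and `check` (evaluates one assertion) — so the loop structure stays visible without committing to a specific model provider. All names here are illustrative.

```python
from datetime import datetime, timezone

def run_evals(spec: dict, generate, check) -> dict:
    """generate(input_dict) -> output dict; check(assertion, output) -> bool."""
    results = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "total_assertions": 0,
        "passed": 0,
        "failed": 0,
        "failures": [],
    }
    for tc in spec["test_cases"]:
        output = generate(tc["input"])
        for a in tc["assertions"]:
            results["total_assertions"] += 1
            if check(a, output):
                results["passed"] += 1
            else:
                results["failed"] += 1
                results["failures"].append({
                    "test_case": tc["id"],
                    "assertion_id": a["id"],
                    "description": a.get("description", ""),
                })
    return results
```

Writing `json.dumps(results, indent=2)` to a results file after each run gives Claude Code the exact structure shown above to read.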

The Improvement Prompt That Drives the Loop

This is the most important piece. The prompt you give Claude Code tells it how to iterate and when to stop. Here’s a reliable version:

You are improving a marketing copywriting skill prompt.

Files in this directory:
- skill_prompt.txt: the current skill prompt
- eval.json: test cases with binary assertions
- run_evals.py: the evaluation runner script

Your task:
1. Run `python run_evals.py` to execute all test cases.
2. Read the output. Identify every failing assertion.
3. Analyze why each failure occurred based on the current skill prompt.
4. Modify skill_prompt.txt to address the failures. Make targeted changes only — do not rewrite the entire prompt.
5. Re-run the evals.
6. Repeat steps 2–5 until all assertions pass, or until you reach 25 iterations.

After each iteration, append a summary to improvement_log.jsonl with:
- iteration number
- number of assertions passed and failed
- list of assertion IDs that failed
- brief description of the change you made

Rules:
- Never modify eval.json or run_evals.py.
- Never fabricate eval results — always run the script.
- If the same assertions fail for 5 consecutive iterations, try a fundamentally different approach to that section of the prompt.
- If all assertions pass, stop immediately and write a final summary to improvement_summary.md.

Save this as TASK.md in your project folder.

Setting Stopping Conditions

Always give Claude Code a hard stop. The 25-iteration limit prevents runaway cost if the skill hits a wall it can’t climb on its own. Tune this based on your budget for an overnight run.

A typical run using Claude Sonnet costs roughly $2–8 for 20–25 iterations across 5–10 test cases, depending on output length. Larger eval sets cost proportionally more.


Run the Loop Overnight

With your files in place, starting the loop is one command:

claude --dangerously-skip-permissions "Read TASK.md and execute the instructions in it."

The --dangerously-skip-permissions flag lets Claude Code write files and run shell commands without asking for confirmation at each step. This is what makes the loop truly autonomous. Only use it in a sandboxed project directory you control.

What Happens Each Iteration

A typical cycle:

  1. Runs python run_evals.py — generates pass/fail results per assertion
  2. Reads the failure list — identifies which assertions failed and what the output looked like
  3. Opens skill_prompt.txt — reads the current state of the prompt
  4. Makes a targeted edit — changes the specific instruction that caused the failure
  5. Writes the updated prompt back to skill_prompt.txt
  6. Appends to improvement_log.jsonl — records what changed and the new assertion counts
  7. Loops back to step 1

Early iterations typically show fast improvement — going from 6/11 to 9/11 in the first few cycles. Later iterations slow down as remaining failures tend to be harder to fix without causing regressions elsewhere.

Reading the Results

When you check in the next morning, start with improvement_summary.md. It tells you whether all assertions passed, or where the loop ended if it hit the iteration cap.

Then read improvement_log.jsonl. The iteration history shows how the skill evolved — which prompt changes worked, which caused regressions, and which assertions were hardest to satisfy. This log is often more valuable than the final prompt itself.

The final skill_prompt.txt is your improved skill. If you initialized a git repo in the project directory, you can diff the result against your original:

git diff HEAD skill_prompt.txt

Where MindStudio Fits This Workflow

The eval-driven loop described above is effective for refining prompts. But once you have a polished skill_prompt.txt, you still need somewhere to deploy it — somewhere your marketing team can actually use it, trigger it from other tools, or connect it to HubSpot, Notion, or your CMS.

That’s where MindStudio comes in.

MindStudio’s Agent Skills Plugin (@mindstudio-ai/agent) is an npm SDK that lets Claude Code — or any AI agent — call deployed MindStudio workflows as simple method calls. This creates a clean end-to-end architecture:

  • Your marketing copywriting skill lives in MindStudio as a workflow, with your refined prompt, input handling, and output formatting already in place
  • Claude Code calls it via agent.runWorkflow() during the eval loop, testing the deployed version rather than a local file
  • When the loop finishes, the improved skill is already live in MindStudio and accessible to your team via a web UI, a Slack command, or a HubSpot integration

MindStudio handles infrastructure concerns — rate limiting, retries, auth — so Claude Code can focus on the reasoning work: reading failures, modifying prompts, iterating.

You can also use MindStudio to build the eval runner itself. A MindStudio background agent configured to run on a schedule can execute your full eval suite regularly, push results to a Google Sheet or Notion database, and alert your team when the skill’s pass rate drops — turning a one-time optimization into ongoing quality monitoring.

Try MindStudio free at mindstudio.ai.


Frequently Asked Questions

What is an eval.json file in the context of AI skills?

An eval.json file is a structured test suite for an AI skill or prompt. It defines test cases — each with specific inputs and a list of assertions the output must satisfy. In a self-improvement workflow, Claude Code reads this file, generates outputs using the current skill prompt, checks each assertion, and uses the pass/fail results to modify the prompt. The process repeats until all assertions pass or the iteration limit is reached.

Why use binary assertions instead of scoring or rubrics?

Binary assertions are more useful for autonomous improvement because they’re unambiguous. When an assertion fails, the failure is specific and actionable — Claude Code knows what the output lacked and can target the relevant part of the prompt. Rubric-based scoring produces a number but not a direction. “This copy scored 6/10” doesn’t tell an agent what to change. Binary assertions also produce consistent results across runs, which matters for tracking progress over many iterations.

How long does a typical improvement loop take to run?

It depends on the size of your eval set, the model you’re using, and the iteration cap you set. With 5–10 test cases, 5–8 assertions each, and a 20-iteration limit, a full run typically takes 2–4 hours using Claude Sonnet. Running it overnight means you wake up to a finished result without waiting. Larger eval sets can push the runtime to 6–8 hours, which still fits an overnight window.

Can this approach work for AI skills beyond marketing copy?

Yes. The eval.json improvement loop works for any text-based AI skill where you can define binary pass/fail criteria. Practical examples include: customer support response drafting, email subject line generation, product description writing, SEO meta description generation, and sales email personalization. The core requirement is that quality criteria can be expressed as checkable conditions rather than pure subjective taste.

How do I prevent Claude Code from breaking passing assertions while fixing failing ones?

Two practices help here. First, make sure your eval runner reports all results — not just failures — so Claude Code can see immediately when a previously passing assertion flips to failing after a change. Second, instruct Claude Code in the improvement prompt to make targeted, minimal edits rather than rewriting the full skill prompt. If regressions keep occurring, add a hard rule: “If any previously passing assertion fails after your edit, revert the change and try a different approach.”
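That regression check can be made mechanical by diffing failure sets between consecutive runs. A sketch, assuming results dicts shaped like the runner output shown earlier in this article (a `failures` list whose entries carry an `assertion_id`); the function name is hypothetical.

```python
def new_regressions(prev_results: dict, curr_results: dict) -> list:
    """Assertion IDs that fail in the current run but passed previously."""
    prev = {f["assertion_id"] for f in prev_results["failures"]}
    curr = {f["assertion_id"] for f in curr_results["failures"]}
    return sorted(curr - prev)
```

A non-empty return value is the signal to revert the last prompt edit before continuing.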

Is it safe to run Claude Code autonomously overnight with --dangerously-skip-permissions?

The flag is safe when you run Claude Code inside an isolated project directory that contains only the files relevant to your eval loop. Best practice: create a dedicated folder for the project, initialize a git repo so every file change is tracked, and don’t put sensitive credentials or unrelated files in it. The flag removes interactive confirmation prompts — it doesn’t grant system-level access beyond what your user account already has. Claude Code works within the directory you give it, not across your entire machine.


Key Takeaways

  • eval.json is a test suite for your AI skill. Define test cases with structured inputs and binary assertions to give Claude Code precise, actionable feedback.
  • Binary assertions work because they’re specific. Pass/fail conditions give Claude Code a direction to move, not just a score.
  • The loop structure is simple. Run evals → read failures → modify prompt → repeat. Claude Code handles all of it autonomously.
  • Set a hard iteration cap. A stopping condition keeps overnight runs predictable and prevents unnecessary cost.
  • Deployment closes the loop. A refined prompt is only useful when it’s accessible. Pairing this workflow with a platform like MindStudio means the improved skill is immediately available to your team and connected to the tools they already use.