How to Use Claude Code Skills 2.0: Built-In Evaluation and A/B Testing for AI Workflows
Skills 2.0 adds structured evaluation to Claude Code. Learn how to score your skills against specific criteria, run parallel tests, and iterate faster.
Why AI Workflow Quality Is Mostly Guesswork Right Now
If you’ve built more than a few AI workflows, you’ve felt this: you tweak a prompt, run it again, and decide the output “seems better.” But you can’t prove it. You don’t have a score. You don’t have a baseline. And if someone asks whether version two is actually an improvement over version one, you’re going on instinct.
This is the default state of AI workflow development. Most teams iterate by eyeballing outputs, maybe getting a second opinion, and shipping when it feels ready. That approach works well enough to get something functional out the door. It breaks down when you need to scale your workflow library, keep quality consistent across a team, or make a defensible decision between two competing approaches.
Claude Code Skills 2.0 is a direct response to that problem. It adds structured evaluation and A/B testing to the Skills development cycle — so instead of judging quality by feel, you’re measuring it against criteria you define, with scores you can track over time. This guide covers how the system works, how to set it up, how to run parallel comparisons, and how to build evaluation into your regular workflow development process.
Understanding Claude Code Skills: The Foundation
Before getting into what’s new in Skills 2.0, it helps to understand what the Skills system is and why evaluation becomes important once you’re working with it at any scale.
Claude Code is Anthropic’s agentic coding tool — a terminal-based AI that operates directly in your development environment. It reads files, writes code, executes commands, and completes multi-step engineering tasks autonomously. Unlike a chat interface, Claude Code takes on actual work in your codebase rather than just offering suggestions.
The Skills layer sits on top of this. MindStudio’s Agent Skills Plugin — the npm package @mindstudio-ai/agent — extends Claude Code (and other AI agents) with a library of typed, callable workflow capabilities. Instead of Claude Code having to figure out how to search the web, send an email, generate an image, or pull data from a CRM, it can call agent.searchGoogle(), agent.sendEmail(), agent.generateImage(), or agent.runWorkflow() and get structured results back.
The architecture separates concerns cleanly. Claude Code handles reasoning and decision-making. Skills handle execution. The SDK handles infrastructure — rate limiting, retries, authentication. Everyone focuses on what they’re good at.
The Skills Library and Why It Grows
Once you start using Skills, you tend to accumulate them. You build a skill for summarizing support tickets. Another for drafting product updates. Another for classifying inbound leads. Each one is a workflow in MindStudio that gets exposed as a callable method to your agent.
With a handful of skills, quality management is manageable. You know each one, you’ve tested it manually, you have a rough sense of how it performs. But at twenty or thirty skills — especially across a team — that informal awareness breaks down. You start relying on whoever built each skill to vouch for its quality. Updates happen and nobody knows if they made things better or worse.
This is where evaluation stops being a nice-to-have and becomes a genuine need.
What Skills 2.0 Adds
Skills 2.0 extends the original execution-focused system with a measurement layer. The additions are:
- Structured evaluation — define rubrics with weighted criteria; skills are scored against them automatically
- A/B testing — run two or more skill versions against the same inputs in parallel and compare scores
- Version tracking — skills are versioned, and evaluation scores are stored per version
- Batch evaluation — run a skill against a full test dataset and aggregate the results
- Production monitoring — sample live traffic and evaluate it against your rubric on an ongoing basis
- Pre-built criteria library — common quality dimensions (accuracy, completeness, tone, format compliance) available out of the box
All of this is accessible through the same @mindstudio-ai/agent SDK, with evaluation methods alongside the existing capability methods.
How Built-In Evaluation Works
The evaluation system in Skills 2.0 is built around two concepts: criteria and rubrics. Getting these right determines how useful your evaluation scores will be.
Criteria: Defining What “Good” Means
A criterion is a single, specific quality dimension. Each criterion has:
- A name — a short identifier you’ll use in code and results
- A type — binary (pass/fail), scored (numeric scale), or reference-based (compared against a reference output)
- A weight — how much this criterion contributes to the overall score
- A description — a plain-language explanation used by the evaluation model when judging outputs
The description is the most important part. The evaluation model reads it and uses it to decide how to score each output. Vague descriptions produce inconsistent scores. Specific descriptions produce reliable ones.
Binary criteria work for objective, testable conditions:
- “Does the output contain a valid JSON object matching the required schema?”
- “Is the response in English?”
- “Is the output fewer than 200 words?”
Scored criteria (typically 1–5 or 1–10) work for dimensions that exist on a spectrum:
- Accuracy of information
- Completeness of coverage
- Tone match with brand guidelines
- Helpfulness of the response
Reference-based criteria compare the output against a gold-standard example you provide. These are useful when you have a high-quality human-written example and want to measure how closely the skill’s output matches it — in meaning, structure, or specific content.
Building a Rubric
A rubric groups multiple criteria with weights that sum to 1. The weighted average of all criterion scores gives you the overall score for a single evaluation run.
Here’s an example rubric for a skill that summarizes customer support tickets:
```json
{
  "rubric": {
    "name": "support_ticket_summary",
    "version": "1.0",
    "criteria": [
      {
        "name": "accuracy",
        "type": "scored",
        "scale": 5,
        "weight": 0.40,
        "description": "The summary accurately reflects what the customer reported. It does not add information that wasn't in the original ticket, and it doesn't misrepresent the customer's problem or sentiment."
      },
      {
        "name": "completeness",
        "type": "scored",
        "scale": 5,
        "weight": 0.25,
        "description": "The summary includes all significant issues the customer raised. No major complaint, feature request, or question from the original ticket is omitted."
      },
      {
        "name": "actionability",
        "type": "scored",
        "scale": 5,
        "weight": 0.20,
        "description": "The summary is written in a way that gives the support agent enough context to act. It identifies the core problem clearly and includes any relevant context (account type, steps taken, error messages mentioned)."
      },
      {
        "name": "conciseness",
        "type": "scored",
        "scale": 5,
        "weight": 0.10,
        "description": "The summary is appropriately brief. No filler sentences, no repetition of information already stated, no tangential context that doesn't help the support agent."
      },
      {
        "name": "valid_output_schema",
        "type": "binary",
        "weight": 0.05,
        "description": "The output is valid JSON that matches the expected schema: { summary: string, priority: 'low' | 'medium' | 'high', category: string }"
      }
    ]
  }
}
```
The weights reflect what matters most for this use case. Accuracy is heavily weighted because a summary that misrepresents the ticket is worse than useless — it leads support agents in the wrong direction. Schema validity gets a small weight because it’s either fully correct or completely broken; partial credit isn’t meaningful here.
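To make the weighting concrete, here is a hypothetical re-implementation of the overall-score math. The binary-to-numeric mapping (a pass counts as full marks, a fail as zero) and the final 5-point reporting scale are assumptions about internals the SDK doesn't expose, so treat this as a sketch of the idea rather than the platform's exact arithmetic:

```javascript
// Hypothetical sketch of the weighted-average math behind an overall score.
// Not the SDK's implementation: the binary-to-numeric mapping (pass = full
// marks, fail = 0) and the 5-point reporting scale are assumptions.
function overallScore(criteria, scores) {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  if (Math.abs(totalWeight - 1) > 1e-9) {
    throw new Error(`Rubric weights must sum to 1, got ${totalWeight}`);
  }
  return criteria.reduce((sum, c) => {
    const raw = scores[c.name];
    // Normalize every criterion to a 0-1 fraction before weighting.
    const fraction = c.type === 'binary' ? (raw ? 1 : 0) : raw / c.scale;
    return sum + c.weight * fraction;
  }, 0) * 5; // report on the same 5-point scale the scored criteria use
}

const criteria = [
  { name: 'accuracy', type: 'scored', scale: 5, weight: 0.40 },
  { name: 'completeness', type: 'scored', scale: 5, weight: 0.25 },
  { name: 'actionability', type: 'scored', scale: 5, weight: 0.20 },
  { name: 'conciseness', type: 'scored', scale: 5, weight: 0.10 },
  { name: 'valid_output_schema', type: 'binary', weight: 0.05 }
];
const scores = {
  accuracy: 4.5, completeness: 4.0, actionability: 4.2,
  conciseness: 3.8, valid_output_schema: true
};
console.log(overallScore(criteria, scores).toFixed(2)); // "4.27" under these assumptions
```

A useful side effect of writing this out: the weight-sum check catches a common rubric-authoring mistake (weights that drift away from 1.0 after edits) before it silently skews every score.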
Registering and Using the Rubric
Rubrics are registered once and then referenced by ID in evaluation calls. Here’s how that looks in code:
```javascript
const { MindStudio } = require('@mindstudio-ai/agent');

async function setupEvaluation() {
  const agent = new MindStudio({ apiKey: process.env.MINDSTUDIO_API_KEY });

  const rubric = await agent.registerRubric({
    name: 'support_ticket_summary',
    criteria: [
      {
        name: 'accuracy',
        type: 'scored',
        scale: 5,
        weight: 0.40,
        description: 'The summary accurately reflects what the customer reported...'
      },
      // ... remaining criteria
    ]
  });

  console.log('Rubric registered:', rubric.id);
  return rubric.id;
}
```
Once you have the rubric ID, you can run evaluations:
```javascript
async function evaluateSingleRun(rubricId, skillId) {
  const agent = new MindStudio({ apiKey: process.env.MINDSTUDIO_API_KEY });

  const result = await agent.evaluateSkill({
    skillId: skillId,
    input: {
      ticketText: "Hi, I've been trying to cancel my subscription for three days. I click the cancel button in settings but nothing happens. I've tried Chrome and Firefox. My account email is [redacted]. Please help.",
      ticketId: 'TKT-4821'
    },
    rubricId: rubricId,
    options: {
      evaluatorModel: 'claude-3-5-sonnet' // optional — defaults to current Claude version
    }
  });

  console.log('Skill output:', result.output);
  console.log('Scores:', result.scores);
  // {
  //   accuracy: 4.5,
  //   completeness: 4.0,
  //   actionability: 4.2,
  //   conciseness: 3.8,
  //   valid_output_schema: true,
  //   overall: 4.2
  // }

  return result;
}
```
The result.output gives you the actual skill output. The result.scores gives you both individual criterion scores and the weighted overall. Run this against a few typical inputs to get a quick read on how the skill is performing.
For a statistically meaningful picture, you need batch evaluation.
Batch Evaluation: Getting a Reliable Baseline
A single evaluation run tells you how the skill handled one input. A batch run tells you how it handles your full range of typical inputs. The batch result gives you averages, score distributions, and failure rates — the data you need to set a credible baseline and track improvement over time.
```javascript
async function runBatchEvaluation(rubricId, skillId, testDataset) {
  const agent = new MindStudio({ apiKey: process.env.MINDSTUDIO_API_KEY });

  const batchResult = await agent.batchEvaluate({
    skillId: skillId,
    inputs: testDataset, // array of input objects
    rubricId: rubricId,
    options: {
      parallel: true,
      maxConcurrency: 5,
      storeResults: true // saves results to MindStudio for later review
    }
  });

  console.log('Batch evaluation summary:', batchResult.summary);
  // {
  //   totalRuns: 50,
  //   completedRuns: 49,
  //   failedRuns: 1,
  //   averageScores: {
  //     accuracy: 4.1,
  //     completeness: 3.8,
  //     actionability: 4.0,
  //     conciseness: 3.9,
  //     valid_output_schema: 0.96,
  //     overall: 4.0
  //   },
  //   standardDeviations: {
  //     overall: 0.4
  //   },
  //   distribution: {
  //     '1.0-2.0': 1,
  //     '2.0-3.0': 3,
  //     '3.0-4.0': 18,
  //     '4.0-5.0': 27
  //   }
  // }

  return batchResult;
}
```
A few things to pay attention to in this output:
The standardDeviations.overall of 0.4 indicates modest variance — most runs are clustering around the 4.0 average. A standard deviation above 0.8 would signal an inconsistent skill that performs very well on some inputs and poorly on others.
The distribution breakdown shows where scores cluster. If you see a lot of runs in the 3.0–4.0 range and a few outliers below 2.0, look at those low-scoring inputs specifically. They usually share a pattern — a type of input the skill handles poorly.
The failedRuns: 1 means the skill errored on one input (not just scored low — actually returned an error). Consistent failure rates above 2–3% warrant investigation before relying on the skill in production.
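The mean, standard deviation, and distribution buckets are straightforward to recompute from a plain array of per-run overall scores, which is handy when results are exported or stored outside the platform. A minimal sketch (my own helper, not SDK code; the bucket labels mirror the distribution keys shown above):

```javascript
// Recompute batch summary statistics from an array of per-run overall
// scores. Hypothetical helper for offline analysis, not SDK code.
function summarizeScores(scores) {
  const n = scores.length;
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  // Population variance: average squared deviation from the mean.
  const variance = scores.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
  // Bucket scores into the same 1-point bands the batch summary uses.
  const distribution = {};
  for (const s of scores) {
    const lo = Math.min(Math.floor(s), 4); // a perfect 5.0 lands in '4.0-5.0'
    const key = `${lo.toFixed(1)}-${(lo + 1).toFixed(1)}`;
    distribution[key] = (distribution[key] || 0) + 1;
  }
  return { mean, stdDev: Math.sqrt(variance), distribution };
}

const runs = [4.2, 3.9, 4.4, 3.1, 4.0, 4.6, 2.2, 4.1];
const summary = summarizeScores(runs);
// summary.distribution groups the low outliers (here the 2.2 run) so you
// can pull those specific inputs for inspection.
```

Even with a small sample like this, the distribution view surfaces the question the averages hide: is the low bucket one unlucky input, or a recurring input pattern the skill handles badly?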
Running A/B Tests on Skill Versions
Evaluation tells you how a skill performs in absolute terms. A/B testing tells you which of two skill versions performs better. The distinction matters: your current skill might average 4.0, and a variant you just built might average 4.1, but without a direct comparison on the same inputs you can't tell whether that gap is a real improvement or just statistical noise.
When A/B Testing Is Worth the Effort
Not every change to a skill needs a formal A/B test. A few heuristics:
Run an A/B test when:
- You’ve swapped the underlying model (e.g., moving from Claude 3 Haiku to Claude 3.5 Sonnet, or testing cost reduction by moving to a smaller model)
- You’ve significantly rewritten the prompt — not tweaked a word, but changed the approach
- You’ve changed the output structure in a way that affects how the skill is used downstream
- You’re choosing between two fundamentally different approaches to the same task
- You need to justify a decision to stakeholders and want evidence beyond personal judgment
Don’t bother with a formal A/B test when:
- You’re fixing a clear bug (bad formatting, wrong field name)
- The change is a minor word-level prompt edit
- You’re adding an instruction that handles an edge case you observed
For small changes, a before-and-after batch evaluation is faster and sufficient. A/B testing is for meaningful decisions.
Setting Up a Parallel A/B Test
The agent.abTest() method takes an array of variants — each specifying a skill ID — and a shared set of inputs. Every variant runs against every input, and scores are computed against the same rubric. The system returns a side-by-side comparison with a confidence score.
```javascript
async function runABTest(rubricId, testDataset) {
  const agent = new MindStudio({ apiKey: process.env.MINDSTUDIO_API_KEY });

  const abResult = await agent.abTest({
    variants: [
      {
        skillId: 'ticket-summarizer-v1',
        label: 'control',
        description: 'Original prompt with Claude 3 Haiku'
      },
      {
        skillId: 'ticket-summarizer-v2',
        label: 'treatment',
        description: 'Revised prompt with more explicit completeness instruction, Claude 3.5 Sonnet'
      }
    ],
    inputs: testDataset,
    rubricId: rubricId,
    options: {
      parallel: true,
      maxConcurrency: 4,
      storeResults: true
    }
  });

  console.log('A/B test results:', abResult.comparison);
  // {
  //   control: {
  //     averageScore: 3.8,
  //     standardDeviation: 0.5,
  //     breakdown: {
  //       accuracy: 3.9,
  //       completeness: 3.4,
  //       actionability: 3.8,
  //       conciseness: 4.1,
  //       valid_output_schema: 0.96
  //     }
  //   },
  //   treatment: {
  //     averageScore: 4.2,
  //     standardDeviation: 0.4,
  //     breakdown: {
  //       accuracy: 4.3,
  //       completeness: 4.0,
  //       actionability: 4.2,
  //       conciseness: 3.8,
  //       valid_output_schema: 0.98
  //     }
  //   },
  //   winner: 'treatment',
  //   delta: 0.4,
  //   confidence: 0.93,
  //   pValue: 0.03
  // }

  return abResult;
}
```
The confidence: 0.93 tells you there’s a 93% chance the treatment is genuinely better — not just scoring higher due to random variation in the test set. The delta: 0.4 is the average score difference. The pValue: 0.03 is the standard statistical p-value for anyone who prefers to think in those terms.
A confidence above 0.90 on a dataset of 50+ examples is generally actionable. Between 0.80 and 0.90, the result is suggestive but not definitive — consider expanding the test set. Below 0.80, the difference isn’t meaningful enough to act on.
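If you want to sanity-check a confidence value like this by hand, a rough two-sample comparison gets you most of the way: divide the observed delta by the standard error of the difference in means. This is a back-of-envelope z-style approximation, not the SDK's statistics, and the per-variant sample size of 50 is an assumption:

```javascript
// Back-of-envelope significance check for an A/B delta: is the observed
// difference in means large relative to its standard error? A rough
// z-style approximation, not the SDK's actual statistics.
function roughSignal(meanA, sdA, nA, meanB, sdB, nB) {
  const delta = meanB - meanA;
  // Standard error of the difference between two independent means.
  const se = Math.sqrt((sdA ** 2) / nA + (sdB ** 2) / nB);
  return { delta, z: delta / se };
}

// Numbers from the example comparison above: control 3.8 (sd 0.5),
// treatment 4.2 (sd 0.4), assuming 50 test inputs per variant.
const { delta, z } = roughSignal(3.8, 0.5, 50, 4.2, 0.4, 50);
// A |z| above roughly 2 suggests the delta is unlikely to be noise;
// here z comes out well above that, consistent with the high confidence.
```

The point of the exercise isn't to replace the built-in confidence metric; it's to build intuition for why a 0.4 delta is decisive at 50 inputs but would be shaky at 10.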
Reading the Per-Criterion Breakdown
The overall score comparison gives you a headline. The per-criterion breakdown gives you the story behind it.
In the example above, treatment wins on accuracy (4.3 vs 3.9), completeness (4.0 vs 3.4), and actionability (4.2 vs 3.8). But control scores higher on conciseness (4.1 vs 3.8). That’s not surprising — a more thorough prompt that improves completeness often adds some verbosity as a side effect.
This breakdown tells you whether a tradeoff is acceptable. In this case, a 0.3 drop in conciseness in exchange for a 0.6 gain in completeness is a reasonable deal for a support ticket summarizer. If this were a skill generating SMS notifications where brevity is critical, the same result might lead to a different decision.
Looking at criterion-level results also helps you understand why a variant is performing better or worse, which informs what to try next.
Comparing More Than Two Variants
The variants array accepts more than two items, so you can run a three-way or four-way comparison in one test:
```javascript
const abResult = await agent.abTest({
  variants: [
    { skillId: 'summarizer-claude-haiku', label: 'haiku' },
    { skillId: 'summarizer-claude-sonnet', label: 'sonnet' },
    { skillId: 'summarizer-claude-opus', label: 'opus' }
  ],
  inputs: testDataset,
  rubricId: rubricId
});
```
This is useful for model comparison — you can see quality and (separately) latency/cost differences across model tiers. The confidence metric in this case reflects the winner’s margin over the second-best variant.
A practical caution: comparing more variants without a larger test dataset reduces statistical power. If you’re comparing four variants and only have 25 test inputs per variant, the results are less reliable. For multi-variant tests, aim for 50+ inputs in your test dataset.
Building Evaluation Into Your Development Workflow
The real value of Skills 2.0 evaluation isn’t any single test run — it’s having a structured process that produces a history of score data as you iterate. That history is what lets you answer questions like “is this skill better than it was three months ago?” or “did that model upgrade actually pay off?”
Step 1: Establish a Baseline
When you first deploy a skill, run a batch evaluation against your test dataset. Record the overall score and per-criterion scores. This is your baseline. Even if the score is 3.2 out of 5 — not great — you have a starting point to improve from, and you’ll know when you’ve made progress.
Store baselines in a simple format alongside your skill definitions. If you’re using version control, include the baseline scores in the skill’s README or a QUALITY.md file.
Step 2: Write a Quality Hypothesis Before Changing Anything
Before making a change to a skill, write down what you expect the change to do. “Rewriting the instruction to be more explicit about completeness should raise the completeness score from 3.4 to at least 3.8.” This sounds like overkill for small changes, but it forces you to think about the direction and magnitude of the improvement before you’re looking at results. It also helps you spot when a change that raised one score unintentionally lowered another.
Step 3: Change One Variable at a Time
If you change the prompt, the model, and the output format simultaneously and the score improves by 0.6, you don’t know what drove the improvement. Maybe the prompt change did 90% of the work. Maybe the model swap was actually a regression on completeness but the format change masked it. Isolating one change per iteration is slower but it gives you real information about what’s working.
The exception: if you’re doing a clean rewrite of a skill (completely new approach), testing the old vs. new as a complete package is reasonable. Just don’t try to attribute the result to individual changes.
Step 4: Run the A/B Test
With your change implemented as a new skill version, run an A/B test using your standard test dataset. Check the confidence score. If it’s above 0.9 and the new version wins, you have a clear signal.
If confidence is between 0.8 and 0.9, consider:
- Expanding the test dataset with more examples and re-running
- Whether the delta (score difference) is large enough to matter practically, even if not statistically certain
If the test is inconclusive (confidence below 0.8), don’t ship the change yet. Either the change isn’t making a difference, or your test set isn’t big enough to detect a small difference. Both are worth knowing.
Step 5: Update the Skill and Re-Baseline
If the A/B test confirms the new version is better, update the skill to the new version and run a fresh batch evaluation to establish the new baseline. The old baseline is preserved in the version history — you can always look back at where a skill started.
Step 6: Enable Production Monitoring
Once a skill is in production, evaluation shouldn’t stop. Real-world inputs often differ from test datasets over time — distributions shift, new input patterns emerge, users behave differently than you expected. A skill that scores 4.2 on your test set might score 3.5 on actual production traffic.
Production monitoring samples a percentage of live skill calls and runs them through the evaluation rubric automatically:
```javascript
await agent.enableProductionMonitoring({
  skillId: 'ticket-summarizer',
  rubricId: rubricId,
  samplingRate: 0.05,    // evaluate 5% of production calls
  alertThreshold: 3.5,   // alert if average drops below this
  alertWebhook: 'https://your-alerting-endpoint.com/webhook'
});
```
The alert fires if the rolling average score drops below your threshold. You can route that alert to Slack, PagerDuty, email, or any webhook. Set the threshold at a meaningful distance below your baseline — about 0.5 points below is a reasonable default. That gives you signal without constant noise.
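Under the hood, this kind of alert reduces to a rolling-window average check. Here is a hypothetical sketch of that core decision; the window size and alert cadence are assumptions, not the SDK's documented internals:

```javascript
// Sketch of a monitoring loop's core decision: keep a rolling window of
// sampled scores and fire when the window average dips below the
// threshold. Hypothetical logic, not the SDK's actual internals.
function createMonitor({ windowSize, alertThreshold, onAlert }) {
  const window = [];
  return function record(score) {
    window.push(score);
    if (window.length > windowSize) window.shift(); // keep only the last N
    const avg = window.reduce((a, b) => a + b, 0) / window.length;
    // Wait for a full window before alerting, to avoid early-sample noise.
    if (window.length === windowSize && avg < alertThreshold) onAlert(avg);
    return avg;
  };
}

const alerts = [];
const record = createMonitor({
  windowSize: 3,
  alertThreshold: 3.5,
  onAlert: (avg) => alerts.push(avg)
});
// Simulate sampled production scores drifting downward over time.
[4.1, 3.9, 3.2, 3.1, 3.0].forEach(record);
// The first breach happens once the rolling average crosses 3.5, not on
// the first individual low score — single bad outputs don't page anyone.
```

That last property is the reason to alert on a rolling average rather than individual scores: one bad output is expected; a sustained drop in the average is drift.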
Designing a Good Test Dataset
The quality of your evaluation is bounded by the quality of your test dataset. A weak test set produces scores that don’t reflect real performance. A good one gives you results you can trust.
What Makes a Good Test Dataset
Diversity is more important than volume. Fifty identical-style inputs teach you less than twenty highly varied ones. Cover different input lengths, different formats, different edge cases, different topic areas within the skill’s domain.
Include edge cases explicitly. The inputs that trip up AI workflows are rarely the clean, well-formed ones — they’re the messy ones. Inputs with typos. Inputs that are missing information. Inputs in unexpected formats. Inputs from users who are frustrated. Make sure these are in your test set.
Use real data where possible. Synthetic test data generated by AI tends to be too clean and uniform. It lacks the irregularity of real user-generated content. If you have production data (even a few weeks’ worth), use real examples (appropriately anonymized) to anchor your test set.
Don’t use training data. If you used specific examples to tune the skill’s prompt, don’t include those exact examples in your evaluation dataset. You’d be testing on data the skill has already been optimized for, which inflates scores.
Keep the dataset consistent over time. When you run evaluations at different points in time — to track progress or compare versions — use the same core dataset. Adding new examples is fine. Removing old ones changes what you’re measuring.
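The "don't test on training data" rule is easy to enforce mechanically: before baselining, strip any evaluation input that also appears in your prompt-tuning examples. A minimal sketch, assuming ticket-style inputs keyed on a hypothetical ticketText field:

```javascript
// Remove evaluation inputs that were also used to tune the skill's
// prompt, so scores aren't inflated by testing on optimized-for
// examples. The `ticketText` key is an assumption about input shape.
function excludeTuningExamples(evalSet, tuningSet, key = 'ticketText') {
  const seen = new Set(tuningSet.map((item) => item[key].trim().toLowerCase()));
  return evalSet.filter((item) => !seen.has(item[key].trim().toLowerCase()));
}

const tuning = [{ ticketText: 'Cannot cancel my subscription' }];
const evalSet = [
  { ticketText: 'Cannot cancel my subscription' }, // overlaps with tuning
  { ticketText: 'App crashes on login' }
];
const clean = excludeTuningExamples(evalSet, tuning);
// `clean` keeps only the non-overlapping input.
```

Exact-match filtering only catches verbatim reuse; if you paraphrased tuning examples into your test set, you'll still need a manual pass.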
How Many Examples Do You Need?
| Use case | Minimum | Recommended |
|---|---|---|
| Quick directional check | 10–15 | 20 |
| Reliable baseline | 25–30 | 40–50 |
| A/B test with high confidence | 40+ | 60+ |
| Multi-variant test (3+) | 50+ | 80+ |
| Production monitoring calibration | 100+ | 200+ |
These aren’t rigid rules — they’re practical starting points. The system’s confidence score adjusts based on both sample size and variance. High-variance skills need larger samples to produce reliable results.
Common Mistakes With AI Workflow Evaluation
Writing Descriptions That Are Too Vague
The most common failure mode in evaluation setup. The description is what the evaluation model reads when deciding how to score. If it’s imprecise, scores become inconsistent.
Bad: “The output should be high quality and useful.”

Good: “The output directly answers the question asked without adding unsolicited context or caveats. It does not hedge unnecessarily, and it addresses the specific situation described rather than giving generic advice.”
The good description is long, but it’s testable. The evaluation model can actually apply it consistently.
Treating Scores as More Precise Than They Are
A skill that scores 4.0 on 50 inputs isn’t objectively better than one that scores 3.9. The difference might be noise, or it might reflect run-to-run variation in the evaluation model itself (LLM evaluators are stochastic, so the same output can receive slightly different scores on different runs). Focus on differences of 0.3+ as potentially meaningful, and verify with the confidence metric before acting.
Using the Same Rubric for Every Skill
A rubric designed for customer support summaries shouldn’t be used to evaluate code generation output or marketing copy. The criteria and their weights need to reflect the specific quality requirements of each skill type. Keep a library of rubrics — one per skill category — and update them as your understanding of quality evolves.
Evaluating Without a Test Dataset and Calling It Good
Running evaluation on two or three handpicked examples is better than nothing, but it’s not a baseline. Those examples were almost certainly chosen because they’re representative of the easy cases. A real test dataset covers the full distribution, including the hard cases.
Forgetting to Update Rubrics When Skills Mature
A rubric you wrote when a skill was in early development reflects your initial understanding of what good looks like. Six months later, after seeing real usage, your definition of quality is probably sharper. If you don’t update the rubric, you’re measuring against an outdated standard and may miss real problems.
How MindStudio Fits Into This Picture
Evaluation and A/B testing are most powerful when they’re connected to everything else in your development and deployment process — not sitting in a separate evaluation tool you have to maintain independently.
The Skills 2.0 evaluation system lives inside MindStudio, which means it’s integrated with the platform where you build, manage, and deploy the skills themselves. A few specific ways this integration matters:
Building and evaluating in the same place. When you build a workflow in MindStudio’s visual builder, each step can become a skill. You can run evaluations on individual steps without having to extract them from the workflow context. Found a quality problem in step 4 of a 7-step pipeline? Evaluate step 4 in isolation, A/B test two approaches, plug the winner back into the workflow — without rebuilding anything.
200+ models, one rubric. One of the most useful A/B tests is model comparison: does Claude 3.5 Sonnet do this task better than Claude 3 Haiku? Does a specialized smaller model beat a general-purpose larger one? In MindStudio, all models are available without separate accounts or API keys. You can swap models in a skill in minutes, run the same rubric against both, and get a clean comparison. The rubric and test dataset stay constant; only the model changes.
Team rubric libraries. If multiple people on your team are building and maintaining skills, shared rubrics ensure you’re measuring quality consistently. The same accuracy criterion applies to your skill and to your colleague’s — they’re not defining it differently in their own evaluation setup.
API-accessible evaluation for CI/CD. The MindStudio evaluation API can be called from CI/CD pipelines. When a workflow is updated and a deployment is triggered, an automated evaluation run kicks off using the skill’s registered rubric and test dataset. If the score drops below threshold, the pipeline flags it before the change goes live. This catches regressions automatically rather than waiting for someone to notice a problem in production.
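The gate itself reduces to a small comparison against the stored baseline. A hypothetical sketch follows; the baseline shape and the 0.3-point regression tolerance are assumptions, and in a real pipeline the current scores would come from a fresh batch evaluation run rather than a hardcoded object:

```javascript
// Minimal CI quality-gate sketch: compare fresh batch scores against the
// stored baseline and fail the pipeline on a meaningful regression.
// The baseline shape and 0.3 tolerance are assumptions, not SDK behavior.
function qualityGate(baseline, current, tolerance = 0.3) {
  const regressions = Object.entries(baseline)
    .filter(([criterion, base]) => (current[criterion] ?? 0) < base - tolerance)
    .map(([criterion, base]) => ({ criterion, base, now: current[criterion] ?? 0 }));
  return { pass: regressions.length === 0, regressions };
}

// In a real pipeline, `current` would come from a batch evaluation run
// triggered by the deployment; these numbers are illustrative.
const baseline = { overall: 4.0, accuracy: 4.1, completeness: 3.8 };
const current = { overall: 3.9, accuracy: 3.6, completeness: 3.9 };
const gate = qualityGate(baseline, current);
// The overall score barely moved, but the per-criterion check catches the
// accuracy drop. In CI you would exit nonzero when gate.pass is false.
```

Checking per-criterion scores, not just the overall, is the design choice that matters here: a regression on a heavily weighted criterion can hide behind gains elsewhere in the weighted average.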
Evaluation alongside MindStudio’s broader workflow tools. Skills don’t exist in isolation — they’re often called within larger automated workflows that include email triggers, scheduling, integrations with tools like HubSpot or Salesforce, and multi-step reasoning chains. MindStudio manages all of this in one place. Evaluation scores for individual skills can be surfaced alongside workflow-level metrics, giving you a complete picture of where quality issues are occurring in a pipeline.
If you’re using Claude Code and want to get started with Skills 2.0, you can set up the npm SDK and create your first rubric within a free account at mindstudio.ai. The evaluation system, batch runs, and A/B testing are all available without any paid tier requirement to start experimenting.
For Claude Code-specific setup and example rubrics for common skill types, the MindStudio documentation includes quickstart guides for integrating the Skills plugin with agentic coding workflows.
Frequently Asked Questions
What is Claude Code Skills 2.0?
Skills 2.0 is an update to MindStudio’s Agent Skills system — the npm package @mindstudio-ai/agent that lets Claude Code and other AI agents call MindStudio workflows as typed method calls. Version 2.0 adds built-in evaluation (scoring skill outputs against structured rubrics) and A/B testing (comparing two or more skill versions in parallel on the same inputs). The original Skills system handled execution. Version 2.0 adds measurement.
How does Skills 2.0 evaluation differ from reading outputs manually?
Manual review doesn’t scale, and it’s not consistent. The same person reviewing the same output on different days may rate it differently. Skills 2.0 evaluation uses rubrics with explicit, testable criteria — scored by Claude or another evaluation model. You get a numeric score per criterion, an overall weighted score, and a history of scores across versions and time. That turns quality from a judgment call into a measurable, trackable attribute.
What model evaluates skill outputs?
By default, a current Claude model evaluates outputs. You can specify a different evaluation model using the evaluatorModel option in agent.evaluateSkill(). Some teams use a more capable model for evaluation even if the skill itself uses a lighter, faster one — since evaluation is async and latency isn’t a concern. You can also run evaluations with multiple evaluation models to check for inter-evaluator consistency.
How many test inputs do I need for reliable evaluation results?
For a directional sense of performance, 15–20 inputs is a starting point. For a reliable baseline you’d make decisions from, 40–50 is the practical minimum. For A/B tests where you need the confidence metric to be meaningful, 50+ is recommended. The system’s confidence score accounts for sample size — if it’s returning low confidence values, that’s your signal to expand the dataset before acting on the results.
Does Skills 2.0 work with agents other than Claude Code?
Yes. The @mindstudio-ai/agent SDK works with any agent that can execute JavaScript or Node.js functions — LangChain agents, CrewAI agents, AutoGPT-style agents, and custom-built systems. The evaluation system measures skill output quality regardless of which agent invoked the skill. The agent’s identity doesn’t affect how the skill is scored.
Can I use Skills 2.0 evaluation for image or video generation skills?
Evaluation in Skills 2.0 is primarily designed for text-based skill outputs. For image or video generation skills, you can add text-based analysis steps — for example, using a vision model to describe the image and then evaluating the description against your criteria. Native pixel-level image quality evaluation isn’t currently supported. Text and structured data outputs are the primary use case.
What’s the difference between a rubric and an evaluation criterion?
A criterion is one specific dimension of quality — a single thing you’re measuring. A rubric is a collection of criteria with weights that define overall quality for a given skill type. When you run an evaluation, you pass a rubric ID. The system scores the output against every criterion in that rubric and computes a weighted overall score. You register rubrics once and reuse them across multiple evaluation runs and A/B tests.
How do I know when an A/B test result is actionable?
Look at the confidence value in the test results. Confidence above 0.9 means the result is robust — the winning variant is genuinely performing better, not just getting lucky on a particular test set. Between 0.8 and 0.9, the result is suggestive but worth validating with more examples. Below 0.8, treat the test as inconclusive. Also check the delta — a high-confidence result with a delta of 0.05 may not be worth acting on even if statistically real. Focus on differences of 0.3+ in overall score.
Can I run production A/B tests — serving different skill versions to different real users?
Skills 2.0 A/B testing is primarily designed for development-time evaluation against test datasets. Production traffic splitting (routing live requests to variant A vs. variant B and comparing outcomes) is a separate capability that requires additional setup through MindStudio’s workflow configuration. The evaluation system’s productionMonitoring feature lets you sample and score live traffic against a rubric, but this is for a single deployed version rather than a head-to-head split.
Key Takeaways
Building AI workflows without evaluation is manageable at small scale. It becomes a liability once you have many skills, a team making changes, and real users depending on quality. Skills 2.0 gives you a practical path to measurement without requiring a separate evaluation infrastructure.
The main points:
- Rubrics need specific criteria descriptions. Vague descriptions produce inconsistent scores. Write criteria that a reasonable person (or model) can apply consistently to any output.
- Batch evaluation produces baselines, not single runs. Always run at least 30–50 test inputs to get a stable average. Single-run scores are for debugging, not decision-making.
- Change one variable between A and B. If you change multiple things at once, a winning A/B result doesn’t tell you what worked.
- The confidence score matters as much as the delta. A 0.5 advantage with 0.7 confidence is not actionable. A 0.3 advantage with 0.95 confidence is.
- Production monitoring catches what test sets miss. Real user inputs diverge from test datasets over time. Sample and score live traffic to detect drift before it becomes a user-facing problem.
- Evaluation is a loop, not a gate. The value compounds over time as you build a history of scores and baselines across skill versions.
MindStudio’s Skills 2.0 system — including the evaluation SDK, rubric management, batch testing, and A/B testing tools — is available to try for free at mindstudio.ai. If you’re building with Claude Code or any other AI agent framework, the Skills plugin setup takes under ten minutes and works alongside whatever agent architecture you’re already using.