Claude Code Skill Audit with Dynamic Workflows: How to Score and Rank 40+ Skills at Once

Why Skill Audits Are Painful (and How Dynamic Workflows Fix That)

Running a thorough skill audit is one of those tasks that feels straightforward until you’re actually doing it. You have forty-plus skills to evaluate, each needing a consistent rubric, evidence collection, gap analysis, and a priority rank. Do them sequentially and you’re looking at hours of work. Do them inconsistently and the rankings mean nothing.

Claude Code’s dynamic workflows change the math on this problem. Instead of processing skills one by one, you can fan out across all of them in parallel, apply a consistent scoring function, then aggregate and rank the results automatically. This article shows exactly how to build that system — from the workflow architecture down to the prompts and data structures that make it work.

What Dynamic Workflows Actually Are in Claude Code

Before getting into the build, it helps to be precise about terminology.

In Claude Code, a dynamic workflow is one where the number of tasks, agents, or branches isn’t fixed at design time — it’s determined at runtime based on input data. Contrast this with a static pipeline, where you hardcode step one, step two, step three.

For a skill audit, dynamic workflows matter because:

You don’t always know upfront how many skills you’re auditing
New skills can be added to the list without changing the workflow logic
Each skill might require slightly different evaluation criteria depending on its category

The pattern that makes this tick is fan-out / fan-in:

Fan-out: Take a list of skills, spawn a parallel evaluation task for each one
Evaluate: Each task scores a single skill against your rubric
Fan-in: Collect all scores, normalize them, and produce a ranked output

Claude Code supports this through its sub-agent spawning capabilities. You write an orchestrator that reads the skill list, creates child tasks dynamically, and waits for all results before producing the final ranking.
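To make the fan-out / fan-in shape concrete, here is a minimal Python sketch of the same pattern outside Claude Code. It is an analogy, not Claude Code's API: `evaluate_skill` is a hypothetical stub standing in for one evaluator sub-agent, and the `current_level` field is an assumption added only so the stub has something to compute.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_skill(skill: dict) -> dict:
    """Hypothetical stand-in for one evaluator sub-agent (no model call)."""
    gap = max(1, skill["target_level"] - skill["current_level"])
    return {"skill_id": skill["id"], "gap": gap}


def run_audit(skills: list[dict]) -> list[dict]:
    # Fan-out: one task per skill. The number of tasks comes from the
    # input list at runtime, not from the code.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(evaluate_skill, skills))
    # Fan-in: every result is back, so the list can be ranked as a whole.
    return sorted(results, key=lambda r: r["gap"], reverse=True)


skills = [
    {"id": "python", "current_level": 5, "target_level": 8},
    {"id": "sql", "current_level": 6, "target_level": 7},
    {"id": "docker", "current_level": 2, "target_level": 7},
]
ranked = run_audit(skills)
print([r["skill_id"] for r in ranked])  # ['docker', 'python', 'sql']
```

Adding a fourth skill to the list changes the number of tasks without touching `run_audit`, which is the property that makes the workflow dynamic.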

Designing the Skill Scoring Rubric

The quality of your audit depends almost entirely on your scoring rubric. Before writing a single line of workflow code, define what “score” actually means.

Choosing Your Dimensions

A robust skill score usually has three to five dimensions. A common starting point:

Proficiency — How well can you perform this skill right now? (1–10)
Frequency — How often does this skill come up in your actual work? (1–10)
Gap vs. Target — How far are you from where you need to be? (1–10, where 10 = massive gap)
Urgency — How soon does the gap need to be closed? (1–10)
Effort to Improve — How hard is it to level this skill up? (1–10; left out of the composite in the sample weighting below and used to identify quick wins)

Each dimension gets a weight. You then compute a weighted composite score per skill, and the ranking sorts by that composite.

Deciding What “Worst to Best” Means

For a skill audit, “worst” usually means “highest priority to fix” — not just “lowest proficiency.” A skill you use daily with a big gap is worse than a skill you barely use where you’re already mediocre.

That framing changes your weighting. Frequency and gap tend to get higher weights than raw proficiency.

A sample weighting:

Dimension	Weight
Gap vs. Target	35%
Frequency	30%
Urgency	20%
Proficiency (inverted)	15%

Adjust these based on your context. The important thing is that you define them before you build — not after you see results you don’t like.
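A quick worked example shows how the sample weights behave. The two score sets below are made up for illustration; they stand for a daily-use skill with a big gap and a rarely used skill with mediocre proficiency.

```python
# Sample weights from the table above.
WEIGHTS = {
    "gap": 0.35,
    "frequency": 0.30,
    "urgency": 0.20,
    "proficiency_inverted": 0.15,
}


def composite(scores: dict) -> float:
    """Weighted composite; proficiency is inverted so weaker skills score higher."""
    total = (
        scores["gap"] * WEIGHTS["gap"]
        + scores["frequency"] * WEIGHTS["frequency"]
        + scores["urgency"] * WEIGHTS["urgency"]
        + (10 - scores["proficiency"]) * WEIGHTS["proficiency_inverted"]
    )
    return round(total, 2)


# Illustrative inputs only, not real audit data.
daily_big_gap = {"proficiency": 5, "frequency": 9, "gap": 7, "urgency": 6}
rare_mediocre = {"proficiency": 4, "frequency": 2, "gap": 3, "urgency": 2}

print(composite(daily_big_gap))  # 7.1
print(composite(rare_mediocre))  # 2.95
```

The rarely used skill has the lower proficiency of the two, yet it lands far down the list because frequency and gap carry most of the weight.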

Setting Up the Claude Code Architecture

The Three-Layer Structure

A parallel skill audit in Claude Code has three layers:

Layer 1: Orchestrator Agent. Reads the skill list, instantiates one evaluation task per skill, manages concurrency, and collects results.

Layer 2: Evaluator Agents (spawned dynamically). Each evaluator receives one skill and a shared rubric. It gathers evidence (from your notes, a self-assessment form, or a provided context document) and outputs a structured score object.

Layer 3: Aggregator. Takes all score objects, computes composite scores, normalizes if needed, sorts the list, and generates the ranked output with recommendations.

File Structure

skill-audit/
├── orchestrator.md          # Orchestrator agent prompt
├── evaluator.md             # Template prompt for evaluator agents
├── rubric.json              # Scoring weights and dimension definitions
├── skills.json              # Your list of 40+ skills with metadata
├── context/                 # Optional: notes or evidence per skill
│   ├── python.md
│   ├── sql.md
│   └── ...
└── outputs/
    ├── scores/              # Individual score files per skill
    └── audit_report.md      # Final ranked report

This structure keeps the orchestrator simple — it just iterates over skills.json, spawns an evaluator per skill, and writes results into outputs/scores/.
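The article does not prescribe a schema for rubric.json, so the layout below is one possible sketch. Every field name (`scale`, `dimensions`, `weight`, `invert`) is an assumption. The loader checks that the weights sum to 1.0 before anything else runs.

```python
import json
import math

# Hypothetical rubric.json contents; the field names are illustrative.
RUBRIC_JSON = """
{
  "scale": {"min": 1, "max": 10},
  "dimensions": {
    "gap":         {"weight": 0.35, "invert": false},
    "frequency":   {"weight": 0.30, "invert": false},
    "urgency":     {"weight": 0.20, "invert": false},
    "proficiency": {"weight": 0.15, "invert": true},
    "effort":      {"weight": 0.0,  "invert": false}
  }
}
"""


def load_rubric(text: str) -> dict:
    """Parse the rubric and reject weight sets that do not sum to 1.0."""
    rubric = json.loads(text)
    total = sum(d["weight"] for d in rubric["dimensions"].values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1.0, got {total}")
    return rubric


rubric = load_rubric(RUBRIC_JSON)
print(sorted(rubric["dimensions"]))
```

Keeping effort in the rubric with a weight of 0.0 lets evaluators score it without affecting the composite.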

Building the Orchestrator

The Orchestrator Prompt

Your orchestrator prompt needs to be precise about three things: what inputs it reads, how it spawns sub-tasks, and what it does with results.

Here’s a working orchestrator prompt:

You are running a skill audit for the provided skills list.

1. Read `skills.json` to get the full list of skills.
2. Read `rubric.json` to get the scoring dimensions and weights.
3. For each skill in the list, spawn a sub-agent using the `evaluator.md` prompt template.
   Pass the skill id, name, category, target level, and any available context file path.
4. Run all evaluations in parallel. Do not wait for one to finish before starting the next.
5. Once all evaluators return results, write each result as a JSON file to `outputs/scores/{skill_id}.json`.
6. After all scores are written, call the aggregator step.

If any evaluator returns an error or incomplete result, log it and continue. 
Do not abort the entire audit for a single failure.

The key instruction is step 4, which asks for explicit parallelism so that Claude Code fans out the evaluations concurrently instead of running them one at a time.
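The "log it and continue" rule in the prompt is easier to reason about in code. The sketch below is a plain Python analogy rather than Claude Code's own mechanism. `evaluate_skill` is a hypothetical stub that fails for one skill on purpose, and results are written to a temporary directory so the example runs anywhere.

```python
import json
import logging
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

logging.basicConfig(level=logging.WARNING)


def evaluate_skill(skill: dict) -> dict:
    """Hypothetical stand-in for an evaluator sub-agent call."""
    if skill["id"] == "broken":
        raise RuntimeError("evaluator returned an incomplete result")
    return {"skill_id": skill["id"], "scores": {"gap": 3}}


def run_evaluations(skills: list[dict], out_dir: Path) -> list[str]:
    """Evaluate all skills in parallel; return the ids that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = {pool.submit(evaluate_skill, s): s for s in skills}
        for future in as_completed(futures):
            skill = futures[future]
            try:
                result = future.result()
            except Exception as exc:
                # One bad evaluation is logged and skipped, not fatal.
                logging.warning("skipping %s: %s", skill["id"], exc)
                failed.append(skill["id"])
                continue
            path = out_dir / f"{skill['id']}.json"
            path.write_text(json.dumps(result, indent=2))
    return failed


skills = [{"id": "python"}, {"id": "broken"}, {"id": "sql"}]
out_dir = Path(tempfile.mkdtemp()) / "outputs" / "scores"
failed = run_evaluations(skills, out_dir)
written = sorted(p.name for p in out_dir.glob("*.json"))
print(failed, written)
```

Returning the list of failed ids gives you a ready-made input for re-running only those skills later.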

Skills JSON Format

Your skills.json should be flat and machine-readable:

[
  {
    "id": "python",
    "name": "Python",
    "category": "Programming",
    "context_file": "context/python.md",
    "target_level": 8
  },
  {
    "id": "sql",
    "name": "SQL",
    "category": "Data",
    "context_file": "context/sql.md",
    "target_level": 7
  }
]

The target_level field is what drives the gap calculation. If you self-assess at 5 and the target is 8, your gap score is 3.
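One simple way to turn that subtraction into a rubric score is sketched below. The clamping rule is an assumption: it treats anything at or above target as 1, matching the evaluator's "1 = at or above target" convention, and caps the result at 10.

```python
def gap_score(current_level: int, target_level: int) -> int:
    """Gap on the 1-10 rubric scale; 1 means at or above target."""
    raw_gap = target_level - current_level
    return max(1, min(10, raw_gap))


print(gap_score(5, 8))  # 3
print(gap_score(9, 8))  # 1 (already above target)
```

With this rule a one-level shortfall and no shortfall both score 1. If that distinction matters to you, shift the scale or let the evaluator judge the gap from context instead.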

Writing the Evaluator Agent

The Evaluator Prompt Template

Each evaluator handles one skill. The prompt template uses placeholders that the orchestrator fills at runtime:

You are evaluating the skill: {{skill_name}} (Category: {{skill_category}})

Target proficiency level: {{target_level}} / 10

Context available: {{context_file}}

Read the context file if it exists. If not, use only the information provided and note the missing context in the evidence field.

Score this skill on the following dimensions (each 1–10):

1. Current Proficiency — How strong is this skill right now?
2. Frequency of Use — How often does this skill appear in real work?
3. Gap vs. Target — How far is current proficiency from target_level? 
   (10 = very far, 1 = at or above target)
4. Urgency — How soon does this gap need closing?
5. Effort to Improve — How hard is it to improve this skill? 
   (10 = very hard, 1 = easy)

Return ONLY valid JSON in this format:
{
  "skill_id": "{{skill_id}}",
  "skill_name": "{{skill_name}}",
  "scores": {
    "proficiency": <number>,
    "frequency": <number>,
    "gap": <number>,
    "urgency": <number>,
    "effort": <number>
  },
  "evidence": "<1–2 sentences explaining your scores>",
  "top_recommendation": "<One specific action to improve this skill>"
}

Returning only JSON is critical. The aggregator needs to parse these files programmatically, and any prose mixed in breaks that.
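A small validation step catches malformed responses before they reach the aggregator. This is a sketch that assumes the evaluator's raw response is available as a string; `parse_score` is a hypothetical helper name.

```python
import json

REQUIRED_SCORES = ("proficiency", "frequency", "gap", "urgency", "effort")


def parse_score(raw: str) -> dict:
    """Parse one evaluator response and check it against the expected shape."""
    # json.loads raises json.JSONDecodeError (a ValueError) on stray prose.
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("scores"), dict):
        raise ValueError("expected a JSON object with a 'scores' object")
    for name in REQUIRED_SCORES:
        value = data["scores"].get(name)
        if not isinstance(value, (int, float)) or not 1 <= value <= 10:
            raise ValueError(f"{name} must be a number from 1 to 10, got {value!r}")
    return data


good = (
    '{"skill_id": "sql", "skill_name": "SQL", '
    '"scores": {"proficiency": 6, "frequency": 8, "gap": 2, '
    '"urgency": 4, "effort": 3}, '
    '"evidence": "Used daily.", "top_recommendation": "Practice window functions."}'
)
print(parse_score(good)["scores"]["frequency"])  # 8

try:
    parse_score("Here are the scores: " + good)
except ValueError as exc:
    print("rejected:", type(exc).__name__)  # rejected: JSONDecodeError
```

A rejected response can be treated like any other evaluator failure: log it and continue.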

Handling Skills Without Context Files

Not every skill will have a dedicated context document. Your evaluator prompt should handle this gracefully — instruct it to fall back to the skill name, category, and target level as its only evidence, and to note the absence of context in its evidence field.

This is better than skipping the skill or returning errors.
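The fallback can also be handled when the orchestrator fills in the template. The sketch below uses a shortened copy of the evaluator template and a hypothetical `render_prompt` helper. The wording of the fallback message is an assumption.

```python
from pathlib import Path

# Shortened copy of the evaluator template, for illustration.
TEMPLATE = (
    "You are evaluating the skill: {{skill_name}} (Category: {{skill_category}})\n"
    "Target proficiency level: {{target_level}} / 10\n"
    "Context available: {{context_file}}\n"
)


def render_prompt(skill: dict, template: str = TEMPLATE) -> str:
    """Fill the {{placeholders}}, falling back when no context file exists."""
    context_file = skill.get("context_file")
    if not context_file or not Path(context_file).is_file():
        context_file = "none (note the missing context in the evidence field)"
    values = {
        "skill_id": skill["id"],
        "skill_name": skill["name"],
        "skill_category": skill["category"],
        "target_level": skill["target_level"],
        "context_file": context_file,
    }
    prompt = template
    for key, value in values.items():
        prompt = prompt.replace("{{" + key + "}}", str(value))
    return prompt


skill = {
    "id": "sql",
    "name": "SQL",
    "category": "Data",
    "context_file": "context/definitely-missing.md",
    "target_level": 7,
}
print(render_prompt(skill))
```

Because the missing file is resolved before the sub-agent starts, the evaluator never wastes a step trying to read a path that is not there.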

Building the Aggregator

Computing Composite Scores

The aggregator reads all files in outputs/scores/, computes a weighted composite for each skill, and sorts the list.

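Here's the aggregator logic as a short Python script (you can also implement it as a Claude Code tool or a JavaScript function):

import json
from pathlib import Path

# Weights mirror the sample weighting table and sum to 1.0.
weights = {
    "gap": 0.35,
    "frequency": 0.30,
    "urgency": 0.20,
    "proficiency_inverted": 0.15,  # applied to (10 - proficiency)
}

# Load every score file the evaluators produced.
score_files = sorted(Path("outputs/scores").glob("*.json"))
all_skills = [json.loads(path.read_text()) for path in score_files]

for skill in all_skills:
    scores = skill["scores"]
    composite = (
        scores["gap"] * weights["gap"]
        + scores["frequency"] * weights["frequency"]
        + scores["urgency"] * weights["urgency"]
        + (10 - scores["proficiency"]) * weights["proficiency_inverted"]
    )
    skill["composite_score"] = round(composite, 2)

# Highest composite first: the top of the list is the highest priority to fix.
ranked = sorted(all_skills, key=lambda s: s["composite_score"], reverse=True)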

Higher composite = higher priority to fix. The top of your ranked list is your biggest skill debt.

Generating the Ranked Report

The aggregator’s final step is producing a readable output. A good audit report has four parts:

Summary table — All 40+ skills ranked with their composite scores
Top 10 priorities — Detailed breakdown with evidence and recommendations
Category summary — Which skill categories have the most debt?
Quick wins — Skills with high gap scores but low effort scores (easy to fix, high impact)

The quick wins section is often the most actionable. It finds the skills where you’re behind target but improving is relatively straightforward — these should be your first focus.
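A quick wins filter can be a few lines over the score objects. The thresholds below (gap of at least 6, effort of at most 4) are assumptions you would tune to your own score distribution, and the sample data is made up.

```python
def quick_wins(skills: list[dict], min_gap: int = 6, max_effort: int = 4) -> list[dict]:
    """Skills that are far from target but cheap to improve, easiest first."""
    hits = [
        s for s in skills
        if s["scores"]["gap"] >= min_gap and s["scores"]["effort"] <= max_effort
    ]
    # Sort by lowest effort, then by largest gap.
    return sorted(hits, key=lambda s: (s["scores"]["effort"], -s["scores"]["gap"]))


# Illustrative score objects, not real audit output.
scored = [
    {"skill_id": "git", "scores": {"gap": 7, "effort": 2}},
    {"skill_id": "kubernetes", "scores": {"gap": 8, "effort": 9}},
    {"skill_id": "sql", "scores": {"gap": 6, "effort": 3}},
    {"skill_id": "python", "scores": {"gap": 2, "effort": 2}},
]
print([s["skill_id"] for s in quick_wins(scored)])  # ['git', 'sql']
```

Note that this uses the effort dimension directly, which is why it is worth collecting even when it carries no weight in the composite.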

Running the Audit at Scale

Concurrency Considerations

Forty parallel evaluations is well within what Claude Code can handle, but there are a few things to watch:

Token usage: Each evaluator makes an API call. Forty evaluations with 500-token prompts each is 20,000 input tokens, plus the contents of any context files and the output tokens. Plan for this in your API budget.

Rate limits: If you’re running this against a production Claude API key, you may hit rate limits at very high concurrency. You can add a small concurrency cap (e.g., max 10 simultaneous evaluators) in your orchestrator without losing much speed benefit.

Context quality: Evaluations with rich context files are meaningfully better than those without. If you can spend 10 minutes writing even brief notes about each skill before running the audit, your scores will be more accurate.
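The concurrency cap mentioned under rate limits can be sketched with a semaphore. This is a generic asyncio illustration, not Claude Code configuration. `evaluate_skill` is a hypothetical stub, and the `state` counter exists only to show that the cap holds.

```python
import asyncio

MAX_CONCURRENT = 10  # cap from the example above; tune to your rate limits


async def evaluate_skill(skill_id: str, sem: asyncio.Semaphore, state: dict) -> dict:
    """Hypothetical evaluator; the sleep stands in for a model call."""
    async with sem:  # at most MAX_CONCURRENT evaluations hold the semaphore
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)
        state["active"] -= 1
        return {"skill_id": skill_id}


async def run_all(skill_ids: list[str]) -> tuple[list[dict], int]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"active": 0, "peak": 0}
    tasks = [evaluate_skill(s, sem, state) for s in skill_ids]
    results = await asyncio.gather(*tasks)
    return results, state["peak"]


results, peak = asyncio.run(run_all([f"skill-{i}" for i in range(40)]))
print(len(results), peak <= MAX_CONCURRENT)  # 40 True
```

All forty tasks are still created up front, so the fan-out logic stays the same; the semaphore only limits how many run at once.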

Validating Output Quality

Before treating your ranked list as ground truth, do a sanity check:

Are the top 5 priorities ones that feel genuinely problematic?
Are any obvious gaps missing entirely?
Do any scores seem inflated or deflated compared to your intuition?

If something looks off, check the evidence field in that skill’s score file. The evaluator’s reasoning is there, and you can often spot where it made a wrong assumption.

You can also re-run specific skills with more context to get better scores without rerunning the whole audit.

Turning Scores into Action

A ranked list is only useful if it leads to behavior change. The audit’s last output should be an action plan, not just a spreadsheet.

The Three-Tier Action Plan

Organize your recommendations into three tiers:

Tier 1: Fix now (top composite scores, especially high gap + high frequency). These are the skills most affecting your current work. Assign a specific improvement action and a deadline within 30 days.

Tier 2: Fix this quarter (mid-range composite scores, high urgency). Schedule deliberate practice or learning time. These matter, but don't require immediate intervention.

Tier 3: Monitor (low composite scores or low-frequency skills). Keep an eye on these. They're not urgent, but reassess in your next audit cycle.
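Tier assignment can be automated once you pick cutoffs. The numbers below (7.0 and 4.5 for the composite, 2 for low frequency) are assumptions for illustration. A reasonable alternative is to split the ranked list by percentile instead.

```python
def assign_tier(
    composite: float,
    frequency: int,
    high: float = 7.0,
    mid: float = 4.5,
    low_frequency: int = 2,
) -> int:
    """Map a scored skill to tier 1 (fix now), 2 (this quarter), or 3 (monitor)."""
    # Low-frequency or low-composite skills are only monitored.
    if frequency <= low_frequency or composite < mid:
        return 3
    if composite >= high:
        return 1
    return 2


print(assign_tier(8.2, frequency=9))  # 1
print(assign_tier(5.5, frequency=6))  # 2
print(assign_tier(3.0, frequency=5))  # 3
```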

Connecting Recommendations to Resources

The top_recommendation field in each score file gives you a starting action. Your aggregator can go further by categorizing recommendations:

Practice-based: Needs hands-on work (e.g., build a project using this skill)
Course-based: Needs structured learning (e.g., take a course or read a book)
Mentorship-based: Needs feedback from someone better than you
Process-based: Needs to be built into your workflow (e.g., use this skill more regularly)

Tagging recommendations by type helps you batch similar improvement activities together.

Where MindStudio Fits for Teams Running Recurring Audits

If you’re a solo developer, the Claude Code workflow described above is sufficient. But if you’re running skill audits across a team — or want to automate this process on a recurring schedule — you need more infrastructure.

This is where MindStudio’s Agent Skills Plugin becomes relevant. The @mindstudio-ai/agent npm SDK lets Claude Code (and other AI agents) call MindStudio workflows as typed method calls. That means your Claude Code orchestrator can hand off the aggregation step, report generation, and delivery to a MindStudio workflow running in the background.

Practically, this looks like:

const agent = new MindStudioAgent();

// After Claude Code finishes the parallel evaluations,
// call a MindStudio workflow to generate and distribute the report
await agent.runWorkflow("skill-audit-report", {
  scores: allSkillScores,
  team: "engineering",
  period: "Q3-2025"
});

That MindStudio workflow can then send the ranked report to Slack, write results to Airtable for tracking over time, or trigger a Notion task for each Tier 1 skill.

The split makes sense: Claude Code handles the reasoning-heavy evaluation work, while MindStudio handles the operational delivery and record-keeping. You can start building on MindStudio free at mindstudio.ai.

Common Mistakes to Avoid

Over-scoring Everything as High Priority

If your rubric isn’t calibrated, you’ll end up with a top-10 list where everything scores 8/10 and the rankings feel meaningless. Force distribution by reviewing whether your weights are actually differentiating skills. If 30 out of 40 skills score above 7.5, tighten your gap criteria.
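A calibration check along these lines can run right after aggregation. The 7.5 threshold comes from the example above. The 50% cutoff for raising a flag is an assumption.

```python
def share_above(composites: list[float], threshold: float = 7.5) -> float:
    """Fraction of skills whose composite exceeds the threshold."""
    if not composites:
        return 0.0
    return sum(1 for c in composites if c > threshold) / len(composites)


def looks_uncalibrated(
    composites: list[float], threshold: float = 7.5, max_share: float = 0.5
) -> bool:
    """True when too many skills cluster at the top of the scale."""
    return share_above(composites, threshold) > max_share


# 30 of 40 skills above 7.5, as in the example above.
composites = [8.0] * 30 + [5.0] * 10
print(share_above(composites))         # 0.75
print(looks_uncalibrated(composites))  # True
```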

Skipping the Context Files

Running evaluations with no context means Claude is guessing based on the skill name alone. A single paragraph per skill — your current projects, recent struggles, where you use this skill — dramatically improves score accuracy.

Treating the First Audit as Definitive

Your first run establishes a baseline. The value compounds when you run it again in 90 days and see which skills you actually improved. Build the workflow so reruns are easy and results are stored with timestamps.

Ignoring the Quick Wins Layer

Most teams focus on Tier 1 (biggest problems) and ignore quick wins. But closing several easy gaps fast builds momentum and reduces overall skill debt faster than grinding away at the hardest problem.

FAQ

How many skills can a Claude Code dynamic workflow handle at once?

In practice, Claude Code can fan out to dozens of parallel sub-agents. For 40–80 skills, parallel evaluation is straightforward, though a concurrency cap (for example, 10 simultaneous evaluators) helps if you run into rate limits on the underlying model API. Above 100 skills, plan on a cap from the start. The fan-out / fan-in pattern scales well as long as each sub-agent task is well-defined and returns structured output.

Do I need to write code to use dynamic workflows in Claude Code?

Not much. The orchestrator and evaluator are primarily prompt files. The aggregation step benefits from a small amount of JavaScript or Python to sort and weight the scores, but you can also write this as a Claude Code task that processes the JSON files and outputs a sorted list. If you’re comfortable with basic scripting, this is manageable in under 50 lines.

How accurate are AI-generated skill scores?

The accuracy depends on the quality of context you provide. With detailed notes about your experience and specific examples of where you’ve struggled, scores are usually within one point of a careful self-assessment. Without context, the model is inferring from the skill name and category, which produces noisier results. The relative ranking across skills tends to be more reliable than any individual score.

What’s the difference between a dynamic workflow and a static pipeline?

A static pipeline has a fixed number of steps defined at design time. A dynamic workflow generates its steps at runtime based on input data. For a skill audit, this matters because the workflow adapts to however many skills are in your list — you don’t need to redesign the workflow when you add new skills.

Can I use this approach for team skill audits, not just individual ones?

Yes. The structure scales to teams by adding a team member dimension to your skills data. Each team member submits a self-assessment (or you pull it from performance review data), and the evaluator scores each skill for each person. The aggregator can then produce individual rankings, team-level gap analysis, and a view of where the whole team has shared blind spots. MindStudio’s workflow tools are particularly useful for this use case since you can automate collection, processing, and distribution at the team level.

How often should I run a skill audit?

Quarterly is a reasonable default for most knowledge workers. Monthly is useful if you’re in a rapid learning phase or onboarding to a new role. The audit itself, once your workflow is built, takes minutes to run — the effort is in reviewing results and updating your action plan.

Key Takeaways

Dynamic workflows fan out parallel skill evaluations, turning a multi-hour manual task into a minutes-long automated process.
The rubric drives everything — define your scoring dimensions and weights before you build, not after.
Structured JSON output from evaluators is non-negotiable. Consistent format enables programmatic aggregation and comparison over time.
Quick wins are underrated — high-gap, low-effort skills should be first in your action plan, not an afterthought.
Recurring audits create real value — a single audit is a snapshot; quarterly reruns reveal whether your learning is actually working.

If you want to extend this into a team workflow with automated reporting and tracking, MindStudio gives you the infrastructure to connect Claude Code’s reasoning layer to the operational tools your team already uses — free to start, no API keys required.