Your AI Agent Is Underperforming: Run This 4-Question Harness Audit Before Switching Models

Before you upgrade your model, run this 4-question audit on your orchestration layer. Most performance problems live there, not in the weights.

MindStudio Team

Before You Touch the Model, Run This Audit

Most engineers reach for the model upgrade first. New version dropped? Swap it in. Benchmark looks better? Ship it. But a March 2026 paper from Chinua University found something that should change that reflex: the same model showed a 6x performance variation depending solely on the orchestration wrapper around it — not the weights, not the version, not the provider.

That means you might be paying for a better model when the problem is in your harness.

This post gives you a concrete 4-question audit you can run on any underperforming agent in under an hour. These four questions come directly from the ablation work in the Chinua paper and from Anthropic’s own internal practice. They won’t catch every problem, but they’ll catch the most common ones — and they’ll tell you whether switching models is even worth trying.


What You’re Actually Diagnosing

Before running the audit, it helps to have a mental model of what a harness is and why it matters so much.

Think of it this way: the raw LLM is a CPU. Powerful, but inert. It has no memory, no storage, no I/O on its own. Your context window is RAM — fast but limited. External databases are disk. Tool integrations are device drivers. And the harness is the operating system: it decides what the CPU sees, when it sees it, and what it’s allowed to do.

When your agent underperforms, the instinct is to blame the CPU. But usually the problem is the OS.

The Chinua University paper made this concrete. They took a full harness for desktop automation — OS Symphony — and migrated its control logic into natural language. Same tools, same model, same task. Performance jumped from 30.4% to 47.2%. Runtime dropped from 361 minutes to 41 minutes. LLM calls collapsed from 1,200 to 34. The representation of the control logic was doing the work, not the model.

That’s the thing you’re auditing.


What You Need Before Starting

You don’t need special tooling to run this audit. You need:

  • Access to your agent’s execution logs or traces (raw, not summarized — more on why below)
  • A list of every tool your agent has registered
  • Your current control logic, whether it lives in Python, YAML, or a prompt
  • A rough sense of which tasks are failing or underperforming

If you’re using a platform like MindStudio to build your agent, you can pull tool usage stats and execution traces from the workflow dashboard. If you’re running custom orchestration, grep your logs for tool call frequency before you start.

One warning: do not summarize your traces before reviewing them. The Chinua/DSPy research found that removing raw traces dropped accuracy from 50% to 34%. Replacing them with summaries gave 34.9% — barely better than nothing. The signal lives in the raw details.


The 4-Question Audit

Question 1: What’s in your context window that doesn’t need to be there?

Your context window is RAM. Like RAM, it’s limited — and unlike RAM, filling it up doesn’t just slow things down, it actively degrades reasoning quality.

Start by printing out (or logging) the full prompt your agent receives at the beginning of a task. Read it like a new hire reading their onboarding doc. Ask: would a competent person need all of this to do the job?

Common offenders:

  • Full conversation history when only the last 2-3 turns matter
  • Entire tool documentation when the agent only uses 3 tools
  • Verbose system prompts that repeat the same instruction 4 different ways
  • Previous task outputs that aren’t relevant to the current task

The Chinua ablation found that a stripped-down harness used 1.2 million prompt tokens per sample versus 16.3 million for the full version — and got to the same result. That’s a 14x compute difference for identical outcomes.

If your context is bloated, trim it. Start with the stuff that’s been in there since you first built the agent and that you’ve never questioned since. Those are usually the worst offenders.
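
What the fix looks like depends on your stack, but the shape is usually the same. A minimal sketch, assuming a dict-shaped context; every name here is illustrative, not from the paper:

MAX_TURNS = 3  # keep only the turns that still matter

def build_context(system_prompt, history, tool_docs, active_tools):
    # Drop stale conversation turns and include docs only for tools in play
    recent = history[-MAX_TURNS:]
    docs = {name: tool_docs[name] for name in active_tools}
    return {"system": system_prompt, "history": recent, "tools": docs}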

Now you have: a list of context elements to remove or condense.


Question 2: Which tools does your agent rarely use?

Pull your tool call logs. Sort by frequency. Look at the bottom of the list.
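
If your traces are JSON lines, the count is a few lines of Python. The "type" and "tool_name" fields here are assumptions; match them to your own log schema:

import json
from collections import Counter

# Tally tool calls across traces, one JSON event per line
calls = Counter()
with open("agent_traces.jsonl") as f:
    for line in f:
        event = json.loads(line)
        if event.get("type") == "tool_call":
            calls[event["tool_name"]] += 1

# Print most-used first; the bottom of this list is your pruning candidates
for tool, count in calls.most_common():
    print(f"{count:6d}  {tool}")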

If you have 20 registered tools and your agent uses 4 of them in 90% of runs, the other 16 are not neutral. They’re actively hurting you. Every tool in the registry is something the model has to reason about when deciding what to do next. More options means more cognitive load, more chances to pick the wrong one, and more tokens spent on tool selection.

Warel — an agent platform — removed 80% of their agent’s tools and got better results. That’s not a one-off. It’s consistent with how LLMs handle option sets: fewer, clearer choices outperform large menus.

The question to ask for each low-frequency tool: “If I removed this, what breaks?” If the answer is “nothing in the last 30 runs,” remove it. You can always add it back.

This is also a good moment to check whether any tools overlap in function. Two tools that do similar things (say, “search_web” and “browse_url”) force the model to make a judgment call it often gets wrong. Consolidate where you can.

Now you have: a pruned tool list with only the tools that actually get used.


Question 3: Are your verification or search loops making things worse?

This one is counterintuitive, and it’s where most builders push back.

The Chinua ablation tested verification loops and multi-candidate search — two features that feel like they should help. They didn’t. Verifiers cost 0.8 points on SWEBench and 8.4 points on OS World. Multi-candidate search cost 5.6 points.

Why? Because these loops assume the model can’t self-correct, so they add external checking mechanisms. But modern frontier models are often better at self-correction than the verification logic you’ve hand-coded. The verifier introduces a new failure mode (the verifier itself being wrong) while providing less benefit than you’d expect.

Ask yourself: when did you add this verification step? Was it because the model was making a specific, reproducible error? Or was it a precaution you added “just in case”?

If it’s the latter, run an A/B test. Remove the verification loop and compare outcomes on 20-30 representative tasks. You might find the loop was doing nothing — or actively interfering.
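
The comparison itself doesn’t need infrastructure. A bare-bones sketch, where run_task is a hypothetical stand-in for however your harness executes a single task:

def success_rate(tasks, run_task, verify):
    # run_task is assumed to return True on success and accept a flag
    # that toggles the verification loop under test
    passed = sum(1 for task in tasks if run_task(task, verify=verify))
    return passed / len(tasks)

# Usage, with your own task list and runner:
#   with_loop    = success_rate(tasks, run_task, verify=True)
#   without_loop = success_rate(tasks, run_task, verify=False)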

The same logic applies to retry loops, fallback chains, and any “if this fails, try that” scaffolding. Each one encodes an assumption about what the model can’t do. Those assumptions expire as models improve. Anthropic’s internal practice — what they call the subtraction principle — is to drop harness components the moment the model no longer needs them. Manis rewrote their harness five times in six months following this principle.

Now you have: a clear view of which structural components are load-bearing versus vestigial.


Question 4: Is your control logic written in code or in language?

This is the most surprising finding from the research, and the one most builders haven’t acted on yet.

The OS Symphony experiment didn’t change the model. It didn’t change the tools. It rewrote the control logic — the “if this, then that” decision-making — from Python into structured natural language. The result was a 17-percentage-point performance gain and an 88% runtime reduction.

The reason this works is that LLMs are trained on natural language. When your control logic is written in Python or YAML, the model has to parse and interpret code to understand what it’s supposed to do. When it’s written in natural language, the model reads it the same way it reads everything else — fluently.

Look at your current agent instructions. If they contain things like:

# Hard-coded routing the model has to parse as code:
if task_type == "research":
    use_tool("web_search")
elif task_type == "coding":
    use_tool("code_interpreter")

…consider rewriting that as: “When the task involves finding information, use the web search tool. When the task involves writing or debugging code, use the code interpreter.”

This isn’t always a simple rewrite — sometimes the logic is genuinely complex and benefits from code’s precision. But for high-level routing, state management, and failure handling, natural language often outperforms code because the model can reason about it directly rather than executing it mechanically.
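
In practice, that usually means moving the routing rules out of code and into the system prompt itself. A minimal sketch, with hypothetical wording; the structure matters more than the exact phrasing:

# Routing rules as structured natural language the model reads directly,
# instead of Python it would have to interpret as code
CONTROL_LOGIC = """\
Routing rules:
- When the task involves finding information, use the web search tool.
- When the task involves writing or debugging code, use the code interpreter.
- If a tool fails twice in a row, stop and report the failure instead of retrying.
"""

base_instructions = "You are a task agent with a web search tool and a code interpreter."
system_prompt = base_instructions + "\n\n" + CONTROL_LOGIC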

If you’re building agents that chain multiple steps or manage state across long tasks, this is worth a dedicated experiment. The Claude Code source leak revealing its three-layer memory architecture is a useful reference for how production systems handle this — the memory layer is written in markdown, not Python.

Now you have: a clear answer on whether your control logic representation is working for or against you.


The Failure Modes You’ll Actually Hit

“I removed tools and things broke.”

Expected. The goal isn’t to remove everything — it’s to find the tools that are registered but never called. If removing a tool breaks something, that tool was load-bearing. Put it back and document why.

“My traces are too long to read.”

Don’t read them all. Read the traces from failed runs only. The DSPy auto-optimization loop that scored 76.4% on TerminalBench 2 — the only auto-optimized system in a field of hand-engineered entries — worked by having Claude Opus read failed execution traces specifically. Failures are where the signal is.
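
A quick filter gets you there. This assumes a JSONL run log with a status field; adapt the names to your own schema:

import json

# Keep only the failed runs: that's where the signal is
with open("run_log.jsonl") as f:
    failed_runs = [r for r in map(json.loads, f) if r.get("status") == "failed"]

print(f"{len(failed_runs)} failed runs to review")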

“I rewrote the control logic in natural language and it got worse.”

This happens when the natural language version is vague where the code was precise. Natural language control logic needs to be specific. “Use the web search tool when the user asks about current events or information that might have changed recently” is better than “Use web search when appropriate.” Vague instructions in natural language are worse than precise instructions in code.

“I ran the audit and everything looks fine, but the agent still underperforms.”

Then it might actually be the model. But now you know that — and you have a cleaner harness to test the new model against. One finding from the DSPy research is worth keeping in mind here: a harness optimized on one model transferred to five other models and improved all of them. The harness is the reusable asset. If you’ve cleaned it up, a model swap will give you a cleaner signal.

For teams building on top of no-code platforms, OpenClaw best practices from 200+ hours of use covers similar pruning principles applied to a specific orchestration environment — worth reading alongside this audit.


What to Do After the Audit

Run the four questions in order. They’re roughly ordered by impact and ease:

  1. Context window bloat is usually the fastest fix with the most immediate payoff.
  2. Tool pruning takes an afternoon but often produces visible results.
  3. Verification loop removal requires A/B testing — budget a few days.
  4. Control logic rewriting is the biggest lift but has the highest ceiling.

After each change, test on the same set of representative tasks before moving to the next question. Don’t change multiple things at once — you won’t know what worked.

One thing worth building as you go: a record of what you removed and why. Harness engineering is iterative. The assumption that justified a verification loop in January might be invalid by March because the underlying model improved. Having a log means you can revisit decisions without starting from scratch.
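
It doesn’t need tooling; a dated, append-only entry per change is enough. Something like:

2026-03-12  Removed the output-verification loop from the research agent.
            Reason: A/B over 25 representative tasks showed no difference.
            Revisit: after the next model upgrade.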

If you’re building multi-agent systems where multiple agents hand off to each other, the audit applies at each agent boundary. Context bloat and tool sprawl compound across agents — a multi-agent setup using Paperclip and Claude Code is a good reference for how to structure those handoffs cleanly.

For teams thinking about how control logic scales into full application architecture, Remy takes a related approach at the app layer: you write a spec in annotated markdown, and the full-stack application — TypeScript backend, database, auth, deployment — gets compiled from it. The spec is the source of truth, not the generated code. It’s a different domain, but the same underlying insight: the representation layer matters more than most builders assume.

The broader point from the research is this: mature harness work looks less like building structure up and more like pruning it down. The instinct to add more — more tools, more verification, more fallbacks — is understandable, but it’s often wrong. The question isn’t “what should I add to make this work better?” It’s “what can I remove?”

That’s a different kind of engineering. It’s slower to get comfortable with, but the results from the Chinua and DSPy work suggest it’s where most of the remaining performance gains are hiding.

For agents where token efficiency matters as much as accuracy, the Opus plan mode approach for saving tokens in Claude Code is a practical complement to harness pruning — you can plan with a capable model and execute with a cheaper one once the harness is clean enough to support it.

Run the audit. See what you find. The model is probably not the problem.

Presented by MindStudio
