Better Model vs. Better Harness — Which One Actually Moves Your Agent's Benchmark Score?
The same model shows up to 6x performance variation based solely on harness design. Here's the data on where to invest first.
The 6x Performance Gap You’re Ignoring Every Time You Switch Models
You’re debugging an underperforming agent. The model comparison spreadsheet is open. You’re about to upgrade from Claude Haiku to Opus, or swap GPT-5.4 for something newer. Before you do that: two papers published in early 2026 show the same model producing up to 6x performance variation depending solely on the wrapper around it — not the weights, not the prompt, not the model version. The harness.
That’s the finding you need to sit with before making any model purchasing decision.
The research comes from Pan et al. at Chinua University (March 2026) and a follow-up paper from Omar Khattab, the creator of DSPy. Together they formalize something practitioners have been noticing anecdotally for a while: the orchestration code surrounding your LLM is now the primary driver of agent performance. The model is almost secondary.
This post is about that specific question — model upgrade vs. harness improvement — and where the evidence points.
What the 6x Number Actually Means
The 6x figure isn’t from a cherry-picked edge case. It comes from controlled ablation experiments where researchers held the model constant and varied only the harness structure. Same weights, same task, same evaluation criteria. The performance gap between harness configurations was larger than the gap between most frontier model generations.
To understand why, it helps to have a working mental model of what a harness is. The raw LLM is a one-shot text generator — you send tokens, it returns tokens, it stops. The harness is everything else: what goes into the context window, which tools are available, how state persists across turns, how sub-agents are spawned and managed, what happens when something fails. If you’ve used Claude Code, Codex, Cursor, or any agentic framework, you’ve been using a harness. The model inside each of those is often identical; the harness is what differs.
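To make that concrete, here is a minimal sketch of the loop in Python. The `llm_complete` client, the JSON tool-call convention, and the two example tools are illustrative assumptions rather than any specific framework's API; the point is that everything in this function except the `llm_complete` call is harness code.

```python
# A minimal harness-loop sketch, not any particular product's implementation.
import json
import os

# Tool registry: which tools exist is a harness decision, not a model capability.
TOOLS = {
    "read_file": lambda path: open(path).read(),
    "list_dir": lambda path: "\n".join(sorted(os.listdir(path))),
}

def parse_tool_call(text: str):
    """Expect the model to emit a JSON line like {"tool": "read_file", "args": {...}}."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("{"):
            try:
                obj = json.loads(line)
                return obj["tool"], obj.get("args", {})
            except (ValueError, KeyError):
                continue
    return None

def run_agent(task: str, llm_complete, max_turns: int = 20) -> str:
    """Assemble context, call the model, dispatch tools, handle failures, repeat."""
    history = [{"role": "user", "content": task}]         # the context window ("RAM")
    for _ in range(max_turns):
        reply = llm_complete(history, tools=list(TOOLS))  # one model call per turn
        history.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)
        if call is None:                                  # no tool requested: final answer
            return reply
        name, args = call
        try:
            result = TOOLS[name](**args)                  # tool dispatch is harness code
        except Exception as exc:                          # so is failure handling
            result = f"tool error: {exc}"
        history.append({"role": "user", "content": f"[{name} result]\n{result}"})
    return "max turns reached"
```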
The operating system analogy is apt here. The LLM is the CPU — capable but inert without infrastructure. The context window is RAM: fast, limited, expensive to fill carelessly. External databases are disk. Tool integrations are device drivers. The harness is the OS, deciding what the CPU sees and when. A bad OS makes a fast CPU slow. A well-designed OS makes a modest CPU punch above its weight.
The Dimensions That Actually Separate Harness Quality from Model Quality
Before running a side-by-side comparison, it’s worth being precise about what you’re actually measuring when you evaluate either option.
Token efficiency. The Pan et al. paper found that a full harness configuration burned 16.3 million prompt tokens per sample, with 600+ tool calls and 32+ minutes of runtime. A stripped-down version of the same harness, running the same task to the same outcome, used 1.2 million tokens, 51 tool calls, and under 7 minutes. That’s a 14x compute reduction with no model change and no measurable quality loss. Token efficiency is a harness property, not a model property.
Benchmark score sensitivity. The OS Symphony benchmark is the clearest data point here. Researchers took a native code harness for desktop automation and rewrote its control logic in natural language — same underlying strategy, same model, different representation. Performance jumped from 30.4% to 47.2%. Runtime dropped from 361 minutes to 41 minutes. LLM calls collapsed from 1,200 to 34. A 17-percentage-point gain from a representation change alone.
Transferability. This is the finding that surprised me most. A harness optimized on one model transferred to five other models and improved all of them. The reusable asset isn’t the model — it’s the harness. If you’ve been treating your orchestration code as throwaway scaffolding around whichever model is currently winning benchmarks, you’ve been building the wrong thing.
Structural overhead. The ablation results from Pan et al. are counterintuitive. Verifiers — the kind of self-checking loops that feel like they should help — actually hurt performance: −0.8 on SWEBench, −8.4 on OS World. Multi-candidate search hurt by 5.6 points. More structure is not always better. The only module that consistently helped was self-evolution. Everything else was overhead that the model was already handling internally.
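If you want to test this on your own agent, the protocol is an ablation rather than a redesign: hold the model and the task set fixed and flip one module at a time. A hedged sketch follows, where `run_benchmark` and the module flags are hypothetical stand-ins for whatever toggles your harness actually exposes.

```python
# Hypothetical ablation harness: same model, same tasks, one module toggled per run.
BASE_CONFIG = {"verifier": False, "multi_candidate_search": False, "self_evolution": False}

def ablate(run_benchmark, tasks, model):
    scores = {}
    for module in BASE_CONFIG:
        for enabled in (False, True):
            config = dict(BASE_CONFIG, **{module: enabled})
            scores[(module, enabled)] = run_benchmark(tasks, model=model, config=config)
    for module in BASE_CONFIG:
        # A module earns its place only if this delta is reliably positive.
        delta = scores[(module, True)] - scores[(module, False)]
        print(f"{module:>24}: {delta:+.1f} points vs. stripped baseline")
    return scores
```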
Feedback signal density. Khattab’s auto-optimization loop used Claude Opus 4.6 to read failed execution traces, diagnose failures, and rewrite the harness. The scale was 10 million tokens per iteration — 400x more feedback than prior methods. When researchers removed raw traces and replaced them with summaries, accuracy dropped from 50% to 34.9%. The signal lives in the raw details. Summarizing prior failures before feeding them back into the system actively degrades the optimization loop.
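As a sketch of what such a loop can look like (assuming a generic `complete()` client; the prompt wording and helper names here are mine, not the paper's), the one deliberate choice is that raw traces go in verbatim rather than as summaries.

```python
# A sketch of a trace-driven optimization loop; not the paper's implementation.
def optimize_harness(harness_source: str, failed_traces: list[str], complete) -> str:
    """Ask a strong model to rewrite the harness from raw failure evidence."""
    prompt = (
        "You are improving an agent harness.\n\n"
        "Current harness source:\n" + harness_source + "\n\n"
        "Raw execution traces of failed runs (unsummarized, in full):\n"
        + "\n\n---\n\n".join(failed_traces) + "\n\n"
        "Diagnose the failures and return a revised harness source."
    )
    return complete(prompt)  # very large prompts by design: the signal is in the details

def optimization_loop(harness_source, run_eval, collect_failed_traces, complete, iters=5):
    best_score = run_eval(harness_source)
    for _ in range(iters):
        traces = collect_failed_traces(harness_source)        # full traces, no compression
        candidate = optimize_harness(harness_source, traces, complete)
        score = run_eval(candidate)
        if score > best_score:                                # keep only improvements
            harness_source, best_score = candidate, score
    return harness_source, best_score
```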
The Case for Switching Models
Model upgrades are the obvious move because they’re easy to reason about. Benchmarks are public. Comparing Claude Opus 4.6 vs GPT-5.4 on agentic tasks gives you a concrete number to point at. The upgrade path is usually a one-line config change.
There are legitimate cases where the model is the bottleneck. If your agent is failing on tasks that require reasoning capabilities the current model genuinely lacks — multi-step math, long-document synthesis, vision tasks — a stronger model will help. The SWEBench verified results in the Pan et al. paper clustered around 74–76% regardless of harness configuration when using GPT-5.4 at maximum reasoning, which suggests that at the frontier, some tasks do hit a model capability ceiling.
Model upgrades also make sense when you’ve already done the harness work. If you’ve stripped your context window down to what’s necessary, removed rarely-used tools, eliminated verification loops that were hurting rather than helping, and rewritten control logic in natural language — and you’re still not hitting your target — then yes, try a stronger model.
The problem is that most teams reach for the model upgrade first, before any of that harness work has happened. The 6x variation finding suggests that’s almost always the wrong order of operations.
There’s also a cost argument that cuts against frequent model upgrades. Claude Opus 4.7 vs 4.6 involves real token cost differences. If your harness is burning 16 million tokens per task when it could be burning 1.2 million, upgrading to a more expensive model multiplies a waste problem rather than solving it.
The Case for Harness Investment
The Khattab paper’s headline result is hard to dismiss: 76.4% on TerminalBench 2, the only auto-optimized system in a field of hand-engineered entries. On a 215-task text classification benchmark, the optimized harness scored 7.7 points above state-of-the-art using 4x fewer tokens. These aren’t marginal gains.
The Anthropic “subtraction principle” is the practical framing that makes this actionable. Every harness component encodes an assumption about what the model can’t do alone. Those assumptions expire as models improve. When Opus 4.6 stopped needing context resets, Anthropic dropped them. Manis, the agent platform, rewrote their harness five times in six months. Warel removed 80% of their agent’s tools and got better results.
Mature harness work looks less like building structure up and more like pruning it down.
The Haiku result is the most striking demonstration of this. In one experiment from the Khattab paper, a smaller model — Haiku — outscored Opus through harness optimization alone. The harness closed a model capability gap that would otherwise have cost real money to close by upgrading. If you're running GPT-5.4 Mini vs Claude Haiku as sub-agents, the model comparison matters less than you think as long as the harness remains the variable you haven't controlled for.
The natural language representation finding deserves its own emphasis. Rewriting control logic from Python or YAML into structured natural language — with no other changes — produced the 17-point benchmark gain and the 88% runtime reduction. The representation itself drove the gain. This isn’t about prompting tricks; it’s about where the control logic lives and what form it takes.
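To make the contrast concrete, here is an illustrative pair (not the OS Symphony codebase): the same decision policy expressed once as harness-side Python and once as structured natural language handed to the model. The attribute and action names are placeholders.

```python
# Before: the harness owns the branching in code.
def handle_step(observation, agent):
    if observation.kind == "dialog" and "error" in observation.text.lower():
        return agent.call("dismiss_dialog")
    elif observation.kind == "dialog":
        return agent.call("confirm_dialog")
    elif observation.idle_seconds > 30:
        return agent.call("retry_last_action")
    return agent.call("continue_task")

# After: the branching lives in structured natural language in the prompt,
# and the model selects the action directly.
CONTROL_POLICY = """
When a dialog appears:
  - if it reports an error, dismiss it and note the error in your scratchpad
  - otherwise, confirm it and continue
If nothing on screen has changed for more than 30 seconds, retry the last action.
Otherwise, continue with the current task plan.
"""
```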
Platforms like MindStudio handle the orchestration layer directly: 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows. The point isn’t to avoid writing control logic — it’s that the structure of that logic matters more than which model sits inside it.
A Practical Decision Framework
Here’s the order of operations the evidence supports.
Run the harness audit first. Four questions, in order:
- What's in your context window that doesn't need to be there? Every token in context is RAM you're spending. The stripped harness in the Pan et al. paper used 1.2M tokens vs 16.3M — same result. (A sketch of this audit follows the list.)
- Which tools does your agent rarely use? Unused tools add noise to the model's decision space. Warel's 80% tool removal result is the extreme case, but the direction is consistent across the literature.
- Are you running verification or search loops? The ablation data is clear: verifiers hurt on SWEBench (−0.8) and OS World (−8.4). Multi-candidate search hurt by 5.6 points. If you added these because they felt like they should help, test removing them.
- Is your control logic written in code or natural language? The OS Symphony migration — Python to natural language, same logic — produced the 17-point gain. If your agent's decision-making is encoded in Python conditionals or YAML config, this is worth testing.
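Here is the audit sketch referenced above, covering the first two questions. It assumes you log each turn with the prompt sent and any tool called; the log schema, field names, and the whitespace tokenizer are hypothetical, so adapt them to whatever your harness actually records.

```python
# A hedged audit sketch over hypothetical run logs: context spend and tool usage.
from collections import Counter

def audit_runs(runs, registered_tools, tokenize=lambda s: s.split()):
    """runs: list of runs, each a list of turns like {"prompt": str, "tool": str | None}.
    tokenize is a rough whitespace proxy; swap in your model's real tokenizer."""
    prompt_tokens = 0
    tool_usage = Counter()
    for run in runs:
        for turn in run:
            prompt_tokens += len(tokenize(turn["prompt"]))   # question 1: context spend
            if turn.get("tool"):
                tool_usage[turn["tool"]] += 1                # question 2: tool usage
    unused = sorted(set(registered_tools) - set(tool_usage))
    print(f"prompt tokens across {len(runs)} runs: {prompt_tokens:,}")
    print("tool call frequency:", dict(tool_usage.most_common()))
    print("registered but never used:", unused or "none")
```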
If you’ve worked through all four and you’re still underperforming, then you have a model problem.
On the spec-driven side of this: tools like Remy take a related approach to the abstraction question — you write a spec in annotated markdown, and the full-stack application (TypeScript backend, SQLite, auth, deployment) gets compiled from it. The spec is the source of truth; the code is derived output. It’s a different domain, but the underlying principle — that the representation layer you work in shapes the quality of what gets generated — is the same one the harness research is surfacing.
When model choice does matter. If you’re choosing between model families for a new agent, the Qwen 3.6 Plus vs Claude Opus 4.6 agentic coding comparison and similar benchmarks are useful for understanding capability ceilings. But treat them as a starting point, not a final answer. The harness you build around the model will determine whether you actually reach that ceiling.
For sub-agent selection specifically — the worker models in a multi-agent system — the model choice matters less than the task decomposition. A well-decomposed task given to Haiku will outperform a poorly-decomposed task given to Opus. The Khattab paper proved this empirically.
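A minimal sketch of that idea, with placeholder model names and a generic `complete(model, prompt)` client: the lever is the quality of the decomposition handed to the workers, not which worker model runs the subtasks.

```python
# Decomposition-first sub-agent dispatch (illustrative; not the paper's system).
def run_with_subagents(task: str, complete, planner="opus-class", worker="haiku-class"):
    plan = complete(planner, (
        "Break this task into small, self-contained subtasks. One per line, each "
        "with everything the worker needs (inputs, expected output, constraints):\n" + task
    ))
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    results = [complete(worker, sub) for sub in subtasks]   # cheap workers, rich subtasks
    return complete(planner, "Combine these partial results into a final answer:\n"
                    + "\n\n".join(results))
```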
Where This Leaves You
The model comparison question isn’t going away. Frontier labs keep shipping improvements, and some of those improvements are real and matter for specific tasks. But the framing of “which model is best” has been the wrong question for a while now, and the Pan et al. and Khattab papers are the clearest evidence yet.
The 6x performance variation from harness design alone means that most underperforming agents have a harness problem, not a model problem. The 14x compute reduction from stripping unnecessary structure means most over-spending agents have a harness problem. The Haiku-beats-Opus result means the model hierarchy you’ve internalized is contingent on harness quality in ways that should make you skeptical of any benchmark that doesn’t control for it.
The harness optimized on one model transferred to five others. You build it once. It works across the model landscape. That’s the reusable asset. Treat it accordingly.
If you’re reaching for the model upgrade before you’ve run the four-question audit, you’re optimizing the wrong variable. The Claude Code effort levels question — how much reasoning to apply — is downstream of the same principle: the structure around the model shapes the output more than the model’s raw capability in most real-world configurations.
Start with the harness. The model can wait.