
Harness Engineering Is Now a Formal Discipline: 6 Findings That Change How You Build AI Agents

Two new papers establish harness engineering as the discipline that matters more than model selection. Here's what the research shows.

MindStudio Team

A Research Team Just Formalized What Good Agent Builders Already Suspected

Six findings from a single paper published in March 2026 are enough to make you rethink every assumption you’ve held about what actually drives AI agent performance. Not model size. Not prompt cleverness. Not the number of tools you’ve wired in. The harness.

Pan et al. out of Chinua University dropped a paper this spring that treats harness engineering as a formal discipline — not a bag of tricks, not “prompt engineering with extra steps,” but a structured field with measurable variables, controlled ablations, and reproducible results. The numbers are specific enough to be uncomfortable if you’ve been spending your optimization budget on model upgrades.

Here are the six findings that matter, and what they actually mean for how you build.


The Paper’s Setup: What They Were Actually Testing

The Chinua University team asked a question that sounds almost too simple: what if you wrote an agent’s entire control logic not in Python, not in YAML, but in structured natural language?

Their architecture has three layers. At the bottom: the actual infrastructure and tools. In the middle: what they call the runtime charter — a universal physics layer that governs how contracts bind, how state persists, how sub-agents are managed. On top: the natural language agent harness itself, which holds state-specific logic including contracts, roles, state structure, and failure modes.
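To make that layering concrete, here is a minimal sketch in Python. None of this is the paper's actual code: the class names, the dispatch method, and the example charter text are all hypothetical. The point is only that the harness is a swappable natural language document sitting on top of a fixed runtime.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Runtime:
    """Bottom two layers: the tools plus the 'runtime charter' rules that
    govern how calls are dispatched and how state persists. This part is fixed."""
    tools: dict[str, Callable[[str], str]]
    state: dict = field(default_factory=dict)

    def dispatch(self, tool_name: str, arg: str) -> str:
        result = self.tools[tool_name](arg)
        self.state[tool_name] = result  # the persistence rule lives in the runtime
        return result


@dataclass
class Harness:
    """Top layer: task-specific contracts, roles, state structure, and failure
    modes, held as structured natural language rather than code."""
    charter_text: str  # e.g. loaded from a markdown file

    def system_prompt(self) -> str:
        return self.charter_text


# Swapping the harness means swapping only charter_text; the Runtime, its
# tools, and the model stay untouched.
runtime = Runtime(tools={"search": lambda q: f"results for {q!r}"})
harness_a = Harness(charter_text="## Contract\nVerify file paths before writing.")
harness_b = Harness(charter_text="## Contract\nPlan first; never retry more than twice.")
```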

The separation is the point. By fixing the runtime and swapping only the harness, they could run clean ablations. Change one variable, measure the outcome. That’s not how most agent builders operate — most people change the model, the prompt, the tools, and the orchestration simultaneously, then wonder why they can’t tell what moved the needle.

The operating system analogy they use is worth keeping: the raw LLM is the CPU — powerful but inert. The context window is RAM. External databases are disk. Tool integrations are device drivers. The harness is the OS, deciding what the CPU sees and when. If you accept that framing, then obsessing over which CPU to buy while ignoring the OS is obviously wrong. Yet that’s what most of the “which model is best” discourse amounts to.


Finding 1: The Same Model Produced a 6x Performance Gap Depending Solely on the Harness

This is the headline number, and it’s worth sitting with. The same model. Six times the performance variation. The wrapper around the model — not the model weights — was the dominant variable.

If you’ve ever run the same prompt through Claude inside different coding environments and gotten noticeably different results, you’ve already observed this. The model is identical. The harness is not. What the paper does is quantify that gap rigorously, across benchmarks, with controlled conditions.

This has a direct implication for how you spend your time. If you’re hitting a wall with an agent and your first instinct is to upgrade the model, you’re probably pulling the wrong lever.


Finding 2: Rewriting Control Logic from Python to Natural Language Lifted Benchmark Performance by ~17 Points

The Chinua team took OS Symphony — a native-code harness for desktop automation — and migrated its control logic into a natural language representation. No other changes. Same tools, same runtime, same model.

Performance jumped from 30.4% to 47.2% on the OS Symphony benchmark. Runtime dropped from 361 minutes to 41 minutes. LLM calls collapsed from 1,200 to 34.

That last number deserves emphasis. 1,200 LLM calls down to 34 for the same task outcome. The representation itself drove the gain. Not a better model. Not more tools. Just rewriting the same logic in a form the model could reason about directly rather than having to interpret through code.

This connects to something broader about how LLMs process instructions. A model that’s been trained on natural language is going to navigate natural language control logic more efficiently than it navigates Python that encodes the same intent. The Python version forces the model to translate; the natural language version doesn’t.
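Here is a rough illustration of what that difference looks like. This is not an excerpt from OS Symphony or from the paper, just an invented example of the same retry-and-escalate logic expressed both ways:

```python
# Invented example: the same control logic written in both representations.

# Representation 1: control logic as code the model has to interpret.
CONTROL_LOGIC_PY = '''
def next_action(state):
    if not state.get("plan"):
        return "make_plan"
    if state.get("last_error"):
        return "diagnose_and_retry" if state["retries"] < 2 else "escalate"
    return "execute_next_step"
'''

# Representation 2: the same logic as structured natural language the model
# can follow directly, with no translation step.
CONTROL_LOGIC_NL = """\
## Control logic
1. If there is no plan yet, write one before doing anything else.
2. If the last step failed and you have retried fewer than 2 times,
   diagnose the error and retry; otherwise escalate to the user.
3. Otherwise, execute the next step of the plan.
"""

# Either string ends up in the model's context window; the paper's finding is
# that the second form scored roughly 17 points higher with far fewer calls.
```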


Finding 3: More Structure Actively Hurt Performance

This is the finding that most builders will resist, because it runs against the instinct that drove most agent development over the past two years.

The Chinua team ran module-by-module ablations — stripping out components and measuring what happened. Self-evolution was the only consistently helpful module. Verifiers hurt: minus 0.8 on SWEBench, minus 8.4 on OS World. Multi-candidate search hurt by 5.6 points.

The full harness they tested burned 16.3 million prompt tokens per sample, with more than 600 tool calls and over 32 minutes of runtime. The stripped-down version used 1.2 million tokens, 51 calls, and ran in under 7 minutes — for the same result. That’s 14x the compute for identical output.
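The methodology itself is simple to sketch, even though running it is expensive. Something like the loop below, where run_benchmark is a placeholder for actually executing the agent with a given set of modules enabled and recording its score and cost. The module names are the ones the paper ablates; everything else here is hypothetical.

```python
ALL_MODULES = {"self_evolution", "verifier", "multi_candidate_search"}


def run_benchmark(modules: set[str]) -> dict:
    """Placeholder: run the full agent with only `modules` enabled and return
    {"score": ..., "tokens": ..., "tool_calls": ..., "minutes": ...}."""
    raise NotImplementedError


def ablate() -> None:
    baseline = run_benchmark(ALL_MODULES)
    for module in sorted(ALL_MODULES):
        stripped = run_benchmark(ALL_MODULES - {module})
        delta = stripped["score"] - baseline["score"]
        # A positive delta means the module was hurting: removing it helped.
        print(f"without {module}: {delta:+.1f} points, "
              f"{stripped['tokens'] / baseline['tokens']:.0%} of baseline tokens")
```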


Anthropic has been calling this the subtraction principle, and the evidence for it is now in peer-reviewed form. Manis, the agent platform, rewrote their harness five times in six months. Warel removed 80% of their agent’s tools and got better results. The mature version of harness work looks less like building structure up and more like pruning it down.

The reason this happens is that every harness component encodes an assumption about what the model cannot do alone. When the model improves, those assumptions expire. A verifier that was necessary when the model made frequent errors becomes noise — and worse than noise, it introduces latency and token overhead that degrades the system. The component that was helping is now hurting, and you won’t know unless you test for it.


Finding 4: Omar Khattab’s Follow-Up Showed You Can Find the Right Harness Automatically

If the Chinua paper establishes that representation matters, the DSPy creator’s follow-up paper asks the next question: can you find the optimal harness without hand-engineering it?

The answer, based on their results, is yes — and the auto-optimized version beats hand-engineered entries.

The loop works like this: Claude Opus 4.6 reads failed execution traces, diagnoses what broke, and rewrites a complete new harness. Final scores and raw traces accumulate in a growing file system. The loop repeats. The scale is what makes it work — 10 million tokens per iteration, 400 times more feedback than any prior method, reading roughly 82 files per round.
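The shape of that loop is easy to write down, even if the scale is not. The sketch below is an interpretation of the description above rather than the paper's implementation; call_llm stands in for whatever model client you use, and the trace-file layout is invented.

```python
from pathlib import Path


def call_llm(prompt: str) -> str:
    """Placeholder for a call to the optimizing model (Opus-class in the paper)."""
    raise NotImplementedError


def optimize_harness(harness: str, trace_dir: Path, iterations: int) -> str:
    for _ in range(iterations):
        # Feed RAW execution traces, not summaries: per the finding below,
        # summarizing prior failures destroys most of the signal the optimizer needs.
        traces = "\n\n".join(p.read_text() for p in sorted(trace_dir.glob("*.log")))
        scores = (trace_dir / "scores.txt").read_text()
        harness = call_llm(
            "Current agent harness:\n" + harness +
            "\n\nRaw execution traces from the last run:\n" + traces +
            "\n\nFinal scores:\n" + scores +
            "\n\nDiagnose what broke and rewrite the complete harness."
        )  # a full rewrite each round, not a patch
        # (Run the new harness on the benchmark and write fresh traces and
        #  scores into trace_dir before the next iteration.)
    return harness
```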

The results on TerminalBench 2: 76.4% accuracy. The only auto-optimized system in a field of hand-engineered entries. On a 215-task text classification benchmark, it scored 7.7 points above state-of-the-art using four times fewer tokens.

One detail buried in the methodology is critical: raw execution traces are irreplaceable. When they removed traces entirely, accuracy dropped from 50% to 34%. When they replaced traces with summaries, it recovered only to 34.9% — barely better than nothing. The signal lives in the raw details. Summarizing prior failures before feeding them back into the optimization loop actively destroys the information the system needs to improve.

This has implications for anyone building feedback loops into their agents. If you’re compressing execution history to save tokens, you may be trading accuracy for efficiency in a way that isn’t worth it.


Finding 5: A Haiku-Class Model Outranked Opus Through Harness Optimization Alone

This is the finding that makes the “which model should I use” question feel almost quaint.

In one experiment from the Khattab follow-up, a smaller model — Haiku — outranked Opus on the benchmark. Not because Haiku is a better model, but because Haiku inside a harness optimized for its characteristics outperformed the configuration running Opus.

The implication is significant. If you’re paying for a frontier model and running it through a poorly structured harness, you may be getting worse results than a cheaper model with a well-optimized harness. The cost math changes entirely.

For teams building on platforms like MindStudio — which supports 200+ models and lets you swap between them without rewriting orchestration — this finding suggests a concrete workflow: optimize your harness on a cheaper model, then test whether the gains transfer to a more capable one before committing to the inference cost.
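A minimal sketch of that workflow, assuming you already have a way to score a harness/model pair. The evaluate function and the model identifiers below are placeholders, not real API names.

```python
CHEAP_MODEL = "haiku-class-model"      # placeholder identifiers, not real model IDs
FRONTIER_MODEL = "opus-class-model"


def evaluate(harness: str, model: str) -> float:
    """Placeholder: run your benchmark with this harness/model pair, return a score."""
    raise NotImplementedError


def tune_cheap_then_check_transfer(candidate_harnesses: list[str]) -> str:
    """Pick the best harness using the cheap model, then confirm the gain holds
    on the frontier model before committing to its inference cost."""
    baseline = candidate_harnesses[0]
    best = max(candidate_harnesses, key=lambda h: evaluate(h, CHEAP_MODEL))
    gain_on_frontier = evaluate(best, FRONTIER_MODEL) - evaluate(baseline, FRONTIER_MODEL)
    return best if gain_on_frontier > 0 else baseline
```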


Finding 6: A Harness Optimized on One Model Transferred to Five Others

The reusable asset is the harness, not the model.


The Khattab paper tested whether a harness optimized on one model would improve performance on others. It transferred to five different models and improved all of them. That’s a significant finding for how you think about what you’re building when you engineer a harness.

Model weights are someone else’s asset. You don’t own them, you can’t modify them, and they’ll be superseded. A well-engineered harness is yours. It encodes your understanding of the task, the failure modes, the state structure, and the control logic. If that harness transfers across models, it’s durable in a way that model-specific prompt tuning is not.

This is also why the Claude Code source leak revealing a three-layer memory architecture was interesting to builders — it showed that the harness design decisions made by Anthropic’s own team are non-trivial and worth studying. The memory architecture isn’t incidental; it’s load-bearing.


What’s Actually Buried Here: The Discipline Is New, the Intuition Isn’t

The Chinua paper formalizes something that experienced agent builders have been doing by feel. The best agents in production — the ones that actually work — tend to be simpler than they look. The teams that built them went through multiple rounds of subtraction. They removed the verifier when it started hurting. They cut the tools the agent never used. They rewrote the control logic when the model got good enough to handle it directly.

What the paper adds is rigor. Controlled ablations. Reproducible benchmarks. A vocabulary — harness engineering — that lets you talk about this work as a discipline rather than a collection of heuristics.

That vocabulary matters because it changes how you scope the problem. If your agent is underperforming, the question isn’t “which model should I switch to?” It’s four questions: What’s in your context window that doesn’t need to be there? Which tools does the agent rarely use? Are your verification or search loops hurting rather than helping? Is your control logic written in code or in language?
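The second and third of those questions can be answered mechanically if you log your agent's runs. A hedged sketch, assuming each run is recorded as a list of step dictionaries; the field names used here (tool, tokens, phase) are an assumption for the example, not any standard trace format.

```python
from collections import Counter


def audit(runs: list[list[dict]], registered_tools: set[str]) -> None:
    steps = [step for run in runs for step in run]
    tool_uses = Counter(step["tool"] for step in steps if "tool" in step)

    # Which tools does the agent rarely (or never) use?
    for tool in sorted(registered_tools):
        print(f"{tool}: used {tool_uses.get(tool, 0)} times across {len(runs)} runs")

    # How much of the token budget goes to verification or search loops?
    total_tokens = sum(step.get("tokens", 0) for step in steps)
    verify_tokens = sum(step.get("tokens", 0) for step in steps
                        if step.get("phase") in ("verify", "search"))
    if total_tokens:
        print(f"verification/search share of tokens: {verify_tokens / total_tokens:.0%}")
    # The context-window and code-vs-language questions are design reviews of
    # the harness document itself rather than trace statistics.
```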

The AutoResearch loop pattern that Karpathy described maps cleanly onto the Khattab optimization loop — both are about running experiments, measuring results, and iterating on the structure rather than the model. The difference is that Khattab’s version is doing it automatically, at scale, on harness design specifically.


What to Watch and What to Do

The immediate practical move is an audit. Take your worst-performing agent and run it through the four questions above before you touch the model selection. The Chinua ablation data suggests you’re more likely to find the problem in your verification loops or your tool inventory than in the model itself.

The longer-term implication is about where engineering effort should go. If harnesses transfer across models and the auto-optimization loop can find better harnesses than hand-engineering, then the skill that compounds is harness design — understanding what to put in, what to leave out, and how to structure control logic so the model can navigate it efficiently.

Tools like Remy are interesting to watch in this context: the idea of writing a spec as annotated markdown and compiling it into a full-stack application is a different instantiation of the same principle — the source of truth is the structured natural language document, and the derived output (code, in Remy’s case; agent behavior, in the harness case) follows from it. The abstraction layer is moving up.

The Chinua paper is worth reading in full if you’re building agents seriously. The benchmark numbers are specific, the ablations are clean, and the OS analogy is more useful than most frameworks for thinking about what a harness actually does. The field now has a name. The question is whether you’re engineering it deliberately or just accumulating complexity and hoping it helps.

Based on the ablation data, hope is not a great strategy here. The verifier you added last month might be costing you 8 points on OS World right now. The only way to know is to subtract it and measure.

If you want to go deeper on how model selection interacts with harness design, the Claude Mythos vs Opus 4.6 capability comparison is a useful read — but read it with the Chinua findings in mind. Capability differences between models matter less than the research suggests when the harness is the dominant variable. And for context on how self-optimization loops work at the model level, the MiniMax M2.7 self-evolving AI post covers a related pattern from a different angle.

The harness is the OS. Start treating it like one.

Presented by MindStudio
