Omar Khattab's DSPy Follow-Up: Auto-Optimized Harness Beats Every Hand-Engineered Agent on TerminalBench 2

The DSPy creator's new paper shows an auto-optimized harness hitting 76.4% on TerminalBench 2 — outscoring every hand-built entry in the field.

MindStudio Team

Omar Khattab Built a System That Optimizes Its Own Agent Harness — and It Beat Every Hand-Engineered Entry on TerminalBench 2

Omar Khattab, the researcher who created DSPy, published a follow-up paper that takes the core DSPy insight — that prompts and pipelines should be optimized, not hand-tuned — and applies it to the entire agent harness. The result: an auto-optimized harness scored 76.4% on TerminalBench 2, the only automatically optimized system in a field of hand-engineered entries.

That number deserves a moment. Every other system in that benchmark was built by humans who carefully designed their orchestration logic, tool sets, and control flow. Khattab’s system read its own failure traces and rewrote itself.

If you build agents for a living, this paper is the one to track right now.

What the DSPy Follow-Up Actually Did

The setup is worth understanding precisely, because the mechanism is what makes this interesting — not just the score.

The optimization loop works like this: Claude Opus 4.6 reads failed execution traces from previous runs, diagnoses what broke, and rewrites a complete new harness. The final scores and raw traces accumulate in a growing file system. The loop repeats. Each iteration consumes roughly 10 million tokens — about 400 times more feedback signal than any prior method used. The system reads approximately 82 files per round.
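As a concrete reference point, here is a minimal sketch of that loop in Python. Everything in it is an assumption for illustration: `run_benchmark` stands in for whatever executes the current harness against the task suite and returns per-task results with raw traces, and `optimizer_model` stands in for a call to the optimizer model (Claude Opus 4.6 in the paper).

```python
import json
from pathlib import Path

TRACE_DIR = Path("runs")  # scores and raw traces accumulate here across iterations
TRACE_DIR.mkdir(exist_ok=True)

def optimize_harness(harness: str, iterations: int, run_benchmark, optimizer_model) -> str:
    """Iteratively rewrite an agent harness from its own failure traces."""
    for i in range(iterations):
        # Run the current harness; each result carries the full, uncompressed trace.
        results = run_benchmark(harness)  # e.g. [{"task": ..., "passed": bool, "raw_trace": str}, ...]
        (TRACE_DIR / f"iter_{i}.json").write_text(json.dumps(results))

        # Feed back raw traces of the failures, not summaries of them.
        failures = [r for r in results if not r["passed"]]
        feedback = "\n\n".join(r["raw_trace"] for r in failures)

        # Ask the optimizer model to diagnose what broke and rewrite the whole harness.
        harness = optimizer_model(
            "Current harness:\n" + harness
            + "\n\nRaw traces of failed runs:\n" + feedback
            + "\n\nDiagnose the failures and rewrite the complete harness."
        )
    return harness
```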

The raw traces are not optional. When the researchers removed them, accuracy dropped from 50% to 34%. Replacing the raw traces with summaries recovered almost none of that loss, lifting accuracy only to 34.9%. The signal lives in the uncompressed details of what actually happened during execution. Summarizing prior failures before feeding them back in actively hurts performance, a counterintuitive finding with real implications for how you log agent runs.
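If that holds for your agents, the logging takeaway is to persist every step at full fidelity rather than a post-hoc digest. A minimal sketch of what such a record might look like; the field names are illustrative, not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class StepTrace:
    """One tool call, recorded exactly as it happened."""
    tool: str          # which tool was invoked
    arguments: dict    # the exact arguments, not a paraphrase
    stdout: str        # complete output, however long
    stderr: str        # complete error text
    exit_code: int

@dataclass
class RunTrace:
    """A full agent run, kept uncompressed for later optimization passes."""
    task_id: str
    passed: bool
    steps: list[StepTrace] = field(default_factory=list)  # no truncation, no summaries
```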

The benchmark results are specific: 76.4% on TerminalBench 2. On a 215-task text classification benchmark, the auto-optimized harness scored 7.7 points above state-of-the-art while using 4x fewer tokens. That second number is the one that should catch your attention — better results with less compute, not more.

There’s also a model ranking result that inverts conventional wisdom. In one experiment, Haiku (the smaller, cheaper model) outranked Opus through harness optimization alone. The harness mattered more than the model tier.

Why This Is a Different Kind of Result

Most benchmark improvements come from one of three places: a better base model, more compute at inference time, or more careful prompt engineering. This result comes from none of those. The model weights didn’t change. The inference budget per task actually went down. And the prompts were written by the optimization loop, not by a human.

What changed was the harness — the orchestration layer that decides what the model sees, when it acts, which tools it can call, and how it handles failure.

The companion paper from Pan et al. at Chinua University (March 2026) had already established that this layer matters enormously. Their ablation study found that rewriting OS Symphony’s native code harness into natural language — same logic, different representation — lifted benchmark performance from 30.4% to 47.2%. Runtime dropped from 361 minutes to 41 minutes. LLM calls collapsed from 1,200 to 34. That’s a 17-percentage-point gain and an 88% runtime reduction from a representation change, not a model change.
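To make "same logic, different representation" concrete, here is a purely illustrative contrast (not the OS Symphony codebase): the same retry policy first hard-coded as control flow the model never sees, then expressed as natural language the model can weigh against the actual situation.

```python
# Policy baked into code: opaque to the model, and invisible to any optimizer
# that only rewrites prompts.
def next_step(last_call_failed: bool, retries: int) -> str:
    if last_call_failed and retries < 3:
        return "retry"
    if last_call_failed:
        return "replan"
    return "continue"

# The same policy as natural language, placed in the prompt where the model
# (and an automatic harness optimizer) can read, apply, and revise it.
RETRY_POLICY = (
    "If the last tool call failed, retry it up to three times. "
    "After three failures, stop retrying and re-plan from the current state. "
    "Otherwise, continue with the next action."
)
```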

Khattab’s paper asks the next question: if the harness matters this much, can you find the right one automatically? The TerminalBench 2 result suggests yes.

The operating system analogy from the Pan et al. paper is useful here. The raw LLM is the CPU — capable but inert without context. The context window is RAM. External databases are disk. Tool integrations are device drivers. The harness is the operating system, deciding what the CPU sees and when. You wouldn’t hand-tune an OS for every workload. You’d want it to adapt.

The Finding That Transfers

Here’s the result I’d highlight above the benchmark score: a harness optimized on one model transferred to five other models and improved all of them.

That’s a significant claim. It means the harness isn’t just a model-specific wrapper — it’s a reusable artifact that encodes something about how to structure the task, not just how to talk to a particular model. The reusable asset is the harness, not the weights.

This has practical implications for teams that swap models frequently. If you’ve been treating your orchestration layer as throw-away scaffolding that gets rebuilt every time you switch from Claude to GPT to Gemini, the transfer result suggests you’re leaving value on the table. A well-optimized harness might be worth carrying forward.


It also reframes what “model evaluation” means. When you benchmark Claude Opus 4.6 against GPT-5.4, you’re not benchmarking the models — you’re benchmarking the harness-plus-model combinations. The comparison between Qwen 3.6 Plus and Claude Opus 4.6 on agentic coding illustrates exactly this: different harnesses, different scaffolding assumptions, different results. The model is one variable among several.
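If you take that seriously, a benchmark has to treat the harness as an explicit variable and score the cross-product rather than a single column. A sketch under assumptions: `harnesses`, `models`, and `evaluate` are all hypothetical placeholders for your own benchmark plumbing.

```python
from itertools import product

def benchmark_matrix(harnesses: dict[str, object], models: list[str], evaluate) -> dict:
    """Score every harness-model pairing, so the harness stops being a hidden
    variable silently folded into 'the model's' number."""
    return {
        (name, model): evaluate(harness, model)
        for (name, harness), model in product(harnesses.items(), models)
    }
```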

What’s Buried in the Ablation Results

The Pan et al. ablation study contains a finding that runs against the instinct of most agent builders: more structure actively hurts.

Verifiers — the modules that check an agent’s work before committing — reduced performance by 0.8 points on SWEBench and by 8.4 points on OS World. Multi-candidate search, where the agent generates several candidate solutions and picks the best, hurt by 5.6 points.

The full harness in their study used 16.3 million prompt tokens per sample, more than 600 tool calls, and over 32 minutes of runtime. The stripped-down version used 1.2 million tokens, 51 calls, and under 7 minutes — for the same result. That’s 14x the compute for identical output.

Anthropic has noticed the same pattern internally. They call it the subtraction principle: every harness component encodes an assumption about what the model cannot do alone, and those assumptions expire as models improve. When Opus 4.6 stopped needing context resets, Anthropic dropped them. Manis, the agent platform, rewrote their harness five times in six months. Warel removed 80% of their agent’s tools and got better results.

The craft of harness engineering, it turns out, is as much about removal as construction. The instinct to add verification loops, add tools, add search candidates — that instinct is often wrong.

This connects to a broader pattern visible in MiniMax M2.7’s self-evolving architecture: systems that improve through iteration tend to prune complexity rather than accumulate it. The optimization pressure pushes toward efficiency.

The Auto-Optimization Loop in Practice

The mechanism Khattab’s system uses is worth understanding at the implementation level.

Claude Opus 4.6 is the optimizer — it reads traces, diagnoses failures, and proposes harness rewrites. The harness being optimized is the full orchestration layer: what tools are available, how control flow is structured, how state is managed, how failures are handled. This isn’t prompt tuning. It’s rewriting the agent’s operating logic.

The 10 million tokens per iteration figure is the key to why this works. Prior methods gave the optimizer a thin slice of feedback — a score, maybe a few examples. This system gives it the raw execution traces of everything that went wrong, at full fidelity. The optimizer can see exactly where the agent got confused, which tool calls failed, where the context got polluted.

The 400x feedback multiplier over prior methods isn’t a marketing number — it’s the reason the raw traces matter so much. More signal, more specific signal, better rewrites.

For teams building agents on platforms like MindStudio, which supports 200+ models and 1,000+ integrations through a visual builder, the implication is that the orchestration choices you make in the builder — which tools to include, how to structure the flow, what to put in context — are the primary performance lever. The model selection matters less than the structure around it.

What This Means for How You Build

The practical takeaway isn’t “run DSPy’s auto-optimizer on your agent” — though that’s worth exploring if you have the infrastructure for 10M-token optimization loops. The practical takeaway is a different order of operations when something isn’t working.

Before you switch models, audit the harness. Four questions from the Pan et al. paper:

What’s in your context window that doesn’t need to be there? Context pollution is one of the most common performance drains. Every token the model reads that isn’t relevant to the current task is noise.

Which tools does your agent rarely use? Unused tools aren’t free. They consume context space and give the model more surface area to make wrong choices. Warel’s 80% tool removal result is the extreme version of this principle.

Are your verification or search loops actually helping? The ablation results say probably not. Test with and without; a minimal sketch of that comparison follows this list. The answer might surprise you.

Is your control logic written in code or in natural language? The OS Symphony migration — same logic, rewritten from code to natural language — produced a 17-point benchmark gain. If your agent’s decision logic lives in Python conditionals and YAML configs, that’s worth examining.
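For the third question in particular, the test is cheap to sketch. Assuming a hypothetical run_agent(task, config) that returns True on success, and a config object that toggles the verifier, a bare-bones ablation looks like this; none of it comes from the paper's code.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class HarnessConfig:
    tools: tuple[str, ...]
    use_verifier: bool = True
    search_candidates: int = 1

def ablate_verifier(tasks: list, base: HarnessConfig, run_agent) -> dict[str, float]:
    """Pass rate with and without the verification loop, everything else held fixed."""
    variants = {
        "with_verifier": base,
        "without_verifier": replace(base, use_verifier=False),
    }
    return {
        name: sum(run_agent(task, cfg) for task in tasks) / len(tasks)
        for name, cfg in variants.items()
    }
```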

The Haiku-outranking-Opus result is the sharpest version of this argument. A smaller, cheaper model beat a larger one not because of any capability difference, but because the harness around it was better optimized. If you’re reaching for a more expensive model tier when your agent underperforms, you may be solving the wrong problem.

This also has implications for how you think about the spec layer of agent construction. Tools like Remy treat the spec — annotated markdown — as the source of truth, compiling it into a full TypeScript stack. The harness optimization research suggests a similar principle applies to agents: the structured description of what the agent should do, and how, may be more load-bearing than the implementation details underneath it.

The Benchmark Caveat Worth Keeping

76.4% on TerminalBench 2 is a strong result. It’s also a benchmark result, which means it measures performance on a specific distribution of tasks under specific conditions.

The transfer result — harness optimized on one model improving five others — is more interesting as a signal of generalization. But “improved all of them” doesn’t tell us by how much, or whether the gains held on tasks outside the optimization distribution.

The honest read is: this is a compelling proof of concept that automatic harness optimization works and can beat careful human engineering on at least one benchmark. The mechanism (raw traces, high feedback volume, full harness rewriting) is specific enough to be reproducible. The transfer result is genuinely surprising.

What it doesn’t tell you is whether the same approach generalizes to your specific agent, your specific task distribution, or your specific infrastructure constraints. The 10M tokens per optimization iteration is not free. For teams without that budget, the manual version of this — systematic ablation, subtraction over addition, natural language control logic — is the accessible path to the same insight.

The AutoResearch loop pattern that Karpathy described is structurally similar: run experiments, measure results, iterate. Khattab’s system applies that loop to the harness itself rather than to the task. The abstraction level is different; the underlying principle is the same.

For anyone building agents seriously in 2026, the DSPy follow-up paper is required reading. Not because you’ll immediately deploy auto-optimized harnesses, but because it makes the argument empirically that the harness is the primary engineering surface — more so than model selection, more so than prompt wording, more so than adding capabilities. The 76.4% score is the headline. The mechanism that produced it is the lesson.

The Claude Code effort levels research points in the same direction from a different angle: the configuration layer around the model — how much reasoning to apply, when to stop, what to prioritize — drives outcomes more than raw model capability. Khattab’s paper extends that argument to the full orchestration stack and shows you can optimize it automatically.

That’s the shift. The question was never which model is best. It was always which structure around the model is best — and now there’s a method for finding out.

Presented by MindStudio
