Omar Khattab's DSPy Follow-Up: How Auto-Optimizing Your Harness Put a Tiny Model at #1 on TerminalBench

DSPy's creator showed a Haiku-powered harness beat larger models on TerminalBench. The secret: 10M tokens of automated harness optimization.

MindStudio Team

Omar Khattab Built a Self-Optimizing Harness and a Tiny Model Beat Everyone

Omar Khattab — the researcher who created DSPy — published a paper showing that an auto-optimized agent harness running Claude Haiku ranked #1 on TerminalBench 2, outperforming systems built around much larger models. Not by a little: it was the only auto-optimized system in a field of hand-engineered entries, and it scored 76.4%. The model wasn’t the variable. The harness was.

This result matters because it’s not a benchmark trick. It’s a direct demonstration that the wrapper around your model — the control logic, the memory structure, the tool selection, the context management — now drives more performance variation than the model itself. You can swap Haiku in for Opus and win, if your harness is better.

Here’s what Khattab’s system actually did, why it worked, and what you should take from it if you’re building agents.


What the Result Actually Means (Before You Dismiss It)

The instinct when you see “small model beats large model” is to look for the asterisk. What’s the catch? Narrow benchmark? Weird task distribution?

TerminalBench 2 is a terminal-based task suite — the kind of environment where agents have to navigate real shell interactions, manage state across steps, and recover from failures. It’s not a reading comprehension test. It rewards systems that can plan, execute, observe, and adapt. Exactly the kind of task where you’d expect a larger model to have an edge.

Khattab’s system scored 76.4%. It was the only auto-optimized entry. Everything else was hand-engineered by teams who knew the benchmark and tuned for it. The Haiku-powered system beat them anyway.

On a separate 215-task classification benchmark, the same approach scored 7.7 points above state-of-the-art while using 4x fewer tokens. That’s not a marginal win — that’s a structural advantage.

The mechanism isn’t mysterious once you understand what the optimization loop was actually doing.


How the Auto-Optimization Loop Works

The core idea: use a capable model to read failure traces and rewrite the harness, then repeat.

Khattab’s system used Claude Opus 4.6 as the optimizer. The loop works like this: the agent runs a task, fails or partially succeeds, and the full execution trace gets written to disk. Opus reads those traces — raw, unprocessed — diagnoses what broke, and rewrites the harness. The new harness runs again. Scores and traces accumulate. The loop repeats.

The scale is what makes this work. Each iteration processes around 10 million tokens. That’s 400 times more feedback than prior automated optimization methods. The optimizer reads approximately 82 files per round — not summaries, not abstractions, the actual raw traces.

That last point is critical and the ablation proves it. When researchers removed the raw traces entirely, accuracy dropped from 50% to 34%. When they replaced raw traces with summaries — which seems like a reasonable compression — accuracy recovered only to 34.9%. The signal lives in the details. Summarizing prior failures doesn’t preserve enough information for the optimizer to understand what actually went wrong. You need the full trace.

This is counterintuitive if you think about it from a cost perspective. Raw traces are expensive to process. But the alternative — cheaper summaries — costs you 15 points of accuracy. The optimization loop is only as good as the feedback signal you feed it.
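Here is a minimal sketch of what that loop looks like in code. The callables, trace format, and optimizer prompt are placeholders standing in for your own agent runner, task evaluator, and a strong optimizer model; this is not Khattab’s actual implementation.

```python
# Minimal sketch of a trace-driven harness optimization loop.
# run_agent, score_task, and call_optimizer are stand-ins for your own
# agent runner, task evaluator, and a call to a strong optimizer model.
import json
from pathlib import Path
from typing import Callable

def optimize_harness(
    harness: str,
    tasks: list[dict],                                     # each task assumed to have an "id"
    run_agent: Callable[[str, dict], tuple[str, str]],     # -> (raw execution trace, final output)
    score_task: Callable[[dict, str], float],
    call_optimizer: Callable[[str], str],                  # strong model, e.g. an Opus-class call
    iterations: int = 5,
    trace_dir: Path = Path("traces"),
) -> str:
    trace_dir.mkdir(parents=True, exist_ok=True)
    for i in range(iterations):
        trace_paths = []
        for task in tasks:
            trace, output = run_agent(harness, task)
            score = score_task(task, output)
            path = trace_dir / f"iter{i:02d}_{task['id']}.json"
            path.write_text(json.dumps({"score": score, "trace": trace}))
            trace_paths.append(path)

        # Feed the raw traces to the optimizer. The ablation above is the reason
        # not to summarize here: compressing the traces loses the failure signal.
        raw_traces = "\n\n".join(p.read_text() for p in trace_paths)
        harness = call_optimizer(
            "Current harness:\n" + harness
            + "\n\nFull execution traces with scores:\n" + raw_traces
            + "\n\nDiagnose what went wrong and rewrite the harness."
        )
    return harness
```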


Why the Harness Transfers Across Models

Here’s the finding that I think is underappreciated: the harness optimized on one model improved five completely different models when transferred to them.

Think about what this implies. The optimization loop isn’t learning something specific to Haiku’s weights or Opus’s reasoning patterns. It’s discovering something about the task structure — what information needs to be in context, what tools are actually useful, how failure modes should be handled, what the control flow should look like. That knowledge is model-agnostic.

The reusable asset is the harness, not the model. You build it once against one model, and it generalizes. This is a fundamentally different mental model than “train a model on your data” or “fine-tune for your use case.” The harness is the artifact you’re investing in.

This also explains why the Haiku result is possible. Haiku isn’t smarter than Opus in any absolute sense. But if the harness is doing the heavy lifting — managing context precisely, selecting the right tools, structuring the task correctly — then the model’s job becomes simpler. A simpler job is one a smaller model can do well.
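As a sketch of that mental model (assuming your eval takes a harness and a model identifier; the function and model names here are illustrative, not from the paper):

```python
# Sketch: the optimized harness is the reusable artifact; the model is a slot.
# `evaluate` stands in for whatever benchmark you run against a harness + model pair.
from typing import Callable

def transfer_harness(
    harness: str,
    models: list[str],
    evaluate: Callable[[str, str], float],
) -> dict[str, float]:
    """Run the same harness text, unchanged, against several model backends."""
    return {model: evaluate(harness, model) for model in models}

# e.g. transfer_harness(optimized_harness,
#                       ["small-model-a", "mid-model-b", "frontier-model-c"],
#                       evaluate)
```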

If you’re comparing Claude Haiku against larger sub-agent models, the model benchmarks are only part of the picture. The harness around the model is doing work those benchmarks don’t measure. And if you’re weighing effort levels and compute allocation in agentic coding pipelines, the same principle applies: the structure of the harness shapes what the model actually needs to do at each step.


The Tsinghua Paper Provides the Theoretical Frame

Khattab’s result doesn’t exist in isolation. A paper from Pan et al. at Tsinghua University, published March 2026, formalized harness engineering as a discipline and provided the ablation data that explains why Khattab’s approach works.

The Tsinghua paper introduced a three-layer architecture: backend infrastructure at the bottom, a runtime charter in the middle (universal contracts for how state persists, how sub-agents are managed), and the natural language agent harness on top. The key insight is that separating these layers lets you run controlled experiments — swap the harness while fixing the runtime, and you’re testing harness design in isolation.

Their most striking result came from migrating OS-Synapse — a native-code harness for desktop automation — into a natural language representation. Same logic, same model, same task. Just rewritten in natural language instead of Python. Performance jumped from 30.4% to 47.2%. Runtime dropped from 361 minutes to 41 minutes. LLM calls collapsed from 1,200 to 34.

That’s a 17-point accuracy gain from a representation change alone. No new model, no new tools, no new data.
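To make the representation change concrete, here is an illustrative contrast, not OS-Synapse’s actual code: the same decision policy expressed once as Python control flow around many narrow model calls, and once as natural-language instructions the model reads in full.

```python
# Illustrative only: the same policy as code-encoded control flow vs. as a
# natural-language harness. `llm` stands in for a single model call.
from typing import Callable

def code_harness_step(state: dict, llm: Callable[[str], str]) -> str:
    # Policy lives in Python branches; the model only ever sees one narrow prompt.
    if state.get("last_action_failed"):
        return llm("The previous step failed with: " + state["error"] + ". Replan it.")
    if state.get("needs_verification"):
        return llm("Verify this output before continuing: " + state["output"])
    return llm("Propose the next action toward the goal: " + state["goal"])

NL_HARNESS = """You are a desktop-automation agent. On each turn:
- If the last action failed, diagnose the error and replan before acting.
- Verify outputs only when the task explicitly requires it.
- Otherwise, propose and execute the next action toward the goal.
"""

def nl_harness_step(state: dict, llm: Callable[[str], str]) -> str:
    # Policy lives in the prompt; the model sees the whole policy plus full state.
    return llm(NL_HARNESS + "\nCurrent state:\n" + repr(state))
```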

The ablation results are equally important for understanding what not to build. The full harness on SUB-bench burned 16.3 million prompt tokens per sample, made 600+ tool calls, and ran for 32+ minutes. A stripped-down version used 1.2 million tokens, 51 calls, and finished in under 7 minutes — with the same accuracy. 14x the compute for identical results.

More structure is not always better. The verifiers they tested actually hurt: -0.8 on SWE-bench, -8.4 on OSWorld. Multi-candidate search hurt by 5.6 points. The instinct to add more scaffolding, more checks, more redundancy — that instinct is often wrong.


What Anthropic Calls the Subtraction Principle

There’s a pattern across the best agent work right now that runs counter to how most people build. Anthropic has started calling it the subtraction principle: every harness component encodes an assumption about what the model can’t do alone, and those assumptions expire as models improve.

When Opus 4.6 stopped needing context resets, Anthropic dropped them. Manus rewrote their harness five times in six months as the underlying models changed. Warel removed 80% of their agent’s tools and got better results.

Mature harness work looks less like construction and more like pruning. The craft is knowing what to remove.

This is where most builders are getting it wrong. The reflex when an agent underperforms is to add: more tools, more verification steps, more context, more structure. But the Tsinghua ablations show that additions often hurt. The verifier that seems like it should help — catching errors before they propagate — actually degrades performance because it introduces noise, adds latency, and consumes context that the model could use for the actual task.

The question isn’t “what should I add to make this work?” It’s “what can I remove without losing accuracy?” Those are different questions and they lead to different systems.


Running Your Own Optimization Loop (What You Actually Need)

You don’t need Khattab’s exact infrastructure to apply this. The core loop is: run the agent, capture full execution traces, feed traces to a capable model, get harness rewrites, repeat.

A few things you need to get right:

Raw traces, not summaries. The ablation is unambiguous. If you compress the failure signal, you lose most of the information the optimizer needs. Store full traces (a minimal recorder sketch follows below). Yes, this is expensive. It’s worth it.

Enough iterations. A single pass won’t find the good harness. The optimization is finding structure in a high-dimensional space. You need enough iterations for the signal to accumulate. Khattab’s system ran at 10M tokens per iteration — that’s the scale that produced the result.

A capable optimizer. The optimizer model (Opus 4.6 in Khattab’s case) needs to be good enough to read a complex execution trace, identify the actual failure mode, and produce a coherent harness rewrite. This is a hard task. Don’t use a small model for the optimization loop itself.

Evaluation that’s honest. The optimization loop is only as good as your eval. If your benchmark is gameable, you’ll optimize toward gaming it. TerminalBench 2 is hard to game because it requires real task completion in a real environment. Your eval should have the same property.
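A minimal trace recorder along the lines of the first point might look like this. The event fields are an assumption about what a useful trace contains, not a prescribed schema.

```python
# Sketch of full-trace capture: log every step verbatim so the optimizer sees
# raw traces, not summaries. The event fields are illustrative, not a schema.
import json
import time
from pathlib import Path

class TraceRecorder:
    def __init__(self, path: Path):
        self.path = path
        self.events: list[dict] = []

    def log(self, kind: str, **data) -> None:
        # Store the full payload (prompts, tool output, errors) uncompressed.
        self.events.append({"t": time.time(), "kind": kind, **data})

    def flush(self) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.events, indent=2, default=str))

# Typical use inside an agent step:
# rec = TraceRecorder(Path("traces/task_042.json"))
# rec.log("model_call", prompt=prompt, response=response)
# rec.log("tool_call", tool="bash", args=cmd, stdout=out, stderr=err, exit_code=code)
# rec.flush()
```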

For teams building on top of existing platforms, MindStudio handles the orchestration layer — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — which means you can focus on harness design rather than infrastructure plumbing. Getting the scaffolding out of the way is exactly the kind of subtraction that lets you iterate on what actually matters.


The Four Questions to Ask Before Touching Your Model

When an agent underperforms, the instinct is to switch models. That’s almost never the right first move. The Tsinghua and DSPy results together suggest a different order of operations.

Audit the harness first. Four questions:

What’s in your context window that doesn’t need to be there? Context bloat is one of the most common harness failures. Every token in context is a token the model has to process and potentially be distracted by. The stripped-down Tsinghua harness used 1.2M tokens instead of 16.3M and got the same result.

Which tools does the agent rarely use? Unused tools aren’t free. They consume context, they create decision surface for the model to navigate, and they can introduce failure modes when the model accidentally calls them. Warel’s 80% tool removal is an extreme case, but the direction is right.

Are your verification loops actually helping? The -8.4 on OSWorld from adding verifiers is a real number from a real ablation. Test your verifiers in isolation. Remove them and measure (a minimal ablation sketch follows after these questions). You might be surprised.

Is your control logic in code or natural language? The OS-Synapse migration showed 17 points from this change alone. If your agent’s decision logic is encoded in Python conditionals rather than natural language instructions, you’re leaving performance on the table. The model is better at following language than navigating code-encoded rules.
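If you want to answer these questions empirically rather than by intuition, a simple ablation loop does the job. This is a sketch: run_eval stands in for your own benchmark runner, and the component names are illustrative.

```python
# Sketch of a component ablation: remove one harness component at a time and
# re-run the same eval. `run_eval` is a placeholder for your benchmark runner.
from typing import Callable

def ablate_components(
    harness: dict,
    components: list[str],
    run_eval: Callable[[dict], float],
) -> dict[str, float]:
    baseline = run_eval(harness)
    deltas = {}
    for name in components:
        stripped = {k: v for k, v in harness.items() if k != name}
        # Positive delta means the harness scored higher without this component.
        deltas[name] = run_eval(stripped) - baseline
    return deltas

# e.g. ablate_components(harness,
#                        ["verifier", "multi_candidate_search", "rarely_used_tools"],
#                        run_eval)
```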

These questions apply whether you’re running a DSPy optimization loop or hand-tuning a harness manually. The underlying principle is the same: the harness is the variable, and most of the gains come from removing things rather than adding them.

If you’re building the kind of system where the harness itself is specified as structured text — intent and rules written explicitly — tools like Remy take a similar approach at the application layer: you write an annotated spec in markdown, and the full-stack app (TypeScript backend, database, auth, deployment) gets compiled from it. The spec is the source of truth; the generated code is derived output. The abstraction level shifts, but the principle of “precise specification drives the system” is the same.


Where This Goes

The Haiku result on TerminalBench isn’t an anomaly. It’s a preview of how agent performance will be competed on going forward.

The model landscape is commoditizing. Haiku, GPT-5.4 Mini, Gemini Flash — these small models are genuinely capable now. The gap between them and frontier models on raw capability is narrowing. But the gap between a well-optimized harness and a poorly designed one is not narrowing. If anything, it’s growing, because better models give you more room to make harness mistakes that compound.

The teams that figure out automated harness optimization — the DSPy loop, or something like it — will have a structural advantage that doesn’t depend on which model releases next quarter. They’ll take a new model, run the optimization loop, and have a tuned harness within days. Teams that are hand-engineering harnesses will spend months catching up.

The finding that a harness transfers across five different models is the key insight here. You’re not building for one model. You’re building a reusable artifact that works across the model landscape. That’s a different kind of investment than fine-tuning or prompt engineering, and it compounds differently.

For anyone thinking about how model choice interacts with agentic coding performance, the honest answer is: model choice matters less than you think, and harness design matters more than you think. The DSPy result is the clearest demonstration of that we’ve seen so far. It’s also worth reading how open-weight models like Qwen 3.6 Plus stack up in agentic workflows — because the same harness-transfer finding suggests that a well-optimized harness could close the gap there too.

The question isn’t which model to pick. It’s whether your harness is good enough to let a small model beat everyone else’s large one.
