
Rewriting Agent Control Logic from Python to Natural Language Cut Runtime from 361 to 41 Minutes

No model swap, no architecture change — just rewriting control logic in natural language dropped runtime by 88% and lifted benchmark scores 17 points.

MindStudio Team

The 361-to-41-Minute Result That Changes How You Should Think About Agent Control Logic

OS Symphony’s desktop automation harness ran for 361 minutes to complete a benchmark task. After rewriting the control logic — same model, same tools, same task — runtime dropped to 41 minutes. LLM calls collapsed from 1,200 to 34. The only thing that changed was the language the control logic was written in: from native code to structured natural language.

That result comes from Pan et al. at Chinua University, published March 2026, and it’s one of the more concrete findings in what’s becoming a formal discipline: harness engineering. If you’re building agents and you’re still reaching for a model upgrade when performance disappoints, this paper is worth your attention.

The finding isn’t subtle. Rewriting the same control logic from Python into natural language — no architecture changes, no model swap — lifted OS Symphony’s benchmark score from 30.4% to 47.2%. That’s 16.8 percentage points from representation alone.


What “control logic in natural language” actually means

A harness is the scaffolding that turns a raw LLM into an agent. The model itself is a one-shot text generator. The harness is what gives it memory, tools, state, and the ability to loop. Think of it as an operating system: the LLM is the CPU, the context window is RAM, external databases are disk, tool integrations are device drivers, and the harness is the OS deciding what the CPU sees and when.
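To make that concrete, here is a minimal sketch of a harness loop. The `call_model` function and the `tools` dict are stand-ins for whatever LLM client and integrations you actually use; the point is that the memory, the dispatch, and the loop itself are all harness, not model.

```python
# A minimal harness loop. call_model (any LLM client returning a parsed
# action dict) and tools are stand-ins you supply. Everything the "agent"
# has beyond one-shot text generation lives in this scaffolding.

def render_prompt(state: dict) -> str:
    # The harness decides what the model sees: the task plus prior steps.
    lines = [f"Task: {state['task']}"]
    for action, result in state["history"]:
        lines.append(f"Did: {action} -> {result}")
    lines.append("Reply with a JSON action.")
    return "\n".join(lines)

def run_agent(task: str, call_model, tools: dict, max_steps: int = 20):
    state = {"task": task, "history": []}               # working memory ("RAM")
    for _ in range(max_steps):                          # the loop a raw LLM lacks
        action = call_model(render_prompt(state))       # one-shot generation
        if action["type"] == "finish":
            return action["answer"]
        result = tools[action["tool"]](**action["args"])  # "device drivers"
        state["history"].append((action, result))      # persist state across calls
    raise RuntimeError("step budget exhausted")
```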


Most harnesses encode their control logic in Python or YAML. That means things like: when to retry, how to handle sub-agent spawning, what state to persist, how to route between tools. These are written as code because that’s what engineers reach for.
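A hypothetical but representative example of what that looks like in practice, with routing, retries, and escalation all frozen into code the model never reads:

```python
# Hypothetical but representative control logic, encoded the usual way.
# The model never reads any of this; it just receives whatever the code
# decides to send it.

MAX_RETRIES = 3

def route(task: dict) -> str:
    # Routing policy frozen into code.
    if task["kind"] == "browser":
        return "browser_tool"
    if task.get("needs_search"):
        return "search_tool"
    return "shell_tool"

def spawn_subagent(task: dict) -> dict:
    # Placeholder escalation path; a real harness would fork a child agent.
    return {"ok": False, "error": "escalated", "task": task}

def execute(task: dict, tools: dict, log: list) -> dict:
    tool = route(task)
    for _ in range(MAX_RETRIES):                 # retry policy, also in code
        result = tools[tool](task["input"])
        if result["ok"]:
            log.append(result)                   # state the harness persists
            return result
    return spawn_subagent(task)                  # failure handling, in code
```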

The Pan et al. paper asked a different question: what if you wrote all of that in structured natural language instead?

Their architecture has three layers. At the bottom: the actual infrastructure and tools. In the middle: a “runtime charter” — universal physics for how contracts bind, how state persists, how child agents are managed. At the top: the natural language agent harness, which holds state-specific logic, roles, contracts, and failure modes.
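To make the contrast concrete, here is what the routing-and-retry logic above might look like rewritten as structured natural language. This is illustrative wording in the spirit of the paper's top layer, not the actual charter text:

```python
# The same policy as structured natural language the model reads directly.
# Illustrative wording only, not the paper's actual charter.

HARNESS_INSTRUCTIONS = """
ROLE
You are a desktop-automation agent. You complete one task per run.

ROUTING
- Browser tasks: use browser_tool.
- Tasks that need fresh information: use search_tool.
- Everything else: use shell_tool.

RETRIES
If a tool call fails, retry up to 3 times with the same arguments.
After 3 failures, hand the task to a child agent and report why.

STATE
After every successful tool call, record what you did and what you
learned. Consult that record before every new decision.
"""
```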

The separation matters because it enables clean ablation. You can swap the harness while holding the runtime fixed, which means you're testing harness design in isolation. That's how they got the OS Symphony result: same runtime, same tools, different representation.


Why natural language outperforms code for control logic

This is the part that requires some thought, because it’s counterintuitive. Code is precise. Natural language is ambiguous. Why would the ambiguous version perform better?

The answer is that LLMs are trained on natural language. When you write control logic in Python, the model has to parse the Python, infer intent, and then act. When you write it in natural language, the model is operating in its native medium. The instructions are more directly interpretable, which means less cognitive overhead per decision, which means fewer errors and fewer recovery loops.

The ablation data from the paper supports this. The full harness — written in code — burned 16.3 million prompt tokens per sample, with more than 600 tool calls and over 32 minutes of runtime. A stripped-down version using natural language control logic used 1.2 million tokens, 51 calls, and under 7 minutes. Same destination. Fourteen times the compute for the same result, just from the representation choice.

There’s a related finding that’s worth sitting with: verifiers hurt performance. The ablation showed −0.8 on SWE-bench and −8.4 on OSWorld when verification modules were included. Multi-candidate search hurt by 5.6 points. More structure actively degraded results. This is Anthropic’s “subtraction principle” in empirical form: every harness component encodes an assumption about what the model can’t do alone, and those assumptions expire as models improve.

Manus, the agent platform, rewrote their harness five times in six months. Warel removed 80% of their agent’s tools and got better results. The craft here is subtraction as much as addition.


The auto-optimization result that extends this further

Omar Khattab — who built DSPy — published a follow-up paper that takes the natural language harness finding one step further: if representation matters this much, can you find the right harness automatically?

The answer appears to be yes. The system uses Claude Opus 4.6 to read failed execution traces, diagnose what broke, and rewrite a complete new harness. Final scores and raw traces accumulate in a growing file system, and the loop repeats. The scale is what makes it work: 10 million tokens per iteration, 400 times more feedback than any prior method, reading approximately 82 files per round.
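In sketch form, the loop looks something like this. `call_optimizer` and `run_benchmark` are stand-ins for your own optimizer client and evaluation harness, not the paper's code:

```python
# Sketch of the auto-optimization loop: read raw failure traces, ask a
# strong model to rewrite the harness, re-run, repeat. call_optimizer and
# run_benchmark are stand-ins, not the paper's actual system.

from pathlib import Path

def optimize_harness(harness: str, call_optimizer, run_benchmark,
                     trace_dir: Path, rounds: int = 5) -> str:
    best_score, best = run_benchmark(harness), harness
    for _ in range(rounds):
        # Feed raw traces, not summaries: the signal lives in the details.
        traces = "\n\n".join(p.read_text()
                             for p in sorted(trace_dir.glob("*.jsonl")))
        harness = call_optimizer(
            f"Current harness:\n{harness}\n\n"
            f"Raw execution traces, including failures:\n{traces}\n\n"
            "Diagnose what broke and write a complete new harness."
        )
        # run_benchmark is assumed to write fresh traces into trace_dir.
        score = run_benchmark(harness)
        if score > best_score:
            best_score, best = score, harness
    return best
```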

The results on Terminal-Bench 2: 76.4% accuracy, from the only auto-optimized system in a field of hand-engineered entries.

On a 215-task text classification benchmark: 7.7 points above state-of-the-art using 4x fewer tokens. And a smaller model — Haiku — outranked Opus in one experiment through harness optimization alone.

The finding that I think has the most long-term implications: a harness optimized on one model transferred to five other models and improved all of them. The reusable asset is the harness, not the model. You build it once; it works across the model landscape.

One detail from the traces work is worth highlighting. When they removed raw execution traces from the optimization loop, accuracy dropped from 50% to 34%. Replacing the traces with summaries recovered almost none of that loss: 34.9%. The signal lives in the raw details. Summarizing prior failures before feeding them back into the optimizer actively hurts performance; the model needs the unfiltered trace to diagnose correctly.


How to audit your own harness before touching the model

If you’re running an underperforming agent, here’s the order of operations the research suggests. Don’t switch the model first. Audit the harness.

Four questions to ask:

1. What’s in your context window that doesn’t need to be there?

Context bloat is the most common harness problem. Every token in context costs inference time and competes for the model’s attention. If you’re passing full conversation history, large tool schemas, or verbose system prompts, start cutting. The stripped harness in the Pan et al. experiment used 1.2M tokens versus 16.3M, a roughly 14x reduction, for the same task outcome.
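A crude audit script is enough to see where the tokens go. The chars-per-token heuristic and file names below are placeholders; swap in your model's tokenizer and your own prompt assets:

```python
# Rough context audit: where do the tokens go? len()//4 is a crude
# chars-per-token heuristic; use your model's tokenizer for real counts.

def approx_tokens(text: str) -> int:
    return len(text) // 4

def audit_context(parts: dict[str, str]) -> None:
    total = sum(approx_tokens(v) for v in parts.values()) or 1
    for name, text in sorted(parts.items(),
                             key=lambda kv: -approx_tokens(kv[1])):
        n = approx_tokens(text)
        print(f"{name:>15}: {n:>8} tokens ({100 * n / total:.1f}%)")

# Placeholder file names: point these at your own prompt assets.
audit_context({
    "system_prompt": open("system_prompt.txt").read(),
    "tool_schemas": open("tool_schemas.json").read(),
    "history": open("sample_history.txt").read(),
})
```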

2. Which tools does your agent rarely use?

Tool schemas consume context. Tools the agent rarely invokes are pure overhead. Pull your logs, count tool call frequency, and remove anything below a threshold. Warel’s 80% tool removal result suggests the threshold is probably lower than you think.
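Pulling the logs can be a ten-line script. This sketch assumes JSONL logs with a "tool" field per call event; adjust the parsing to your own format:

```python
# Count tool-call frequency from agent logs. Assumes JSONL with a "tool"
# field per call event; adjust to your own log format.

import json
from collections import Counter

def tool_frequency(log_path: str) -> Counter:
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if "tool" in event:
                counts[event["tool"]] += 1
    return counts

counts = tool_frequency("agent_calls.jsonl")   # placeholder path
total = sum(counts.values()) or 1
for tool, n in counts.most_common():
    print(f"{tool:>20}: {n:>6} calls ({100 * n / total:.1f}%)")
# Anything under ~1% of calls is a removal candidate.
```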

3. Are your verification or search loops hurting?

The ablation finding here is stark: −8.4 on OSWorld from adding verifiers. If you have a verification loop, test without it. If you have multi-candidate search, test with a single candidate. The assumption that more checking equals better results doesn’t hold for current frontier models.
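The test can be as simple as running the same suite twice. `run_task` here is a stand-in for your harness entry point, assumed to take a verify flag and return success or failure:

```python
# Minimal ablation: same tasks, verifier on vs. off. run_task is a stand-in
# for your harness entry point; it should return True on task success.

def run_suite(run_task, tasks: list, verify: bool) -> float:
    passed = sum(run_task(t, verify=verify) for t in tasks)
    return passed / len(tasks)

def ablate_verifier(run_task, tasks: list) -> None:
    on = run_suite(run_task, tasks, verify=True)
    off = run_suite(run_task, tasks, verify=False)
    print(f"with verifier:    {on:.1%}")
    print(f"without verifier: {off:.1%} (delta {100 * (off - on):+.1f} pts)")
```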

4. Is your control logic written in code or in natural language?

This is the migration question. If your agent’s routing logic, retry behavior, state management, and failure handling are encoded in Python or YAML, you have a candidate for the 17-point improvement. Rewrite the control logic as structured natural language — keep the runtime infrastructure in code, but move the decision logic into language the model can read directly.

For teams building agents at scale, platforms like MindStudio handle orchestration across 200+ models and 1,000+ integrations with a visual builder — which means the control logic is already expressed closer to natural language than raw Python, and you get the model-switching flexibility the harness transfer finding implies.


The practical migration path

If you want to test the natural language control logic hypothesis on your own agent, here’s a concrete approach.


Start by extracting your current control logic into a document. Write out, in plain English, what your agent does: when it retries, how it decides which tool to call, what it does when a sub-task fails, how it manages state across turns. Don’t worry about being formal — just write it as you’d explain it to a new engineer.

Then restructure that document into three sections: role (what this agent is and what it’s responsible for), state (what information it tracks and how), and failure modes (what to do when specific things go wrong). This maps roughly to the Pan et al. architecture’s top layer.
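As a skeleton, with placeholder contents, the document might look like this:

```python
# Skeleton of the three-section document, as a system-prompt string.
# Placeholder contents; the section structure is what matters.

CONTROL_LOGIC = """
ROLE
What this agent is and what it is responsible for. One task per run.

STATE
What you track: open subtasks, files touched, results so far.
Update this record after every tool call; consult it before every decision.

FAILURE MODES
- A tool call fails twice: switch to the fallback tool and note the failure.
- A sub-task stalls: abandon it, report what you tried, move on.
- An instruction is ambiguous: state your interpretation and proceed.
"""
```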

Replace your Python control logic with this document as a system prompt or harness instruction. Keep your tool definitions and runtime infrastructure in code — those benefit from precision. Move only the decision logic into language.
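In code terms, the split might look like the sketch below. `client.complete` is a stand-in for whatever LLM SDK you use, not a real API:

```python
# The split: tool definitions stay in code (they benefit from precision);
# the decision logic rides along as the system prompt. client.complete is
# a stand-in for your LLM SDK's call.

import subprocess

TOOLS = {
    "read_file": lambda path: open(path).read(),
    "run_shell": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True
    ).stdout,
}

def step(client, control_logic: str, task: str) -> str:
    # One decision per call; the model reads the control logic directly.
    return client.complete(system=control_logic, user=task)
```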

Run your benchmark. Compare token counts, tool call counts, and runtime. The OS Symphony result suggests you should see improvement; the magnitude will depend on how much of your current harness is encoding assumptions the model no longer needs.

One thing to track carefully: raw execution traces. The Khattab paper finding about trace summarization is directly applicable here. If you’re feeding summarized failure logs back into your optimization loop, you’re losing signal. Keep the raw traces.
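Operationally, that means persisting the full event stream per run rather than a digest. A minimal sketch:

```python
# Persist the full event stream per run. Summarize for humans elsewhere if
# you must, but feed the optimizer the raw file.

import json
import time
from pathlib import Path

def save_trace(events: list[dict], trace_dir: str = "traces") -> Path:
    Path(trace_dir).mkdir(exist_ok=True)
    path = Path(trace_dir) / f"run_{int(time.time())}.jsonl"
    with open(path, "w") as f:
        for event in events:   # every prompt, tool call, and error, unfiltered
            f.write(json.dumps(event) + "\n")
    return path
```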

This kind of iterative harness refinement is also where the abstraction level of your tooling matters. Remy takes a related approach at the application layer — you write a spec in annotated markdown, and it compiles a complete TypeScript backend, database, auth, and deployment from that spec. The spec is the source of truth; the generated code is derived output. It’s a different domain, but the underlying principle is the same: expressing intent in a higher-level language and letting the compiler handle the implementation details.


What breaks and how to fix it

The model ignores natural language control logic and falls back on defaults.

This usually means the control logic isn’t specific enough. Natural language needs to be structured, not conversational. Use explicit headers, numbered steps, and conditional statements (“if X, then Y, otherwise Z”). Vague instructions get interpreted loosely; precise instructions get followed.
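For instance, with illustrative wording:

```python
# Vague vs. structured. The first gets interpreted loosely; the second gets
# followed. Illustrative wording, not from the paper.

VAGUE = "Try to recover if something goes wrong with the tools."

STRUCTURED = """
TOOL FAILURES
1. If a tool call errors, retry once with the same arguments.
2. If it errors again, switch to the fallback tool for that category.
3. If the fallback also fails, stop and report both errors verbatim.
"""
```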

Performance improves on some tasks but degrades on others.

This is expected during migration. The natural language harness is better for tasks where the model’s native reasoning helps; the code harness may be better for tasks requiring strict sequencing. Run ablations on task categories, not just aggregate benchmarks.

Token counts go up instead of down after the migration.

Check your natural language control logic for verbosity. The goal is precision, not length. A 200-word control logic document that covers all cases is better than a 2,000-word document that repeats itself. Also check whether you’ve accidentally moved tool schemas into the natural language section — those should stay in code.

The harness doesn’t transfer to other models.

The Khattab paper found transfer across five models, but that was after optimization on one. If you’re hand-writing the harness rather than auto-optimizing it, you may need model-specific tuning. Start with the model you care most about, then test transfer. If you’re comparing Claude Opus 4.6 versus other models for agentic tasks, the harness quality will matter more than the model choice for most tasks.

Auto-optimization loops are expensive.


The Khattab system used 10 million tokens per iteration. That’s not a weekend experiment. If you’re doing manual harness optimization, the four-question audit above is a cheaper starting point. Save the auto-optimization approach for production systems where the compute cost is justified by the performance gain.


Where this research points

The Pan et al. and Khattab papers together suggest that harness engineering is becoming a distinct skill — separate from prompt engineering, separate from model selection, separate from tool integration. The benchmark results are specific enough to take seriously: 361 to 41 minutes, 1,200 to 34 LLM calls, 30.4% to 47.2% on OS Symphony.

The model-switching instinct is understandable. Models are the visible variable. But if the same model’s runtime can vary by nearly 9x depending on the harness wrapper, and if a harness optimized on one model transfers to five others, then the harness is the more durable investment.

The subtraction principle is probably the most actionable takeaway for working engineers. Before you add another tool, another verification loop, another retry mechanism — check whether the model already handles that case. The assumption that more structure equals better performance was reasonable two years ago. The ablation data suggests it’s no longer reliable.

If you want to go deeper on the model-versus-harness question, the effort level tradeoffs in Claude Code are a concrete place to see how reasoning budget interacts with task complexity — another dimension of the same underlying question about where to invest compute. And if you’re thinking about how to reduce the cost of running agents at scale, routing through open-weight models for specific subtasks is a natural complement to harness optimization: get the harness right, then optimize the model routing.

The 361-to-41-minute result is reproducible in principle. The question is whether you’re willing to audit what’s in your harness before reaching for a bigger model.

Presented by MindStudio
