Sam Altman Says the Harness Is Now Inseparable from the Model — Here's What That Means for Builders
Sam Altman told Ben Thompson he can't always tell if a great Codex result came from the model or the harness. What builders need to know.
Sam Altman Admitted He Can’t Tell Where the Model Ends and the Harness Begins
Sam Altman told Ben Thompson something that should reframe how you think about every benchmark you’ve ever read. When asked how important the runtime around the model is — the tools, the state, the loop — Altman said: “Hard to overstate how critical it is. I no longer think of the harness and the model as these entirely separable things.” Then he admitted that when Codex does something impressive, he genuinely doesn’t know how much credit belongs to the model versus the infrastructure around it.
That’s not false modesty. That’s the CEO of OpenAI telling you that the thing you’ve been optimizing — model selection, prompt engineering, fine-tuning — is only half the equation. Maybe less.
If you’re building agents, this matters to you directly.
The Number That Changes How You Read Every Model Benchmark
Here is a concrete data point. Endor Labs ran the same model — GPT-5.5 — in two different harnesses during the same week. In the Codex harness, it scored 61.5% on functionality. In the Cursor harness, it scored 87.2%.
Same model. Same week. 25.7 percentage points of difference from the runtime alone.
Opus 4.7 showed a similar pattern: 87.2% in its native Claude Code harness, 91.1% in Cursor’s harness. Nearly four percentage points gained just by switching the environment the model operates in.
The Endor Labs team wrote the obvious conclusion: “Same model, same week, two harnesses, two different functional results.” But the implication is more uncomfortable than the headline. If harness choice can swing functionality scores by 25 points, then most model comparisons you’ve read — including the ones that informed your infrastructure decisions — were measuring harness performance as much as model capability. The two were never cleanly separated.
This is why Altman’s admission lands differently when you sit with it. He’s not being philosophical. He’s describing a measurement problem that affects every practitioner.
What a Harness Actually Is (and Why the Definition Matters)
Most people use “harness” loosely, often interchangeably with “framework.” They’re not the same thing.
A framework — LangChain, LangGraph, AutoGen — gives you abstractions. State machines, chains, memory connectors. You wire them together. The fundamental assumption is that you, the human architect, are assembling the pieces.
A harness comes at the problem from the opposite direction. It ships a working agent. There’s no assembly step. At its core, it’s a while loop with a tool registry and a permission layer, all pre-wired. You bring the goal; the harness handles the rest.
The distinction matters because it changes what you’re actually building when you choose one over the other. Frameworks are for architects. Harnesses are for agents.
Agentic coding tools — Cursor, Claude Code, Codex — are all harnesses. Each one started from a concrete problem (make a model write and edit code across a real repository) and converged on a remarkably similar architecture. That convergence is not accidental. It reflects what actually has to exist for an LLM to reliably complete multi-step tasks.
There’s a useful three-phase framing here. Phase one was the weights phase: better agents meant bigger models, more parameters, better training. Phase two was the context phase: you don’t always need to change the model, you can change what it sees — prompt engineering, RAG, chain-of-thought. Phase three, where we are now, is the harness engineering phase. The question shifted from “what should we tell the model?” to “what environment should the model operate in?”
Each phase layered on top of the previous one. Weights still matter. Context still matters. But the center of gravity has moved outward.
Nine Things That Have to Exist Inside a Working Harness
If you want to understand why harnesses produce such different results with identical models, you have to look at what’s actually inside them.
The while loop is the foundation. The model reads its system prompt, decides which tool to call, runs the tool, feeds the result back into context, and loops again. This continues until the model produces a text-only response or hits a maximum iteration cap. Everything else in the harness exists to support this loop.
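Stripped to its skeleton, the loop looks something like the minimal TypeScript sketch below, where `callModel` and `runTool` stand in for a real model API and a real tool registry:

```typescript
// Minimal agent loop: call the model, dispatch tools, feed results back,
// repeat until a text-only reply or the iteration cap. `callModel` and
// `runTool` are injected stand-ins, not any particular harness's API.
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };
type ModelReply =
  | { kind: "text"; content: string }
  | { kind: "tool_call"; tool: string; args: Record<string, unknown> };

async function agentLoop(
  callModel: (messages: Message[]) => Promise<ModelReply>,
  runTool: (name: string, args: Record<string, unknown>) => Promise<string>,
  systemPrompt: string,
  goal: string,
  maxIterations = 50,
): Promise<string> {
  const messages: Message[] = [
    { role: "system", content: systemPrompt },
    { role: "user", content: goal },
  ];

  for (let i = 0; i < maxIterations; i++) {
    const reply = await callModel(messages);

    // A text-only reply ends the loop: the model has produced its answer.
    if (reply.kind === "text") return reply.content;

    // Otherwise run the requested tool and feed the result back into context.
    messages.push({ role: "assistant", content: JSON.stringify(reply) });
    messages.push({ role: "tool", content: await runTool(reply.tool, reply.args) });
  }
  throw new Error("Hit the maximum iteration cap without a final answer");
}
```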
Context management is where most naive implementations fail. Every turn, the conversation tree grows. Tool calls accumulate. You hit the context limit. The harness has to decide what to keep verbatim, what to summarize, and what to discard. Claude Code’s compaction strategy — keeping recent messages in full, summarizing older ones — is a specific design choice with real consequences for task coherence. Get this wrong and long tasks degrade badly.
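A minimal version of that kind of compaction pass might look like this, with `summarize` standing in for a model call; real harnesses make subtler decisions about what survives:

```typescript
// Keep the most recent messages verbatim, collapse everything older into a
// single summary entry. Illustrative only; actual compaction strategies vary.
type Msg = { role: string; content: string };

async function compact(
  history: Msg[],
  keepRecent: number,
  summarize: (msgs: Msg[]) => Promise<string>,
): Promise<Msg[]> {
  if (history.length <= keepRecent) return history;

  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);

  // Older turns survive only as a summary; recent turns stay word for word.
  const summary = await summarize(older);
  return [{ role: "system", content: `Summary of earlier conversation:\n${summary}` }, ...recent];
}
```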
Skills and tools are distinct. Tools are primitives: read a file, run bash, search code. Skills are organizational knowledge encoded in markdown files — your team’s conventions, your workflow patterns. Tools are universal; skills are specific to your context.
Sub-agent management handles tasks that are too large or too parallel for a single conversation thread. The harness spawns sub-agents with their own sessions, restricted tool sets, and focused system prompts. The pattern is: spawn, restrict, collect.
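In code, the pattern is short. This sketch assumes a hypothetical `Agent` interface wrapping the main loop:

```typescript
// Spawn one isolated session per task, each with a restricted tool set and a
// focused prompt, then collect the results back into the parent conversation.
type SubTask = { prompt: string; allowedTools: string[] };

interface Agent {
  run(prompt: string, allowedTools: string[]): Promise<string>;
}

async function runSubAgents(agentFactory: () => Agent, tasks: SubTask[]): Promise<string[]> {
  return Promise.all(
    tasks.map((task) => agentFactory().run(task.prompt, task.allowedTools)),
  );
}
```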
Session persistence is what makes a harness durable. Append-only JSON files — every message, every tool result, every compaction event written to disk as it happens. If the process crashes, you resume exactly where you left off. This is not glamorous engineering, but it’s what separates a demo from a production system.
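Here's a rough sketch of that idea using Node's file APIs; the event shape is illustrative, not any particular harness's format:

```typescript
import { appendFileSync, existsSync, readFileSync } from "node:fs";

// Append-only session persistence: every event is written to disk as it
// happens, and resuming is just replaying the file.
type SessionEvent = { ts: string; type: "message" | "tool_result" | "compaction"; payload: unknown };

function recordEvent(sessionFile: string, event: SessionEvent): void {
  // One JSON object per line; a crash mid-task loses at most the current event.
  appendFileSync(sessionFile, JSON.stringify(event) + "\n");
}

function resumeSession(sessionFile: string): SessionEvent[] {
  if (!existsSync(sessionFile)) return [];
  return readFileSync(sessionFile, "utf8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as SessionEvent);
}
```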
System prompt assembly surprises most people. The system prompt is not a static string. It’s a pipeline that walks ancestor directories looking for specific instruction files — agents.md, claude.md — and injects them dynamically. Order matters: static content first, dynamic content second, or you break prefix caching and pay for it in latency and cost.
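A rough sketch of that assembly step, with the directory walk and file names following the description above (the details are illustrative):

```typescript
import { existsSync, readFileSync } from "node:fs";
import { dirname, join } from "node:path";

// Walk from the working directory up to the filesystem root, collecting
// instruction files along the way, then append them after the static prompt.
const INSTRUCTION_FILES = ["agents.md", "claude.md"];

function assembleSystemPrompt(staticPrompt: string, cwd: string): string {
  const collected: string[] = [];
  let dir = cwd;
  while (true) {
    for (const name of INSTRUCTION_FILES) {
      const path = join(dir, name);
      if (existsSync(path)) collected.push(readFileSync(path, "utf8"));
    }
    const parent = dirname(dir);
    if (parent === dir) break; // reached the filesystem root
    dir = parent;
  }
  // Static content first, dynamic content second, so the static prefix stays
  // byte-identical across sessions and prefix caching keeps working.
  return [staticPrompt, ...collected].join("\n\n");
}
```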
Lifecycle hooks are the extensibility layer. A pre-tool hook fires before any tool runs and can allow, deny, or modify the call. A post-tool hook fires after and can inspect results. This is how enterprises adopt harnesses without modifying the harness itself — they inject their compliance and observability logic through hooks.
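A minimal hook dispatcher might look like this; the hook signatures are assumptions, not any specific harness's API:

```typescript
// Pre-tool hooks can allow, deny, or rewrite a call before it runs; post-tool
// hooks inspect results afterward (e.g. for audit logging or observability).
type ToolCall = { tool: string; args: Record<string, unknown> };
type PreToolHook = (call: ToolCall) => { action: "allow" | "deny"; call?: ToolCall; reason?: string };
type PostToolHook = (call: ToolCall, result: string) => void;

function makeDispatcher(
  runTool: (call: ToolCall) => Promise<string>,
  preHooks: PreToolHook[],
  postHooks: PostToolHook[],
) {
  return async (call: ToolCall): Promise<string> => {
    let current = call;
    for (const hook of preHooks) {
      const verdict = hook(current);
      if (verdict.action === "deny") return `Tool call denied: ${verdict.reason ?? "policy"}`;
      if (verdict.call) current = verdict.call; // hooks may modify the call
    }
    const result = await runTool(current);
    for (const hook of postHooks) hook(current, result);
    return result;
  };
}
```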
Permissions and safety are what separate a useful tool from a dangerous one. Modern harnesses classify commands dynamically: ls is read-only, rm needs full access. The harness figures this out by parsing the command string, not by trusting the model’s judgment.
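A toy version of that classification, with illustrative command lists (real harnesses parse far more carefully):

```typescript
// Classify a shell command by parsing the string itself, never by asking the
// model whether it thinks the command is safe. Lists are illustrative only.
const READ_ONLY = ["ls", "cat", "grep", "git status", "git log"];
const DESTRUCTIVE = ["rm", "mv", "dd", "git push --force"];

type Permission = "auto_approve" | "ask_user" | "require_full_access";

function classifyCommand(command: string): Permission {
  const cmd = command.trim();
  const matches = (prefixes: string[]) => prefixes.some((p) => cmd === p || cmd.startsWith(p + " "));
  if (matches(READ_ONLY)) return "auto_approve";
  if (matches(DESTRUCTIVE)) return "require_full_access";
  return "ask_user"; // anything unrecognized escalates to the human
}
```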
The ninth component — built-in skills — is what makes a harness immediately useful out of the box. File operations, git commits, pull request creation, test execution. If your agent can’t read or edit files, it isn’t a coding agent. These primitives are non-negotiable.
The Cursor SDK and What “Harness as a Service” Actually Means
The Cursor SDK is the clearest recent example of what happens when a harness gets exposed as infrastructure. Cursor’s Li Robinson described it as a platform where you can “build local hackable agents with any model or ship products on top of managed cloud agents.” The SDK handles the harness, sandboxing, computer use, GitHub integration — all of it.
When Jack Driscoll embedded a Cursor agent directly into Gmail using the SDK, someone asked why this was different from just calling an LLM with tools. His answer was precise: “The biggest difference is that Cursor SDK isn’t just calling LLM with tools. It’s exposing the same coding agent runtime Cursor already uses. Repo context, edit, search, terminal workflow, streaming status, model choice, and local hosted execution.” Gmail and chat are just the intake layer. The SDK is what can actually operate on a codebase.
Tejas Vavery built a bug-catching agent that can work on a production codebase and see how the app is performing in its own browser window. The key insight from his demo: agents currently write code and hope it works. They can run tests, but tests don’t catch UI behavior, integration issues, or flows that depend on real browser state. Closing that feedback loop — letting the agent actually see the app — changes the reliability calculus entirely.
Robert Brochery used the SDK to embed a Cursor agent in a Chrome plugin for IT triage, letting non-technical users dump code from the browser into a ticket instead of describing the bug in prose. Three different builders, three different surfaces, all using the same underlying harness.
This pattern — harness as a service — is appearing across the industry simultaneously. Anthropic launched Claude managed agents. Microsoft released hosted agents in Foundry, with Satya Nadella writing that “every agent will need its own computer.” OpenAI updated their agents SDK. These aren’t coincidental product launches. They reflect a shared recognition that the harness layer is now infrastructure, the same way compute and payment rails are infrastructure.
The Endor Labs benchmark results make the business case for this infrastructure investment concrete. When you’re choosing between harnesses, you’re not making a developer experience decision. You’re making a capability decision. For teams building on top of MindStudio, which offers 200+ models and 1,000+ integrations with a visual builder for chaining agents and workflows, the harness question is already partially answered — the orchestration layer is pre-built, and you’re composing on top of it rather than assembling it from scratch.
Skill Systems: The Harness Concept Applied to Business Workflows
The harness insight extends beyond coding agents. Once you understand that the environment around the model is doing significant work, you start designing that environment intentionally.
The most sophisticated version of this is what practitioners are calling skill systems. The idea: instead of building one monolithic skill that tries to do everything, you build small focused skills and wire them together with an orchestrator. Each skill does one thing well. The orchestrator manages the sequence, the handoffs, and the human-in-the-loop checkpoints.
A concrete example: a YouTube short-form clip system chains five modular skills — transcript extraction with word-level timestamps, clip selection scoring each moment across five categories, face-tracked reframe to portrait mode, illustrated editing with Remotion-generated animations timed to exact frames, and publishing with thumbnails and descriptions to scheduling tools. One prompt kicks it off. The system runs autonomously. The human reviews the outputs.
The key design principle is that each skill gets exactly the context it needs — nothing more. Sub-agents handle research and processing tasks that would otherwise overwhelm the main context window. The transcript skill is reusable: the same skill feeds the short-form video system and a newsletter creation system and an SEO content production system. Build it once, reuse it everywhere.
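The orchestration layer itself can be small. Here's a sketch of a skill pipeline with an optional human checkpoint between stages; the skill interface is hypothetical:

```typescript
// Small, focused skills wired together by an orchestrator that handles
// sequencing, handoffs, and human-in-the-loop gates between stages.
type Skill = { name: string; run: (input: unknown) => Promise<unknown> };

async function runPipeline(
  input: unknown,
  skills: Skill[],
  reviewCheckpoint?: (skillName: string, output: unknown) => Promise<boolean>,
): Promise<unknown> {
  let current = input;
  for (const skill of skills) {
    current = await skill.run(current);
    // Optional human review between skills; a rejection stops the run.
    if (reviewCheckpoint && !(await reviewCheckpoint(skill.name, current))) {
      throw new Error(`Pipeline stopped at ${skill.name} by the reviewer`);
    }
  }
  return current;
}
```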
This is also where progressive context loading matters. Claude Code’s skill architecture uses three levels: level one loads only the YAML front matter (roughly 100 tokens) to identify the right skill; level two loads the full skill.md (1,000–2,000 tokens) when the skill is selected; level three loads reference files only when the specific request requires them. The harness is doing active context management, not just passing everything to the model and hoping.
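A sketch of those three levels, assuming an illustrative file layout and front-matter format:

```typescript
import { readFileSync } from "node:fs";

// Progressive context loading: level 1 reads only the YAML front matter of a
// skill.md, level 2 loads the full file, level 3 pulls in reference files on
// demand. File layout and parsing are assumptions for illustration.
function readFrontMatter(skillFile: string): string {
  const text = readFileSync(skillFile, "utf8");
  const match = text.match(/^---\n([\s\S]*?)\n---/); // YAML block between --- fences
  return match ? match[1] : "";
}

function loadSkillContext(skillFile: string, referenceFiles: string[], level: 1 | 2 | 3): string {
  if (level === 1) return readFrontMatter(skillFile); // ~100 tokens: just enough to pick the skill
  const full = readFileSync(skillFile, "utf8");       // full skill.md once the skill is selected
  if (level === 2) return full;
  const refs = referenceFiles.map((f) => readFileSync(f, "utf8"));
  return [full, ...refs].join("\n\n");                // references only when the request needs them
}
```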
For teams building production workflows, the PIV loop (plan-implement-validate) methodology combined with Jira MCP integration shows how harness thinking applies to software development processes. The Atlassian MCP server creates Jira tickets, updates issues, and posts comments automatically — the agent manages its own task state rather than requiring human administrative overhead. The harness is handling workflow orchestration, not just code generation.
If you’re building full-stack applications from these kinds of workflows, Remy takes a related approach at the application layer: you write a spec — annotated markdown where prose carries intent and annotations carry precision — and it compiles into a complete TypeScript backend, SQLite database, frontend, auth, and deployment. The spec is the source of truth; the generated code is derived output. It’s the same insight applied one layer up: the environment you design around the model (or the compiler) does significant work.
What This Means for How You Build
The practical implication of Altman’s admission is that model selection and harness selection are now equally important decisions — and most teams are treating them very differently.
Model selection gets careful evaluation: benchmarks, cost analysis, latency testing, capability comparisons. If you want a rigorous look at how GPT-5.4 and Claude Opus 4.6 compare across coding and agentic tasks, that analysis exists and is worth doing.
Harness selection often gets treated as a tooling preference. “We use Claude Code because that’s what the team knows.” That’s a reasonable starting point, but the Endor Labs data suggests it deserves the same rigor as model selection. A 25-point swing in functionality scores is not a rounding error.
The Claude Code source leak revealing a three-layer memory architecture is a good example of why harness internals matter: the self-healing memory system using memory.md as a pointer index is a specific architectural choice that affects how the agent maintains coherence across long sessions. Understanding what’s inside the harness you’re using changes how you configure it.
There’s also a build-versus-buy question that’s becoming more concrete. OpenClaw gave builders a fully customizable harness — you controlled everything from the model to the system prompt to the tool dispatch to the agent loop. That control is valuable. But it also meant that every layer of the stack was yours to assemble, configure, and maintain. The harness-as-a-service products — Cursor SDK, managed agents, hosted agents in Foundry — pre-build those layers and let you focus on what’s specific to your use case.
The analogy to the PC era is imperfect but instructive. The hobbyist era of computing — where you assembled your own machine from a kit — didn’t last long, not because hobbyists were wrong to want control, but because pre-built machines made computing accessible to people who would never have assembled a motherboard. The productivity gains of the 1990s came from Dell desktops, not from more people learning to solder. The same dynamic is playing out in agent infrastructure.
For builders evaluating open-weight models for local AI workflows, the harness question is especially relevant: a strong model running in a weak harness will underperform a weaker model in a well-engineered environment. The benchmark you’re looking at may not be measuring what you think it’s measuring.
The Measurement Problem You Can’t Ignore
Here’s the uncomfortable conclusion. When you read that a model scored X% on a coding benchmark, you’re reading a joint measurement of the model and the harness it ran in. When you see that one model outperforms another, you don’t know how much of that gap is model capability and how much is harness engineering.
Altman’s admission isn’t a curiosity. It’s a description of the current state of the field. The people building the most capable agents — at Anthropic, at OpenAI, at Cursor — are investing heavily in harness engineering precisely because they’ve discovered that the environment around the model is doing more work than most people assumed.
The question for builders is not “which model should I use?” It’s “which model, in which harness, for which task?” Those are three separate variables, and collapsing them into one decision is how you end up with a capable model producing mediocre results.
The harness is not a wrapper around the model. It’s the other half of the system.