Agent Harnesses Beat Model Upgrades: 5 Benchmarks That Prove the Harness Is Now the Product

GPT-5.5 jumped from 61.5% to 87.2% functionality just by switching harnesses. Here's what the data says about harness vs model choice.

MindStudio Team

GPT-5.5 Scored 61.5% in One Harness and 87.2% in Another — Same Week, Same Model

The Endor Labs benchmark results landed quietly, but the numbers are hard to ignore. GPT-5.5, running in OpenAI’s native Codex harness, scored 61.5% on functionality tests. The same model, the same week, running in Cursor’s harness, scored 87.2%. That’s a 25.7 percentage point swing — not from a model upgrade, not from fine-tuning, not from better prompting. From switching the runtime environment around an identical model. If you’re still optimizing your stack primarily by chasing the latest model release, you may be optimizing the wrong variable.

The benchmark tested code for both functionality and security correctness. On the security side, GPT-5.5 in Cursor’s harness scored 23.5%, narrowly edging out Cursor plus Opus 4.7 at 22.9% — both a few points above what either model achieved in its native harness. Endor Labs summarized it plainly: “Same model, same week, two harnesses, two different functional results.”

This is the central question in agent infrastructure right now: does the harness matter more than the model? The data from Endor Labs suggests the answer is yes — at least in coding contexts, and possibly more broadly.

The 25-Point Gap That Changes the Conversation

Plans first. Then code.

Remy writes the spec, manages the build, and ships the app.

The GPT-5.5 result is the most dramatic data point, but it’s not the only one. Anthropic’s Claude Opus 4.7 also improved when moved from its native Claude Code harness to Cursor’s harness — jumping from 87.2% to 91.1% on functionality. That’s nearly four percentage points from a harness swap alone.

To be clear about what’s being measured: these aren’t toy benchmarks. Endor Labs tests real code for whether it works and whether it’s secure. A 25-point functionality gap between harnesses means one runtime environment produces working code on roughly a quarter more of its attempts than the other. For a production engineering team, that’s not a marginal improvement — it’s the difference between a tool that ships and one that doesn’t.

Alex Volkov from the Thursday AI podcast independently confirmed similar findings on WolfBench AI, a separate coding benchmark. He found Cursor’s harness produced the strongest performance for GPT-5.5 and was competitive with Claude Code when running Opus 4.7. Two independent evaluations, same directional result.

What a Harness Actually Does (and Why It Changes Scores)

The word “harness” gets thrown around constantly without much precision. Here’s a concrete definition: a harness is the fixed architecture that turns a model into an agent. The model itself is a one-shot text generator — you ask, it answers, it stops. The harness is what gives it the ability to take action, observe consequences, and keep going until a task is actually done.

Think of the model as an engine and the harness as everything else in the car.

According to a detailed breakdown from @engineerprompt, a modern agent harness has nine distinct components:

1. The while-loop: the outer iteration engine.
2. Context management: deciding what to keep, summarize, or discard as the conversation grows.
3. Skills and tools: the primitives the agent can call.
4. Sub-agent management: spawning parallel workers for large tasks.
5. Built-in skills: the baseline capabilities that ship with the harness.
6. Session persistence: writing state to disk so a crash doesn't lose everything.
7. System prompt assembly: dynamically building the system prompt from files like agents.md or claude.md.
8. Lifecycle hooks: pre- and post-tool injection points for custom logic.
9. Permissions and safety: enforcing what the agent is allowed to do before any tool runs.

Each of those nine components is an opportunity for the harness to either help or hurt the model’s performance. Cursor’s harness, built by a team whose full-time job is making coding agents work, has spent years tuning each layer. A model dropped into that environment gets better context management, better tool dispatch, better error handling — all without any change to its weights.
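
To make the while-loop, tool dispatch, and context management layers concrete, here is a minimal sketch of a harness loop in TypeScript. Every name in it (Harness, callModel, maxContextItems) is illustrative rather than any vendor's API; the point is how much machinery sits around a single model call.

```typescript
// Minimal harness loop. Illustrative only: the model call, tool registry, and
// context budget are placeholders, not a real vendor API.
type ToolCall = { tool: string; args: Record<string, unknown> };
type ModelTurn = { text: string; toolCall?: ToolCall; done: boolean };

interface Harness {
  callModel: (context: string[]) => Promise<ModelTurn>; // the model: text in, text out
  tools: Record<string, (args: Record<string, unknown>) => Promise<string>>;
  maxContextItems: number; // crude stand-in for context management
}

async function runAgent(harness: Harness, task: string): Promise<string> {
  let context: string[] = [`TASK: ${task}`];

  // The while-loop: keep going until the model says the task is done.
  while (true) {
    const turn = await harness.callModel(context);
    context.push(`MODEL: ${turn.text}`);
    if (turn.done) return turn.text;

    if (turn.toolCall) {
      // Permissions and safety checks would run here, before any tool executes.
      const tool = harness.tools[turn.toolCall.tool];
      const result = tool
        ? await tool(turn.toolCall.args)
        : `ERROR: unknown tool ${turn.toolCall.tool}`;
      // Feed the observation back so the model can react to consequences.
      context.push(`TOOL(${turn.toolCall.tool}): ${result}`);
    }

    // Context management: compact older turns once the budget is exceeded.
    if (context.length > harness.maxContextItems) {
      context = [context[0], "SUMMARY: older turns compacted", ...context.slice(-4)];
    }
  }
}
```

Sub-agent management, lifecycle hooks, and session persistence would each add their own layer on top of this loop, and the quality of those layers is exactly what the benchmark gap is measuring.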

What Cursor’s SDK Exposes

The Cursor SDK, released recently, makes this concrete. It exposes the same coding agent runtime Cursor uses internally: repo context, edit, search, terminal workflow, streaming status, model choice, and local or hosted execution. As Jack Driscoll put it after a few days building with a pre-release of the SDK: “The biggest difference in my opinion is that Cursor SDK isn’t just calling LLM with tools. It’s exposing the same coding agent runtime Cursor already uses.”

Remy doesn't write the code. It manages the agents who do.

Remy runs the project. The specialists do the work. You work with the PM, not the implementers.

Driscoll’s demo embedded a Cursor agent directly into Gmail. From his inbox, he could share an email thread into chat, have the agent read it, go edit code in a repository, and stream results back — all without leaving Gmail. The harness is doing the heavy lifting; Gmail is just the intake layer.

Tejas Vavery built a bug-catching agent that can work on a production codebase while watching the app run in its own browser window. The key insight Vavery articulated: “Right now, agents write code and hope it works. They can run tests, but tests don’t catch everything, especially UI behavior, integration issues, or flows that depend on real browser state.” The Cursor harness closes that feedback loop. Robert Brochery used the same SDK to embed a Cursor agent in a Chrome plugin for IT triage — letting non-technical users dump browser context directly into a ticket instead of trying to describe a bug in words.

These aren’t demos of better models. They’re demos of better harness access.

The Three-Phase Evolution That Explains the Benchmark

Akshay, writing on Twitter, framed the current moment as the third phase of agent development. Phase one was the weights phase: better agents meant bigger models, more data, better training. Phase two was the context phase: the realization that you could change model behavior dramatically by changing what the model sees — prompt engineering, RAG, chain-of-thought. Phase three, where we are now, is the harness engineering phase.

In phase three, the question isn’t “what should we tell the model?” It’s “what environment should the model operate in?” The model stays the same. What changes is the task it’s being asked to solve, the tools it has access to, the memory it can draw on, and the runtime that sequences its steps.

Akshay’s concrete example: a coding agent asked to implement a feature, run tests, and open a PR. Without a harness, the model has to hold repo structure, project conventions, workflow state, and tool interactions all inside a fragile prompt. With a harness, persistent memory supplies context, skill files encode conventions, and the runtime sequences steps and handles failures. Same model. Completely different reliability.
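
As a rough sketch of two of those harness responsibilities, here is what system prompt assembly from convention files and on-disk session persistence might look like. The file names follow the agents.md and claude.md pattern mentioned above; the paths, state shape, and helpers (assembleSystemPrompt, loadSession, saveSession) are assumptions for illustration.

```typescript
// Sketch of two harness responsibilities: assembling the system prompt from
// convention files, and persisting session state so a crash does not lose the
// workflow. Paths and shapes are illustrative.
import { existsSync, readFileSync, writeFileSync } from "node:fs";

function assembleSystemPrompt(repoRoot: string): string {
  const parts = ["You are a coding agent working in this repository."];
  for (const file of ["agents.md", "claude.md"]) {
    const path = `${repoRoot}/${file}`;
    if (existsSync(path)) {
      // Project conventions live in files the harness picks up, not in a hand-typed prompt.
      parts.push(`## Conventions from ${file}\n${readFileSync(path, "utf8")}`);
    }
  }
  return parts.join("\n\n");
}

interface SessionState {
  step: "plan" | "implement" | "validate";
  completed: string[];
}

function loadSession(path: string): SessionState {
  return existsSync(path)
    ? (JSON.parse(readFileSync(path, "utf8")) as SessionState)
    : { step: "plan", completed: [] };
}

function saveSession(path: string, state: SessionState): void {
  // Session persistence: workflow state survives a crash or restart.
  writeFileSync(path, JSON.stringify(state, null, 2));
}
```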

The Endor Labs numbers are essentially empirical confirmation of this framework.

Sam Altman Can’t Tell the Difference Either

In a recent interview with Ben Thompson, Sam Altman was asked how important the harness is — the runtime around the model, the tools, the state. His answer: “Hard to overstate how critical it is. I no longer think of the harness and the model as these entirely separable things.”

Thompson pushed: was it the model that produced an impressive Codex result, or the harness? Altman: “Yeah, exactly.” He said he genuinely can’t always tell.

That’s a significant admission from the CEO of the company that built Codex. It’s also a useful frame for reading the Endor Labs benchmark. The 25-point gap between GPT-5.5 in Codex versus GPT-5.5 in Cursor isn’t just a data point about Cursor — it’s a data point about how much of what we attribute to models is actually attributable to the infrastructure around them.

Seven tools to build an app. Or just Remy.

Editor, preview, AI agents, deploy — all in one tab. Nothing to install.

This matters for how you evaluate model releases. When a new model posts a strong benchmark score, the first question should be: which harness was it running in? The GPT-5.4 vs Claude Opus 4.6 comparison gets considerably more complicated once the harness is a major variable.

Harness-as-a-Service Is the New Infrastructure Category

The Cursor SDK isn’t the only product in this space. Anthropic launched Claude managed agents. Microsoft released hosted agents in Foundry, with Satya Nadella writing that “every agent will need its own computer” — each getting its own dedicated enterprise-grade sandbox with durable state, built-in identity and governance, and support for any harness or framework. OpenAI made a significant update to their agents SDK.

These products are all playing in similar territory: selling access to agent runtimes the same way AWS sells compute and Stripe sells payment rails. The harness becomes infrastructure.

The difference from earlier approaches is meaningful. With something like OpenClaw, you had to wire everything yourself — pick the model, write the system prompt, define the tools, build the agent loop, manage context, handle errors, orchestrate sub-agents, figure out state persistence, decide where to deploy. Every layer was yours to assemble and maintain. With harness-as-a-service, the agent loop is pre-built. Tool dispatch is pre-built. Sandboxing, streaming, error handling, context compression — all pre-built and tuned by teams whose full-time job is making those layers work.

You bring three things: which model you want, what tools the agent has access to, and what task you’re handing it. Everything underneath is handled.
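
As a sketch of what that division of labor might look like from the caller's side, here is a hypothetical harness-as-a-service request: you name the model, the tools, and the task, and everything else stays behind the endpoint. None of these names (submitToHostedHarness, AgentJob, the tool strings) belong to a real vendor SDK.

```typescript
// Hypothetical call surface for a hosted harness. The endpoint, field names,
// and tool strings are invented for illustration; no real vendor API is implied.
interface AgentJob {
  model: string;   // which model you want
  tools: string[]; // what the agent is allowed to call
  task: string;    // what you're handing it
}

interface AgentResult {
  status: "completed" | "failed";
  output: string;
}

async function submitToHostedHarness(endpoint: string, job: AgentJob): Promise<AgentResult> {
  // The loop, sandboxing, streaming, and error handling all live behind this call.
  const response = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(job),
  });
  return (await response.json()) as AgentResult;
}

// Usage: three decisions from the caller; the provider handles everything underneath.
const job: AgentJob = {
  model: "gpt-5.5",
  tools: ["repo_search", "edit_file", "run_tests"],
  task: "Implement the feature described in the ticket and open a PR.",
};
```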

For builders thinking about how to structure agent workflows at scale, platforms like MindStudio take a similar approach to the orchestration layer — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows without writing the orchestration code from scratch.

Skill Systems Show What Good Harness Design Enables

One concrete example of what harness infrastructure unlocks: skill systems. The concept is straightforward — instead of building one giant skill that tries to do everything, you build small focused skills and wire them together through an orchestrator.

A YouTube short-form clip system built on Claude Code skills chains five modular components: transcript extraction (word-level timestamped output), clip selection (scoring five clip-worthy moments across multiple categories), face-tracked reframe (detecting faces on every sampled frame and rendering to 9:16 portrait), illustrated editing (generating pop-out animations timed to exact keyword frames using Remotion), and publishing (packaging with thumbnail, title, description, and scheduling via Zioo). One prompt kicks it off. The orchestrator skill handles the chain. Sub-agents spin off at relevant points to keep context windows narrow.

The key design principle: each skill in the chain gets exactly what it needs, nothing more. Context management becomes the critical variable. And because the skills are modular, the transcript extraction skill feeds into the short-form video system and the newsletter creation system and the SEO content system — built once, reused everywhere.
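
Here is a minimal sketch of that orchestration pattern: small typed skills, an orchestrator that owns the hand-offs, and each stage receiving only the data it needs. The skill names mirror the clip pipeline described above, but the bodies are stubs for illustration, not the actual system.

```typescript
// Sketch of the orchestrator pattern: small typed skills, explicit hand-offs,
// each stage seeing only what it needs. The implementations are stubs.
type Skill<In, Out> = (input: In) => Promise<Out>;

interface Transcript { words: { text: string; startMs: number }[] }
interface Clip { startMs: number; endMs: number; label: string }

const extractTranscript: Skill<string, Transcript> = async (_videoUrl) => {
  // Real skill: word-level timestamped transcript of the source video.
  return { words: [{ text: "hello", startMs: 0 }] };
};

const selectClips: Skill<Transcript, Clip[]> = async (transcript) => {
  // Real skill: score and pick clip-worthy moments across categories.
  return [{ startMs: 0, endMs: 30000, label: `opens with: ${transcript.words[0]?.text}` }];
};

const reframeClip: Skill<Clip, string> = async (clip) => {
  // Real skill: face-tracked 9:16 render; here, just a placeholder output path.
  return `clip-${clip.startMs}-${clip.endMs}.mp4`;
};

// The orchestrator owns the chain; no stage needs the whole context.
async function runClipPipeline(videoUrl: string): Promise<string[]> {
  const transcript = await extractTranscript(videoUrl);
  const clips = await selectClips(transcript);
  return Promise.all(clips.map((clip) => reframeClip(clip)));
}
```

Because each skill has a narrow typed interface, the same extractTranscript stage can feed a clip pipeline, a newsletter pipeline, or an SEO pipeline without modification.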

Remy is new. The platform isn't.

Remy is the latest expression of years of platform work. Not a hastily wrapped LLM. MindStudio has been shipping agent infrastructure since 2021.

This is what Claude Code’s progressive context loading is designed to support. Level one is the initial search: Claude reads only the YAML front matter of each skill (roughly 100 tokens) to find the right one. Level two loads the full skill.md (1,000–2,000 tokens) once the right skill is identified. Level three loads reference files only when the specific request requires them. The harness manages this loading sequence automatically. A mega-skill that tries to do everything blows up this design — everything loads at once, the model gets overwhelmed, output quality drops.
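
Here is a rough sketch of that three-level loading sequence, assuming a conventional skills directory where each skill folder holds a skill.md with YAML front matter and an optional references/ folder. The layout and parsing are simplified assumptions; the real harness manages this automatically.

```typescript
// Rough sketch of three-level progressive context loading over a skills directory.
import { readFileSync, readdirSync } from "node:fs";

interface SkillSummary { name: string; description: string; dir: string }

// Level 1: read only the front matter of each skill (~100 tokens each) to find a candidate.
function indexSkills(skillsRoot: string): SkillSummary[] {
  return readdirSync(skillsRoot).map((name) => {
    const dir = `${skillsRoot}/${name}`;
    const text = readFileSync(`${dir}/skill.md`, "utf8");
    const frontMatter = text.split("---")[1] ?? "";
    const description = /description:\s*(.*)/.exec(frontMatter)?.[1] ?? "";
    return { name, description, dir };
  });
}

// Level 2: load the full skill.md only once the right skill is identified.
function loadSkillBody(skill: SkillSummary): string {
  return readFileSync(`${skill.dir}/skill.md`, "utf8");
}

// Level 3: pull in a reference file only when the specific request needs it.
function loadReference(skill: SkillSummary, file: string): string {
  return readFileSync(`${skill.dir}/references/${file}`, "utf8");
}
```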

For teams building on top of these patterns, the Claude Code source leak revealing the three-layer memory architecture offers additional context on how the harness manages state across sessions.

The PIV Loop and What It Means for Harness Design

One practical methodology that’s emerged from harness-aware engineering is the PIV loop: plan, implement, validate. The idea is that the agent’s job isn’t just to write code — it’s to plan the implementation, execute it, and then validate its own work before handing control back to a human.
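
A minimal sketch of that loop, with placeholder plan, implement, and validate steps rather than any specific agent product. The retry cap is an assumption; the point is that a failed validation feeds back into another attempt instead of handing broken work to a human.

```typescript
// Minimal plan-implement-validate loop with placeholder steps.
interface PivStep {
  plan: () => Promise<string>;
  implement: (plan: string) => Promise<void>;
  validate: () => Promise<boolean>; // e.g. type check, lint, unit tests
}

async function pivLoop(step: PivStep, maxAttempts = 3): Promise<boolean> {
  const plan = await step.plan(); // plan the implementation first
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await step.implement(plan); // execute against the plan
    if (await step.validate()) {
      return true; // validated work is handed back to the human
    }
    // Validation failed: loop and let the agent revise before escalating.
  }
  return false; // repeated failure: escalate instead of shipping broken work
}
```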

The Atlassian MCP server integration with Claude Code makes this concrete. The agent can create Jira tickets from a PRD, assign itself work, implement a feature, run type checking and linting and unit tests, and then post a comment to the Jira ticket with implementation details — all without a human touching the task management system. The harness handles the sequencing; the MCP server handles the integration.

This is the practical version of what Akshay described in the abstract: the harness sequences steps and handles failures. The model doesn’t need to hold all of this in a fragile prompt. The runtime manages it.

For builders evaluating agentic coding models, the question isn’t just raw benchmark performance — it’s how well the model performs inside the specific harness you’re actually going to use.

The Spec Layer Above the Harness

There’s a layer above the harness worth naming. Once you have a reliable agent runtime, the question becomes: what’s the source of truth for what the agent builds? For coding agents, this is increasingly a spec document — annotated markdown that carries intent and precision. Tools like Remy take this seriously: you write a spec, and it compiles into a complete full-stack application — TypeScript backend, SQLite database with auto-migrations, frontend, auth, tests, deployment. The spec is the source of truth; the generated code is derived output. The harness executes; the spec directs.

What the Benchmark Actually Tells You

The Endor Labs result — 61.5% to 87.2% for GPT-5.5, 87.2% to 91.1% for Opus 4.7 — isn’t an argument that models don’t matter. Weights still matter. Context engineering still matters. But the center of gravity has moved.

If you’re making infrastructure decisions based primarily on which model scores highest on a benchmark, you’re missing a major variable. The harness the model runs in is now a first-order consideration — not a secondary implementation detail.

The Hermes Agent framework and similar open-source alternatives are worth watching for the same reason: they’re competing on harness quality as much as on model access.

The benchmark data is clear. The same model in a better harness outperforms itself in a worse one by margins that would make any model upgrade look modest. That’s the finding. What you do with it is the engineering question.

Presented by MindStudio
