Cursor SDK + GPT-5.5 Scores 87.2% vs Native Codex's 61.5% — The Harness Is the Bottleneck
Switching GPT-5.5 from Codex's native harness to Cursor's SDK lifted its functionality score from 61.5% to 87.2%, a 26-point gain from the harness alone.
GPT-5.5 Scored 61.5% in Codex. Then Someone Swapped the Harness.
The Endor Labs benchmark result is stark: GPT-5.5 running inside Cursor’s SDK hit 87.2% on the functionality test. The same model, the same week, running inside its native Codex harness scored 61.5%. That’s a 26-point gap — and the model didn’t change at all.
This is the benchmark result that should reframe how you think about model selection. You’ve probably spent time comparing GPT-5.5 vs Claude Opus 4.7, reading benchmark tables, trying to figure out which model to build on. But if a harness swap can move a score by 26 points, the model choice might be the second most important variable, not the first.
The Endor Labs report tested code for both functionality and security. On the security section, Cursor plus GPT-5.5 scored 23.5% — narrowly beating the previous leader, which was Cursor plus Opus 4.7 at 22.9%. Both of those scores were a few percentage points above what either model achieved in its native harness. The pattern held across both models and both test dimensions: the harness mattered, consistently.
What the Endor Labs Numbers Actually Show
To understand why this result is surprising, you need to know what a harness is.
When you use GPT-5.5 inside Codex, you’re not just using the model. You’re using the model plus everything around it: the agent loop that decides what to do next, the tool dispatch layer, the sandboxing, the context management, how errors get handled, how state persists between steps. That whole environment is the harness.
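To make that concrete, here is a minimal sketch of what a harness's agent loop does around the model. Every name in it (Tool, Model, runAgent) is a hypothetical stand-in, not any real SDK's API; the point is that dispatch, error handling, and state all live outside the model.

```typescript
// Minimal sketch of an agent harness loop. Every name here (Tool, Model,
// runAgent) is a hypothetical stand-in, not a real SDK's API; the point is
// that dispatch, error handling, and state live outside the model.

interface Tool {
  name: string;
  run(args: Record<string, unknown>): Promise<string>;
}

interface Model {
  // Given the history so far, return either a tool call or a final answer.
  next(history: string[]): Promise<
    | { kind: "tool"; name: string; args: Record<string, unknown> }
    | { kind: "done"; answer: string }
  >;
}

async function runAgent(model: Model, tools: Tool[], task: string): Promise<string> {
  const history: string[] = [`task: ${task}`]; // context management lives here
  for (let step = 0; step < 50; step++) {      // the harness decides when to give up
    const action = await model.next(history);
    if (action.kind === "done") return action.answer;

    const tool = tools.find((t) => t.name === action.name);
    if (!tool) {
      history.push(`error: unknown tool ${action.name}`); // error handling
      continue;
    }
    try {
      history.push(`${action.name} -> ${await tool.run(action.args)}`); // tool dispatch
    } catch (err) {
      history.push(`${action.name} failed: ${String(err)}`); // failures fed back as state
    }
  }
  throw new Error("step budget exhausted");
}
```

Every design decision in that loop — step budgets, how failures are summarized back into history, when to truncate context — is a place where two harnesses running the same model can diverge.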
Endor Labs tested GPT-5.5 in two different harnesses. First, its native environment — OpenAI’s Codex. Second, Cursor’s SDK, which Cursor describes as a way to “build local hackable agents with any model or ship products on top of managed cloud agents.” Same model weights. Different runtime.
The functionality score went from 61.5% to 87.2%. The security score rose to 23.5%, taking GPT-5.5 from below both previous leaders to the new top result.
Alex Volkov from the Thursday AI podcast ran a separate validation on WolfBench AI, an entirely different coding benchmark, and found the same directional result: Cursor’s harness produced the strongest performance for GPT-5.5, and was roughly on par with Claude Code when running Opus 4.7.
Two independent benchmarks, two different methodologies, same conclusion.
For context on how GPT-5.5 and Opus 4.7 compare in their native environments, the GPT-5.5 vs Claude Opus 4.7 coding comparison covers real-world coding performance in detail — but those results are all native-harness numbers, which means they’re measuring a combination of model and harness that you can now partially decouple.
Why the Harness Gap Is This Large
The intuition for why harnesses matter this much comes from thinking about what an agent actually has to do.
A coding agent asked to implement a feature, run tests, and open a pull request isn’t just calling an LLM once. It needs to understand the repo structure. It needs to sequence steps. It needs to handle tool failures. It needs to decide when it’s done. Without a good harness, all of that coordination has to live inside a prompt — fragile, context-hungry, and easy to break.
With a well-built harness, persistent memory supplies context automatically. Skill files and code conventions are available without stuffing them into the prompt. The runtime sequences steps and handles failures. The model gets to focus on the actual reasoning task instead of also managing its own execution environment.
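As a rough illustration of that division of labor, the sketch below assumes the harness keeps conventions, skill files, and memory outside the prompt and assembles them before each model call. The file paths and the PromptParts shape are invented for this example.

```typescript
// Rough illustration of harness-side context assembly, assuming the harness
// keeps conventions, skill files, and memory outside the prompt. The file
// paths and the PromptParts shape are invented for this example.

import { promises as fs } from "node:fs";

interface PromptParts {
  task: string;
  conventions: string; // repo coding conventions, loaded by the harness
  skills: string[];    // skill files relevant to this kind of task
  memory: string[];    // persistent notes from earlier steps or sessions
}

async function assembleContext(task: string, memory: string[]): Promise<string> {
  // The model never has to ask for these; the harness supplies them on every call.
  const conventions = await fs.readFile("docs/CONVENTIONS.md", "utf8");
  const skills = [await fs.readFile("skills/testing.md", "utf8")];

  const parts: PromptParts = { task, conventions, skills, memory };
  return [
    `## Task\n${parts.task}`,
    `## Conventions\n${parts.conventions}`,
    ...parts.skills.map((s, i) => `## Skill ${i + 1}\n${s}`),
    `## Memory\n${parts.memory.join("\n")}`,
  ].join("\n\n");
}
```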
Sam Altman said in a recent interview with Ben Thompson: “Hard to overstate how critical it is. I no longer think of the harness and the model as these entirely separable things.” Ben Thompson finished the thought: “Was it the model that’s amazing or the harness that’s amazing?” Altman: “Yeah, exactly.”
The Cursor SDK is specifically designed around this insight. It exposes the same coding agent runtime that Cursor already uses internally — repo context, edit and search tools, terminal workflow, streaming status, model choice, and local or hosted execution. When Jack Driscoll built a demo embedding a Cursor agent directly into Gmail (the agent reads an email thread, edits code, streams results back into the chat window), he explained why this was different from just calling an LLM with tools: “Cursor SDK isn’t just calling LLM with tools. It’s exposing the same coding agent runtime Cursor already uses.”
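A rough sketch of that pattern, with invented types rather than the Cursor SDK's actual API: the runtime owns the loop, the sandbox, and the repo context, and the host interface just consumes a status stream.

```typescript
// Hypothetical shape of embedding a managed agent runtime behind another
// interface. This is NOT the Cursor SDK's actual API; it illustrates the
// difference between "calling an LLM with tools" and consuming a runtime
// that already has repo context, edit/search tools, and streaming status.

interface StatusEvent {
  phase: "planning" | "editing" | "testing" | "done";
  detail: string;
}

interface AgentRuntime {
  // The runtime owns the loop, the sandbox, and the repo context;
  // the caller only supplies the task and watches the stream.
  run(task: string, opts: { repo: string; model: string }): AsyncIterable<StatusEvent>;
}

async function handleEmailThread(runtime: AgentRuntime, thread: string) {
  for await (const event of runtime.run(`address this request:\n${thread}`, {
    repo: "/srv/checkout/app",
    model: "gpt-5.5",
  })) {
    // Stream progress back into whatever hosts the agent:
    // Gmail, a Chrome plugin, a chat window.
    console.log(`[${event.phase}] ${event.detail}`);
  }
}
```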
That runtime is what the benchmark is measuring. And it’s apparently quite good.
What’s Buried in This Result
The obvious headline is “Cursor’s harness beats Codex’s harness for GPT-5.5.” But there’s something less obvious worth sitting with.
OpenAI built GPT-5.5 specifically with agentic tasks in mind — the model is described as focused on goal-driven prompting where you tell it what good looks like and it works backwards from that. It was designed to work well in Codex. And yet a third-party harness outperforms the native one by 26 points on functionality.
There are a few possible explanations. One is that Cursor has been iterating on harness quality longer and more intensively than OpenAI has — Cursor’s entire product is the harness, whereas for OpenAI, the harness is one part of a much larger product surface. Another is that the Codex harness is optimized for a different use case than pure coding benchmarks — Codex has been expanding toward non-technical knowledge workers, with a new onboarding flow that asks whether you work in finance, product, marketing, operations, sales, data science, design, or are a student. A harness tuned for a chief-of-staff use case (reviewing messages, tracking calendar action items) might not be the same harness that maximizes performance on Endor Labs’ security correctness benchmark.
This is actually the most interesting tension in the current Codex trajectory. OpenAI published a “Top 10 use cases for Codex at work” article where the number one use case is a “Chief of Staff” agent — something that reviews your messages, calendar, and tracks action items. That’s a very different task profile from the coding benchmark where Cursor’s harness dominates. OpenAI is making a bet that one interface for everyone is the right approach, rather than splitting technical and non-technical work the way Anthropic has with Claude Code and Claude Co-work. That bet might be correct for adoption, but it creates real tradeoffs in harness optimization.
The security benchmark result is also worth examining separately. Cursor plus GPT-5.5 at 23.5% beat Cursor plus Opus 4.7 at 22.9% on security correctness. These are close numbers, but the direction is notable: GPT-5.5 in Cursor’s harness is now the security benchmark leader. If you’re building something where security correctness in generated code matters — and most production applications should care about this — the harness choice is now part of your security posture, not just your performance profile.
For a broader view of how these models stack up across coding, reasoning, and document tasks, the GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro benchmark comparison gives useful baseline context — though again, those are native-harness results.
What People Are Already Building With the Cursor SDK
The benchmark result is one thing. What’s happening in the days after the SDK launch is another.
Tejas Vavery built a bug-catching agent that has access to a production codebase and a live browser window. The agent can see how the app is actually performing, not just run static tests. Vavery’s framing: “Right now, agents write code and hope it works. They can run tests, but tests don’t catch everything, especially UI behavior, integration issues, or flows that depend on real browser state.” Being able to see the app closes the feedback loop in a way that test suites alone can’t.
Robert Brochery embedded a Cursor agent in a Chrome plugin for IT triage. Non-technical users can dump code from the browser directly into a ticket instead of trying to describe a bug in words and hoping the description is accurate enough to be actionable.
Jack Driscoll’s Gmail integration — where the agent reads an email thread, goes off and edits code, and streams results back — demonstrates something architecturally interesting: the Cursor SDK separates the intake layer (Gmail, a Chrome plugin, whatever interface you want) from the execution layer (the coding agent runtime). You bring the interface. The SDK brings the harness.
This separation is what makes the SDK structurally different from just calling an API. The Cursor SDK supplies the harness itself: sandboxing, computer use, demo videos, and GitHub integration. You bring the model, the tools, and the task. Everything underneath is pre-built by a team whose full-time job is making those layers work well.
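In code, that separation might look something like the sketch below. The types are hypothetical, not the SDK's real surface; what matters is that intake adapters multiply while the execution layer stays fixed.

```typescript
// Sketch of the intake/execution split, with hypothetical types. Each
// intake adapter normalizes requests from a different surface into one
// task shape; the execution layer underneath stays the same.

interface Task {
  description: string;
  attachments: string[]; // code snippets, email bodies, console output
}

// You bring the interface: each surface gets its own adapter.
interface IntakeAdapter {
  toTask(raw: unknown): Task;
}

const gmailIntake: IntakeAdapter = {
  toTask: (raw) => ({ description: String(raw), attachments: [] }),
};

const chromePluginIntake: IntakeAdapter = {
  toTask: (raw) => {
    const { note, snippets } = raw as { note: string; snippets: string[] };
    return { description: note, attachments: snippets };
  },
};

// The SDK brings the execution layer; its internals are opaque to intake.
type ExecutionLayer = (task: Task) => Promise<void>;

async function handle(raw: unknown, intake: IntakeAdapter, execute: ExecutionLayer) {
  await execute(intake.toTask(raw));
}
```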
Platforms like MindStudio take a similar composability approach at the workflow level — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and tools — which means the harness-as-infrastructure pattern is showing up across the stack, from low-level SDKs to no-code builders.
The Abstraction Layer Question
There’s a broader pattern here that’s worth naming.
We’ve moved through several phases in how people build with AI. First, everything was about the model weights — bigger models, better training. Then it was about context — prompt engineering, RAG, few-shot examples. Now the center of gravity has shifted to the harness: the persistent memory, the tool dispatch, the execution sandbox, the error handling, the state management.
Each phase layered on top of the previous one. Weights still matter. Context still matters. But the benchmark result from Endor Labs is evidence that harness quality is now a first-class variable — one that can swing a functionality score by 26 points even when the underlying model is identical.
For developers thinking about where to invest, this suggests that choosing a harness is as important as choosing a model. And for teams building production applications on top of coding agents, the question “which model should we use?” is incomplete without “and in which harness?”
This abstraction trend extends further up the stack too. Tools like Remy take the next step: instead of writing TypeScript and wiring together a harness manually, you write an annotated spec — structured markdown where prose carries intent and annotations carry precision — and Remy compiles it into a complete full-stack application with TypeScript backend, SQLite database, auth, and deployment. The source of truth shifts from code to spec; the code becomes derived output.
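As a purely hypothetical illustration of the pattern — this is not Remy's actual syntax — an annotated spec might look like:

```markdown
# Invoice tracker

Users upload invoices and see totals per vendor, per month.
<!-- prose carries intent; the @annotations below are invented examples -->

## Data
- invoice: @table(vendor: text, amount: cents, due: date)

## Auth
- @auth(email-link): signed-in users see only their own invoices
```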
The Cursor SDK benchmark result is a concrete data point in this larger story: the environment a model runs in is not a detail. It’s a design decision with measurable consequences.
What to Watch (and What to Do Now)
If you’re building with coding agents, the immediate action is straightforward: run your own benchmark on your specific task. The Endor Labs result is a benchmark on their specific test suite. Your production codebase, your task distribution, your definition of “correct” might produce different relative results. But the methodology — test the same model in multiple harnesses — is now clearly worth doing.
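If you want a starting shape for that comparison, here is a minimal sketch, assuming you can wrap each harness behind a common function; the harness map and the passes() check are placeholders for your own setup and task suite.

```typescript
// Minimal sketch of a same-model, two-harness benchmark. It assumes you can
// wrap each harness behind a common function; the harness map and the
// passes() check are placeholders for your own setup and task suite.

type Harness = (task: string) => Promise<string>; // returns the produced code or diff

async function compareHarnesses(
  tasks: string[],
  harnesses: Record<string, Harness>,
  passes: (task: string, output: string) => Promise<boolean>,
): Promise<void> {
  for (const [name, run] of Object.entries(harnesses)) {
    let passed = 0;
    for (const task of tasks) {
      const output = await run(task);
      if (await passes(task, output)) passed++; // your own definition of "correct"
    }
    console.log(`${name}: ${((100 * passed) / tasks.length).toFixed(1)}% functional`);
  }
}
```

Even a dozen representative tasks run through two harnesses will tell you more about your own workload than any published leaderboard.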
If you’re evaluating models for a new project, the Claude Code vs Codex comparison covers how these tools differ on parallel sessions, computer use, and browser integration — useful context for thinking about which harness environment fits your workflow before you’ve committed to one.
The Cursor SDK cookbook is publicly available on their GitHub. If you want to explore what’s possible without writing the orchestration from scratch, dropping that cookbook into Claude or ChatGPT with context about your project is a reasonable starting point for figuring out what you’d actually build.
The deeper watchpoint is what happens as more harness options proliferate. Right now we have Codex’s native harness, Cursor’s SDK, Claude Code, and a growing set of managed agent platforms. If the 26-point gap from Endor Labs holds up across other benchmarks and other models, we’re going to see a lot more attention paid to harness benchmarking specifically — not just “which model is best” but “which model in which harness, for which task type.”
That’s a more complicated question. It’s also a more honest one.
The Google Workspace MCP server — a connector for Gmail, Drive, Calendar, and Chat that works with either Codex or Claude Code — is a small example of how the harness ecosystem is expanding. The harness isn’t just the agent loop anymore. It’s the whole connective tissue between models and the tools people actually use. Getting that connective tissue right, as the Endor Labs numbers suggest, turns out to matter quite a lot.