
Cursor SDK vs Claude Code Harness: Which One Gets More Out of Your Model?

Opus 4.7 scores 91.1% in Cursor vs 87.2% in Claude Code's own harness. The harness gap is now bigger than the model gap.

MindStudio Team

Opus 4.7 Scores Higher in Cursor Than in Its Own Native Harness

Claude Opus 4.7 running inside Cursor's harness outperforms the same model running inside Claude Code's native harness on functionality benchmarks, by nearly four percentage points. That's not a rounding error. That's a harness swap doing more work than a model upgrade.

The specific numbers: Endor Labs ran Opus 4.7 through both environments in the same week. In Claude Code’s native harness, it scored 87.2% on functionality. Switch to Cursor’s harness, same model, same week: 91.1%. The Cursor SDK vs Claude Code native harness comparison isn’t close, and the implications for how you build are significant.

If you’re choosing between Claude Code and Cursor for your coding agent work, you’re not just choosing a UI. You’re choosing a runtime.

The Numbers That Shouldn’t Be Possible

Endor Labs published a benchmark comparing coding agents on both security correctness and functionality. The security results were interesting — GPT-5.5 in Cursor’s harness scored 23.5%, narrowly edging out Opus 4.7 in Cursor at 22.9% — but the functionality results were the real story.

GPT-5.5 scored 61.5% on functionality when running in its native Codex harness. The same model, the same week, running in Cursor’s harness: 87.2%. That’s a 25.7 percentage point swing from a harness change alone. No model update. No fine-tuning. Just a different runtime.


Opus 4.7’s jump was smaller in absolute terms — 87.2% to 91.1% — but it’s arguably more striking. This is Anthropic’s own model performing better in a competitor’s harness than in Anthropic’s own. Claude Code is a well-engineered product. The Cursor harness is just better at extracting performance from Opus 4.7 on this benchmark.

Alex Volkov from the Thursday AI podcast confirmed similar findings on WolfBench AI: Cursor’s harness had the strongest performance for GPT-5.5 and was on par with Claude Code when running Opus 4.7. This isn’t a one-off result.

Sam Altman told Ben Thompson something that now reads as an understatement: “Hard to overstate how critical [the harness] is. I no longer think of the harness and the model as these entirely separable things.” He added that when something impressive comes out of Codex, he genuinely can’t always tell whether to credit the model or the harness. That’s a remarkable admission from the CEO of OpenAI.

Why This Is Non-Obvious

Most people still think about AI capability as a model property. You pick GPT-5.5 or Opus 4.7 based on benchmark scores, price, and context window. The harness is treated as infrastructure — necessary but neutral.

That framing is wrong, and the Endor Labs data proves it.

The harness is the environment the model operates in. It controls what context the model sees, when it sees it, how tool calls are dispatched, how errors are handled, how state persists across turns, and how the agent loop decides when to stop. A model running in a weak harness is like a skilled surgeon operating with the wrong instruments. The capability is there; the environment isn’t surfacing it.
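What does that look like mechanically? Here is a minimal sketch of the outer loop every serious harness implements, in TypeScript. All of the names are illustrative assumptions, not Cursor's or Claude Code's actual internals; the point is that context selection, tool dispatch, error handling, and the stop decision all live in harness code, outside the model.

```typescript
// Minimal agent-loop sketch. Every name here is an illustrative
// assumption, not either product's real internals.

type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };
type ToolCall = { name: string; args: Record<string, unknown> };
type ModelReply = { text: string; toolCalls: ToolCall[] };

interface Harness {
  buildContext(history: Message[]): Message[];        // decides what the model sees this turn
  callModel(context: Message[]): Promise<ModelReply>; // dispatches to whichever model you chose
  runTool(call: ToolCall): Promise<string>;           // executes a tool, captures output or error
  isDone(reply: ModelReply): boolean;                 // decides when to stop iterating
}

async function agentLoop(harness: Harness, task: string): Promise<string> {
  const history: Message[] = [{ role: "user", content: task }];
  for (let turn = 0; turn < 50; turn++) {             // hard cap so the loop always terminates
    const reply = await harness.callModel(harness.buildContext(history));
    history.push({ role: "assistant", content: reply.text });
    if (harness.isDone(reply)) return reply.text;     // the stop condition is a harness choice
    for (const call of reply.toolCalls) {
      const result = await harness.runTool(call);     // errors surface here, then feed back in
      history.push({ role: "tool", content: result });
    }
  }
  throw new Error("turn limit reached");
}
```

Every method on that interface is a tuning surface, which is exactly where two harnesses running the same model can diverge.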

Akshay’s three-phase framework for agent evolution maps this well. Phase one was about weights — bigger models, better training. Phase two was about context — prompt engineering, RAG, few-shot examples. Phase three, where we are now, is about harness engineering. The question shifted from “what should we tell the model?” to “what environment should the model operate in?”

The Cursor SDK announcement crystallized this. Cursor’s Li Robinson described it as a platform to “build local hackable agents with any model or ship products on top of managed cloud agents.” What Cursor is actually selling is access to the same coding agent runtime they use internally — repo context, edit, search, terminal workflow, streaming status, model choice, and local or hosted execution. That’s not a wrapper around an API. That’s a production-grade harness made available as a service.

Jack Driscoll, who built with the SDK during pre-release, put it directly: “The biggest difference in my opinion is that Cursor SDK isn’t just calling LLM with tools. It’s exposing the same coding agent runtime Cursor already uses.” He built a Cursor agent embedded directly in Gmail — read an email thread, dispatch the agent to edit code, stream results back into the chat window. The harness is what makes that possible. Without it, you’re back to stitching together API calls.

This is also why the Claude Opus 4.7 vs 4.6 comparison is incomplete without harness context. The model improvements are real, but the harness you run it in can dwarf those improvements in practice.

What the Cursor Harness Actually Does Differently

To understand why Cursor’s harness outperforms Claude Code’s native harness on these benchmarks, you need to understand what a harness actually contains.

There are nine components in any serious agent harness: the while-loop (the outer iteration engine), context management, skills and tools, sub-agent management, built-in skills, session persistence, system prompt assembly, lifecycle hooks, and permissions and safety. Claude Code has all of these. So does Cursor’s harness. The difference is in the tuning.
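To make the inventory concrete, here is the same list sketched as a single TypeScript interface. The member names are assumptions for illustration; both products ship some version of each piece.

```typescript
// The nine harness components from the paragraph above, as one
// interface. Names are illustrative, not either product's API.

type ToolCall = { name: string; args: Record<string, unknown> };

interface AgentHarness {
  runLoop(task: string): Promise<string>;                     // 1. the while-loop
  buildContext(history: string[]): string[];                  // 2. context management
  tools: Record<string, (call: ToolCall) => Promise<string>>; // 3. skills and tools
  spawnSubAgent(task: string): Promise<string>;               // 4. sub-agent management
  builtinSkills: string[];                                    // 5. built-in skills
  persistEvent(event: object): void;                          // 6. session persistence
  assembleSystemPrompt(cwd: string): string;                  // 7. system prompt assembly
  preToolHook(call: ToolCall): ToolCall | null;               // 8. lifecycle hooks (null = deny)
  checkPermission(call: ToolCall): boolean;                   // 9. permissions and safety
}
```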

Context management is the most likely culprit for the performance gap. On every turn, the context window grows — user messages, tool calls, results. The harness has to decide what to keep verbatim, what to summarize, and what to discard. Claude Code’s compaction triggers around 80-90% of the context limit. Cursor’s harness makes different choices about what stays and what goes, and those choices affect what the model can reason about on any given turn.
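A sketch of what that decision point can look like, assuming a crude characters-per-token estimate and a keep-the-last-N-turns heuristic. The 85% trigger sits inside the range cited above; none of this is either product's real algorithm.

```typescript
// Context-compaction sketch. The threshold and keep-last-N heuristic
// are illustrative assumptions.

type Turn = { role: string; content: string };

const CONTEXT_LIMIT = 200_000;           // tokens; model-dependent
const COMPACT_AT = 0.85 * CONTEXT_LIMIT; // inside the 80-90% range cited above

// crude estimate: roughly 4 characters per token
const estimateTokens = (turns: Turn[]) =>
  turns.reduce((sum, t) => sum + Math.ceil(t.content.length / 4), 0);

function maybeCompact(turns: Turn[], summarize: (old: Turn[]) => string): Turn[] {
  if (estimateTokens(turns) < COMPACT_AT) return turns; // under budget, keep everything verbatim
  const keep = turns.slice(-10);                        // recent turns survive untouched
  const old = turns.slice(0, -10);
  // Everything older collapses into one summary turn. What gets
  // summarized versus silently dropped is exactly where harnesses differ.
  return [{ role: "system", content: summarize(old) }, ...keep];
}
```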

System prompt assembly is another lever. The system prompt isn’t a static string in a well-built harness — it’s assembled dynamically by walking ancestor directories for agents.md and claude.md files, injecting relevant context based on the current task. The order matters: static content first, dynamic content second, to preserve prefix caching. Small differences in how this assembly works can meaningfully affect model behavior.
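As a rough illustration, that assembly might look like the sketch below. The file names come from the paragraph above; the directory walk and the join order are assumptions.

```typescript
// Dynamic system-prompt assembly sketch: walk from the working directory
// up to the filesystem root, collecting agents.md / claude.md files.

import * as fs from "node:fs";
import * as path from "node:path";

function collectInstructionFiles(startDir: string): string[] {
  const found: string[] = [];
  let dir = path.resolve(startDir);
  while (true) {
    for (const name of ["agents.md", "claude.md"]) {
      const file = path.join(dir, name);
      if (fs.existsSync(file)) found.push(fs.readFileSync(file, "utf8"));
    }
    const parent = path.dirname(dir);
    if (parent === dir) break; // reached the filesystem root
    dir = parent;
  }
  return found.reverse(); // outermost (most general) instructions first
}

function assembleSystemPrompt(staticPrompt: string, cwd: string, taskContext: string): string {
  // Static content first so the cached prefix stays byte-identical
  // across turns; per-task dynamic content goes at the end.
  return [staticPrompt, ...collectInstructionFiles(cwd), taskContext].join("\n\n");
}
```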

Session persistence — append-only JSON files that let the agent resume exactly where it left off — affects long-running tasks. If the harness loses state on a crash or compaction event, the model starts over with degraded context. Cursor’s harness has clearly invested heavily here.
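A minimal sketch of the pattern: one JSON object per line, appended and never rewritten, so a crash loses at most a torn final write. The event shape is an assumption; the only detail given above is append-only JSON.

```typescript
// Append-only session log sketch: one JSON event per line.

import * as fs from "node:fs";

type SessionEvent = { turn: number; role: string; content: string };

function appendEvent(sessionFile: string, event: SessionEvent): void {
  // Appends never rewrite earlier lines, so a crash can at worst
  // truncate the final write, never corrupt the whole session.
  fs.appendFileSync(sessionFile, JSON.stringify(event) + "\n");
}

function resumeSession(sessionFile: string): SessionEvent[] {
  if (!fs.existsSync(sessionFile)) return [];
  const events: SessionEvent[] = [];
  for (const line of fs.readFileSync(sessionFile, "utf8").split("\n")) {
    if (!line.trim()) continue;
    try {
      events.push(JSON.parse(line) as SessionEvent);
    } catch {
      break; // a torn final write; everything before it is intact
    }
  }
  return events; // replay these to resume exactly where the agent left off
}
```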

The Cursor SDK also exposes lifecycle hooks: pre-tool hooks that fire before any tool runs (and can allow, deny, or modify the call) and post-tool hooks that run after for auditing. These hooks let you inject custom logic without touching the harness itself. That’s not just a developer convenience — it’s how enterprises adapt harnesses to their specific security and compliance requirements.
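In shape, that contract can be as small as the sketch below. The type names are assumptions, not the Cursor SDK's actual API; the allow, deny, or modify semantics follow the description above.

```typescript
// Pre/post tool hook sketch. Types are illustrative assumptions.

type ToolCall = { name: string; args: Record<string, unknown> };

type PreToolResult =
  | { action: "allow"; call: ToolCall }  // call may have been modified
  | { action: "deny"; reason: string };

interface Hooks {
  preTool(call: ToolCall): PreToolResult;
  postTool(call: ToolCall, result: string): void;
}

// Example policy: block shell commands that touch production paths,
// and log every tool result for auditing.
const complianceHooks: Hooks = {
  preTool(call) {
    if (call.name === "terminal" && String(call.args.cmd).includes("/prod/")) {
      return { action: "deny", reason: "production paths are off-limits" };
    }
    return { action: "allow", call };
  },
  postTool(call, result) {
    console.log(`[audit] ${call.name} -> ${result.slice(0, 80)}`);
  },
};

async function runToolWithHooks(
  hooks: Hooks,
  execute: (call: ToolCall) => Promise<string>,
  call: ToolCall,
): Promise<string> {
  const decision = hooks.preTool(call);
  if (decision.action === "deny") return `tool denied: ${decision.reason}`;
  const result = await execute(decision.call);
  hooks.postTool(decision.call, result);
  return result;
}
```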

Tejas Vavery’s demo illustrates the practical upside: he built a bug-catching agent that works on a production codebase and can see how the app is performing in its own browser window. The agent writes code, runs it, observes the result, and iterates — closing the feedback loop that currently makes human verification a bottleneck. That kind of agent requires a harness that can manage browser state, coordinate tool calls, and maintain coherent context across a long session. Claude Code can do this. Cursor’s harness does it differently, and apparently better on the Endor Labs tasks.

Robert Brochery embedded a Cursor agent in a Chrome plugin for IT triage — non-technical users can dump code from the browser into a ticket instead of describing the bug in prose. Again: the harness is doing the heavy lifting. The model is the engine; the harness is the car.

For teams building multi-agent systems, the Claude Code agentic workflow patterns post covers how these patterns compose — but the harness choice upstream of those patterns shapes what’s actually achievable.

What This Means for How You Build

The immediate practical question: should you switch from Claude Code to Cursor?

Not necessarily. The Endor Labs benchmark tests specific tasks — security correctness and functionality on coding problems. Cursor’s harness is optimized for coding agents. If your use case is coding, the benchmark is directly relevant. If you’re building something else — document processing, research agents, customer-facing workflows — the results may not transfer.

The more important takeaway is that harness selection is now a first-class engineering decision, not an afterthought. When you’re evaluating models for a new project, you should be evaluating model-harness combinations. Comparing GPT-5.4 and Claude Opus 4.6 on raw benchmark scores misses half the picture if you’re not controlling for harness.

The Cursor SDK launch also signals something structural: harness-as-a-service is becoming a real category. Microsoft announced hosted agents in Foundry with Satya Nadella writing that “every agent will need its own computer.” Anthropic launched Claude managed agents. OpenAI updated their agents SDK. These aren’t coincidental. The infrastructure layer for agents is being productized, and the companies building it are competing on harness quality, not just model quality.

For builders who want to compose across models and harnesses without writing the orchestration layer from scratch, platforms like MindStudio handle this at a different level of abstraction — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows, so the harness question becomes a configuration choice rather than an engineering project.

The skill system concept is relevant here too. A five-skill YouTube short-form clip system — transcript extraction, clip selection, face-tracked reframe, illustrated editing, publishing — runs fully autonomously because each skill is modular and the orchestrator skill wires them together. That architecture only works if the underlying harness manages context correctly across the chain. A harness that compacts aggressively might drop the transcript before the editing stage needs it. A harness that preserves too much context might overwhelm the model by the time it reaches publishing. The tuning matters at every step.
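Here is what that orchestration can look like when each stage declares its dependencies explicitly, sketched with stub skills. The stage names mirror the clip pipeline above; the plumbing is an illustrative assumption, not any particular product's skill system.

```typescript
// Orchestrator sketch: run skills in sequence, passing each stage only
// the earlier outputs it declared it needs.

type SkillFn = (inputs: Record<string, string>) => Promise<string>;
type Stage = { name: string; needs: string[]; run: SkillFn };

async function runPipeline(stages: Stage[]): Promise<Record<string, string>> {
  const outputs: Record<string, string> = {};
  for (const stage of stages) {
    // Forward only declared dependencies: the harness-level equivalent
    // of not compacting away the transcript before the edit stage reads it.
    const inputs = Object.fromEntries(stage.needs.map((n) => [n, outputs[n]] as const));
    outputs[stage.name] = await stage.run(inputs);
  }
  return outputs;
}

// Stub skill for illustration; a real skill would dispatch to the model.
const stub = (name: string): SkillFn => async () => `<${name} output>`;

const clipPipeline: Stage[] = [
  { name: "transcript", needs: [], run: stub("transcript extraction") },
  { name: "clips", needs: ["transcript"], run: stub("clip selection") },
  { name: "reframe", needs: ["clips"], run: stub("face-tracked reframe") },
  { name: "edit", needs: ["transcript", "reframe"], run: stub("illustrated editing") },
  { name: "publish", needs: ["edit"], run: stub("publishing") },
];

runPipeline(clipPipeline).then((out) => console.log(Object.keys(out)));
```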

Claude Code’s progressive context loading for skills is elegant: level one reads only YAML front matter (~100 tokens), level two loads the full skill.md (1,000-2,000 tokens), level three pulls reference files only when needed. Cursor’s harness makes different tradeoffs. Neither is universally correct — the right choice depends on your task distribution.
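Sketched in code, the three levels might look like this, assuming a conventional per-skill directory with a skill.md file, YAML front matter delimited by ---, and a references/ folder. The layout is an assumption; the token budgets come from the paragraph above.

```typescript
// Three-level progressive skill loading sketch. Directory layout and
// front-matter format are assumptions.

import * as fs from "node:fs";
import * as path from "node:path";

// Level 1: front matter only (~100 tokens), enough for the model to
// decide whether the skill is even relevant.
function loadFrontMatter(skillDir: string): string {
  const text = fs.readFileSync(path.join(skillDir, "skill.md"), "utf8");
  const match = text.match(/^---\n([\s\S]*?)\n---/);
  return match ? match[1] : "";
}

// Level 2: the skill was selected, so load the full instructions
// (1,000-2,000 tokens).
function loadSkill(skillDir: string): string {
  return fs.readFileSync(path.join(skillDir, "skill.md"), "utf8");
}

// Level 3: pull a reference file only when a step actually needs it.
function loadReference(skillDir: string, name: string): string {
  return fs.readFileSync(path.join(skillDir, "references", name), "utf8");
}
```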

If you’re building production coding agents today and you haven’t benchmarked your model against both Claude Code and Cursor’s harness, you’re leaving performance on the table. The Endor Labs data is specific enough to act on: for Opus 4.7, the Cursor harness extracts roughly 4 percentage points more functionality. For GPT-5.5, the gap is 25 points. Those aren’t marginal differences.

The sub-agent model comparison between GPT-5.4 Mini and Claude Haiku is another place where harness context matters — the sub-agent’s performance depends heavily on what context the orchestrating harness passes down to it.

One more consideration for teams building full-stack applications on top of these agents: Remy takes a different approach to the abstraction stack — you write a spec in annotated markdown, and it compiles into a complete TypeScript backend, SQLite database, frontend, auth, and deployment. The spec is the source of truth; the generated code is derived output. That’s a different layer than harness engineering, but the underlying principle is similar: the environment you give the model shapes what it produces.


The Harness Gap Is Now Bigger Than the Model Gap

That’s the uncomfortable conclusion from the Endor Labs data.

Between Opus 4.7 in Claude Code’s native harness (87.2%) and Opus 4.7 in Cursor’s harness (91.1%), the gap is 3.9 points. Between consecutive model generations, the typical improvement is smaller. The harness is now the faster-moving variable.

This doesn’t mean model improvements don’t matter. They do. But if you’re optimizing for agent performance and you haven’t touched your harness in six months, you’ve probably left more on the table than any model upgrade would recover.

The Cursor SDK makes Cursor’s harness accessible outside the IDE for the first time. That’s the real news. Not that Cursor built a good harness — they clearly did — but that you can now use it to build Gmail integrations, Chrome plugins, and bug-catching agents without being inside Cursor’s IDE. The harness is now portable.

Anthropic knows this. That’s why they launched managed agents. Microsoft knows it. OpenAI knows it. The model wars aren’t over, but the harness wars have started, and the benchmarks are already telling you who’s winning.

Run your own tests. The numbers might surprise you.
