
500% More Merged PRs: 4 Lessons from OpenAI's Symphony Agentic Coding Experiment

OpenAI's internal teams saw 500% more landed pull requests with Symphony. Here are 4 structural lessons about running agents at scale on real codebases.

MindStudio Team

OpenAI’s Internal Teams Landed 500% More Pull Requests. Here’s What the Numbers Actually Mean.

OpenAI published Symphony — an open-source Codex orchestration spec — and buried inside the announcement was a number that should stop you cold: internal teams using the Symphony agentic coding model saw a 500% increase in landed pull requests. Not drafted. Not opened. Landed. Merged. Done.

That’s not a benchmark on a curated dataset. That’s a production signal from the teams building the models themselves. And when you dig into how Symphony actually works — and pair it with what Cursor found when they ran hundreds of agents on large codebases — a set of structural lessons emerges that has nothing to do with which model you’re using. It’s about how you organize agents to do real work at scale.

Here are four of them.


The 500% Number Only Makes Sense If You Understand What Was Broken Before

Before you can appreciate what Symphony fixed, you need to understand what agentic coding looks like when it’s not working.

The naive version of agentic coding is: give the agent a task, let it write code, review the output. That works for demos. It breaks down fast when you have a real backlog, multiple agents running in parallel, and work that spans more than one context window.


Cursor ran into this directly. When they started running hundreds of agents on large coding projects, they discovered that flat agent organizations — everyone running at the same level, no coordination layer — develop predictable failure modes. Agents hold locks too long. They forget to release them. They wait on each other without any mechanism to surface the blockage. And perhaps most interesting: they become risk-averse. Given a choice between a hard end-to-end task and a small, easy ticket, agents in flat orgs consistently gravitate toward the easy stuff. The hard work sits untouched.

This is not a model quality problem. It’s a coordination problem. And it’s the same coordination problem that shows up in human engineering teams when there’s no clear ownership, no shared state, and no way to see what everyone else is working on.

Symphony’s answer to this is to stop treating the agent as a standalone worker and start treating the issue tracker as the control plane. Specifically: a Linear board. The spec defines polling intervals, per-issue workspaces, active and terminal states, retries, observability hooks, concurrency limits, and handoff states. Human review is an explicit handoff state — not an afterthought, not a break in the loop, but a defined transition in the state machine.
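The issue-tracker-as-state-machine idea can be sketched in a few lines. The state names and legal transitions below are illustrative assumptions, not the actual states defined in the Symphony spec:

```python
# Hypothetical sketch of an issue tracker as a control plane: the tracker,
# not the agent, decides which state transitions are legal. State names
# are illustrative; Symphony's spec defines its own states and semantics.
from enum import Enum

class IssueState(Enum):
    BACKLOG = "backlog"
    ACTIVE = "active"          # an agent is working the issue
    IN_REVIEW = "in_review"    # handoff to a human reviewer
    DONE = "done"              # terminal: PR landed
    FAILED = "failed"          # terminal: retries exhausted

# Legal transitions, enforced by the tracker rather than inferred by agents.
TRANSITIONS = {
    IssueState.BACKLOG: {IssueState.ACTIVE},
    IssueState.ACTIVE: {IssueState.IN_REVIEW, IssueState.FAILED},
    IssueState.IN_REVIEW: {IssueState.ACTIVE, IssueState.DONE},
    IssueState.DONE: set(),    # terminal states have no outgoing edges
    IssueState.FAILED: set(),
}

def transition(current: IssueState, target: IssueState) -> IssueState:
    """Reject any move the state machine does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

The point of the sketch is the enforcement direction: an agent can request a transition, but the record is only updated if the move is legal, so no agent can silently skip review or resurrect a terminal issue.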

The 500% lift in landed PRs is what happens when you solve the coordination layer, not just the generation layer.


Lesson 1: Durable State Outside the Model Is Not Optional

The first structural lesson from Symphony is one that sounds obvious until you watch agentic systems fail in production: the context window is not a source of truth.

Context can be summarized. It drifts. It gets truncated. If the work spans multiple runs, multiple agents, or multiple days — which real engineering work does — the state needs to live somewhere the model can read at the start of a run and write back to at the end.

Symphony solves this by assigning a dedicated workspace to every issue. The agent reads the ticket at the start. It writes back what happened. The next run — whether it’s the same agent or a different one — picks up from a known state, not from a reconstructed summary of a previous conversation.

This is why the choice of issue tracker matters more than most teams realize. The Linear board in Symphony isn’t just a visual planning surface. It’s a state machine. Each issue has a status, an owner, a history of what changed and when. The agent doesn’t have to infer where the work stands. It reads a field.
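The read-at-start, write-back-at-end pattern is small enough to sketch directly. The `issue.json` filename and field names here are assumptions for illustration; Symphony defines its own per-issue workspace layout:

```python
# Minimal sketch of durable per-issue state that outlives any single run.
# The file name and record fields are hypothetical, not Symphony's schema.
import json
from pathlib import Path

def load_issue(workspace: Path) -> dict:
    """Read the issue record at the start of a run; never trust chat history."""
    state_file = workspace / "issue.json"
    if state_file.exists():
        return json.loads(state_file.read_text())
    # First run against this issue: start from a known default, not a guess.
    return {"status": "backlog", "owner": None, "notes": []}

def save_issue(workspace: Path, issue: dict) -> None:
    """Write back what happened so the next run picks up from known state."""
    (workspace / "issue.json").write_text(json.dumps(issue, indent=2))
```

The next run, whether it is the same agent or a different one, calls `load_issue` and gets facts from a field, not a reconstructed summary of a previous conversation.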

For teams thinking about building their own agentic coding pipelines, this is the first question to answer before you write a single line of orchestration code: where does the state live when the agent isn’t running? If the answer is “in the conversation history” or “in a log file somewhere,” you’re going to hit the same wall Symphony was designed to get around. Platforms like MindStudio handle this orchestration layer — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — which is why teams building production agentic systems often reach for it before rolling their own state management.


Lesson 2: Flat Agent Orgs Have a Coordination Ceiling

The second lesson is about org structure, and it’s the one that surprised me most when I first read through the Cursor findings.


When you run a small number of agents on a small codebase, flat coordination works fine. Every agent can see every task. There’s no real contention. The agents pick up work and finish it.

Scale that up — hundreds of agents, a large codebase, a real backlog — and the flat model breaks in three specific ways.

First, lock contention. Agents claim work and hold it longer than necessary. Without a coordination layer enforcing timeouts or re-queuing, tasks get stuck in a claimed-but-not-progressing state. Second, blocking without escalation. When an agent hits a dependency it can’t resolve, it waits. In a flat org with no explicit blocker semantics, that wait is invisible to everything else in the system. Third, and most counterintuitive: risk aversion. Agents in flat orgs preferentially pick easy tasks. The hard, high-value work — the end-to-end features, the gnarly refactors — gets avoided because the agent’s local optimization is to complete something, not to complete the right thing.
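One common fix for the stale-lock failure mode is to timestamp every claim and periodically sweep expired claims back into the queue. A minimal sketch, where the 15-minute timeout and the claim-record shape are arbitrary illustrative choices, not values from the spec:

```python
# Hypothetical stale-claim sweeper: claims map issue id -> claim timestamp
# (seconds). Anything held past the timeout is released for re-queuing.
CLAIM_TIMEOUT_S = 15 * 60  # illustrative: 15 minutes

def sweep_stale_claims(claims: dict[str, float], now: float) -> list[str]:
    """Return issue ids whose claim expired; release their locks in place."""
    stale = [issue_id for issue_id, claimed_at in claims.items()
             if now - claimed_at > CLAIM_TIMEOUT_S]
    for issue_id in stale:
        del claims[issue_id]  # free the lock so another agent can claim it
    return stale
```

With a sweeper like this, a claimed-but-not-progressing issue becomes visible and reclaimable instead of silently stuck.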

Symphony addresses this through concurrency limits and explicit state transitions. The spec defines what “active” means, what “terminal” means, and what happens when an agent needs to hand off to a human or wait on another task. The issue tracker enforces these transitions because the state is in the record, not in the agent’s head.
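A concurrency limit of the kind the spec describes can be enforced at claim time rather than trusted to agent behavior. In this sketch, `MAX_ACTIVE` is an illustrative number, not a value from Symphony:

```python
# Hypothetical concurrency gate: a claim only succeeds while the set of
# active issues is under the cap. The cap value is an assumption.
MAX_ACTIVE = 3

def try_claim(issue_id: str, active: set[str]) -> bool:
    """Claim an issue iff it is unclaimed and the active set is under the cap."""
    if issue_id in active or len(active) >= MAX_ACTIVE:
        return False
    active.add(issue_id)
    return True
```

The design choice worth noting: the gate lives in the coordination layer, so it holds even when individual agents misbehave or crash.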

If you’re building multi-agent coding systems and you haven’t read through how Symphony handles concurrency, the Claude Code agentic workflow patterns post covers five real patterns — schema migrations, test loops, and more — that map well onto the same structural problems Symphony is solving.


Lesson 3: Good UX Produces Better Agent Substrate (Counterintuitively)

This one is subtle and easy to miss.

Symphony uses Linear specifically, not Jira, not a custom database, not a purpose-built agent queue. Part of that is probably practical — Linear has a clean API, a clear data model, and a smaller customization surface than Jira. But there’s a deeper reason that has nothing to do with API quality.

Linear was designed to be a tool people actually want to use. The UI is fast. The opinions are strong. The customization surface is deliberately constrained. And because people like using it, they fill in the fields. They keep ownership current. They write real descriptions. They update statuses when things change.

That sounds like a UX story. It’s actually a data quality story.

When people hate a tool, they work around it. They leave fields blank. They put real decisions in Slack threads. They create tickets after the work is done, as a paper trail, not as a live record. The tracker stops reflecting reality and starts reflecting what someone thought they should document.

When people like the tool, more of the actual work ends up in the system. The state is cleaner. The descriptions are accurate. The ownership is current. And when agents arrive, they’re operating against a record that reflects what’s actually happening — not a bureaucratic artifact that lags reality by a week.


The implication for teams is uncomfortable: your agent performance is partly a function of how well your humans have been using your issue tracker. If your Linear board or Jira project is full of stale tickets, missing owners, and statuses that haven’t been touched in months, your agents are going to struggle in exactly the places you want them to help. This is one of the hidden costs of messy operations — humans can compensate with memory and relationships; agents cannot.


Lesson 4: The Handoff State Is Where Most Agentic Systems Break

The fourth lesson is the most operational, and it’s the one that Symphony gets most explicitly right.

Most agentic coding demos end at generation. The agent writes the code, the demo ends, everyone applauds. What happens next — review, feedback, iteration, merge — is left as an exercise for the reader.

Symphony treats human review as a first-class state in the workflow. It’s not a break in the loop. It’s a defined transition: the agent moves the issue to a handoff state, a human reviews, and the issue transitions back to an active state or to a terminal state depending on the outcome. The agent can poll for the result. The work doesn’t disappear into a Slack thread or an email chain where no one can see its status.
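Treating review as a state means the agent's next move is a pure function of what is on the record. A hypothetical sketch, where the status and outcome field names are assumptions rather than Symphony's actual schema:

```python
# Hypothetical handoff logic: the agent polls the issue record and decides
# its next move from the review outcome stored on the record.
def resume_after_review(issue: dict) -> str:
    """Map the issue record to the agent's next action after a handoff."""
    if issue["status"] != "in_review":
        return "keep_working"   # no handoff in progress
    outcome = issue.get("review_outcome")
    if outcome == "approved":
        return "merge"          # transition to terminal: the PR lands
    if outcome == "changes_requested":
        return "revise"         # transition back to active, with feedback
    return "wait"               # review still pending; poll again later
```

Because the decision reads a field instead of a Slack thread, the agent can pick up the thread days later without reconstructing anything.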

This matters enormously for the 500% number. Landed pull requests aren’t just generated pull requests. They’re PRs that made it through review, got feedback incorporated, and got merged. The handoff state is where most agentic systems leak value: the agent does the work, the PR sits in review limbo, the context is lost, and the feedback loop breaks down.

Symphony’s explicit handoff semantics keep the work visible and actionable throughout the entire cycle. The issue tracker knows the PR is in review. It knows who’s reviewing it. It knows when the review happened and what the outcome was. The agent can pick up the thread without reconstructing context from scratch.

For teams building their own orchestration on top of tools like Claude Code Dispatch — which lets you control your local Claude instance remotely — this is the piece most worth stealing from Symphony even if you’re not using the full spec. Make human review a state, not a gap.


The Spec as Source of Truth

There’s a broader pattern underneath all four of these lessons that’s worth naming.

Symphony works because it treats the issue tracker as a spec — a living document of what the work is, who owns it, what state it’s in, and what needs to happen next. The agents don’t decide any of that. They read it, act on it, and write back what they did. The spec is the source of truth. The agent output is derived from it.

This is the same logic behind tools like Remy, MindStudio’s spec-driven full-stack app compiler. You write your application as an annotated markdown spec — prose carries intent, annotations carry precision — and Remy compiles it into a complete TypeScript backend, SQLite database, frontend, auth, and deployment. The spec is the source of truth; the code is derived output. The abstraction is different, but the underlying principle is identical to what Symphony is doing with Linear: give the agent a clean, structured, durable source of truth to work against, and the output quality goes up dramatically.


The 500% lift in landed PRs isn’t magic. It’s what happens when you stop asking agents to infer structure and start giving them structure to work against.


What the Number Actually Tells You

Five hundred percent is a big number. It’s also a number that comes with caveats — we don’t know the baseline, we don’t know the team size, we don’t know how much of the lift came from Symphony’s architecture versus the underlying model improvements in Codex.

But the direction of the result is not surprising once you understand the mechanism. The teams that saw the biggest gains weren’t the ones with the best prompts. They were the ones with the cleanest work state — clear ownership, accurate statuses, real descriptions, explicit handoff semantics. Symphony gave them a framework to run agents against that state systematically.

The question for your team isn’t whether to use Symphony specifically. It’s whether your work tracking is clean enough to be agent substrate. If you ran Symphony against your current Linear board or Jira project tomorrow, would the agents find clear ownership and accurate state? Or would they find a graveyard of stale tickets and implied context that lives in someone’s head?

That’s the diagnostic. And the answer tells you more about your agentic readiness than any benchmark.
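That diagnostic can even be automated: scan each ticket for the fields an agent would need and report the gaps. A toy sketch with hypothetical field names and an arbitrary staleness threshold:

```python
# Toy agent-readiness audit: flag the gaps that would force an agent to
# guess. Field names and the 30-day threshold are illustrative assumptions.
def audit_ticket(ticket: dict) -> list[str]:
    """Return the list of gaps in a single ticket record."""
    gaps = []
    if not ticket.get("owner"):
        gaps.append("missing owner")
    if not ticket.get("description"):
        gaps.append("empty description")
    if ticket.get("days_since_update", 0) > 30:
        gaps.append("stale status")
    return gaps
```

Run something like this across a board and the result is a rough proxy for how much of your tracker is live record versus graveyard.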

For teams thinking about running multi-agent coding systems at scale, the comparison of Paperclip and OpenClaw for multi-agent architectures covers the coordination tradeoffs in more depth — and the structural questions there map directly onto what Symphony is solving at the issue-tracker level.

The boring infrastructure wins. It just wins faster when it’s clean.

Presented by MindStudio
