
What Is Harness Engineering? The Next Evolution Beyond Prompt and Context Engineering

Harness engineering orchestrates multiple AI coding agent sessions to handle complex workflows. It's why Stripe ships 1,300 AI pull requests per week.

MindStudio Team

From Prompts to Pipelines: A New Layer of AI Engineering

Stripe ships roughly 1,300 AI-generated pull requests every week. Not 1,300 prompts. Not 1,300 code suggestions. Full pull requests — reviewed, merged, and deployed. That number is only possible because Stripe isn’t just using AI assistants. They’ve built infrastructure around AI coding agents that allows many agents to work simultaneously, each handling a discrete task, feeding results to the next step in the pipeline.

That infrastructure has a name: harness engineering.

If you’ve been paying attention to how serious engineering teams are deploying AI, you’ve probably noticed a shift. The conversation has moved past “how do I write better prompts” and even past “how do I manage context windows.” The frontier is now about orchestrating systems of AI agents — and harness engineering is the discipline that describes how to do it.

This article breaks down what harness engineering is, how it differs from prompt and context engineering, what it looks like in practice, and why it matters for anyone building with AI today.


The Evolution: Three Generations of AI Engineering Skill

To understand harness engineering, you need to understand what came before it.

Prompt Engineering (Generation 1)

Prompt engineering was the first skill that emerged when language models became useful. The core idea: the quality of your input determines the quality of your output. If you structure your request well, add clear instructions, use specific formatting, and give the model enough context to reason through a task, you get better results.

Prompt engineering is still valuable. But it has a ceiling. It works well for single-turn, contained tasks. It doesn’t scale when you’re trying to build software systems, complete multi-step workflows, or coordinate work across a codebase.

Context Engineering (Generation 2)

Context engineering is the practice of deciding what goes into the context window, and in what form. It’s less about phrasing requests and more about information architecture. Which files does the agent need to see? What documentation is relevant? How do you prevent the context from getting polluted with noise?

As model context windows grew (Claude and Gemini now support over a million tokens), context engineering became more sophisticated. Retrieval-augmented generation (RAG), memory systems, and structured context injection all fall under this umbrella.
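As a concrete illustration, a minimal context assembler might score candidate files for relevance and pack the best ones into a token budget. The file names, scores, and budget below are hypothetical, and real systems would compute relevance with retrieval:

```python
# Minimal context-assembly sketch: pick the highest-relevance files
# that fit inside a token budget. Scores and budget are illustrative.

def assemble_context(candidates, budget_tokens):
    """candidates: list of (path, relevance_score, token_count)."""
    chosen = []
    used = 0
    # Greedily take the most relevant files first.
    for path, score, tokens in sorted(candidates, key=lambda c: -c[1]):
        if used + tokens <= budget_tokens:
            chosen.append(path)
            used += tokens
    return chosen

files = [
    ("src/billing.py", 0.9, 4000),
    ("docs/billing.md", 0.7, 2500),
    ("src/unrelated.py", 0.1, 3000),
]
print(assemble_context(files, budget_tokens=7000))
```

The point is not the greedy heuristic itself but the discipline: context is a budgeted resource, and something deterministic decides what the model sees.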

Context engineering is a more advanced skill than prompt engineering. But it’s still fundamentally about managing a single agent’s awareness.

Harness Engineering (Generation 3)

Harness engineering takes a different approach entirely. Instead of trying to make one agent smarter or more aware, it asks: what if you run many agents, each doing a specific job, and connect their outputs into a coherent pipeline?

A harness, in this context, is the scaffolding that connects multiple AI agent sessions — handling how tasks are assigned, how outputs are passed between agents, how errors are caught, and how the whole system reaches a final result. It’s the difference between one developer and a well-coordinated team.


What Harness Engineering Actually Is

Harness engineering is the discipline of designing, building, and maintaining the infrastructure that orchestrates AI coding agents at scale.

The word “harness” comes from testing in software engineering. A test harness is the scaffolding you build around code to test it systematically — inputs, expected outputs, validation logic. A harness for AI agents follows the same logic: you’re building the system that runs agents, routes their work, checks their outputs, and chains results together.

In practice, a harness might include:

  • Task decomposition logic — breaking a large engineering task into discrete subtasks each agent can handle independently
  • Agent spawning and session management — starting, stopping, and managing multiple agent sessions in parallel or sequentially
  • Output validation — checking that an agent’s work meets criteria before passing it to the next stage
  • Feedback loops — routing failed outputs back to agents for correction
  • Coordination protocols — deciding which agent works on which part of the codebase, and when
  • State management — tracking what’s been done, what’s in progress, and what still needs work
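A toy version of these components, with stand-in functions where real agent sessions would go, might look like this (the task names and validation rule are illustrative, not a real implementation):

```python
# Toy harness loop covering decomposition, session management,
# output validation, a retry feedback loop, and state tracking.
# The "agents" here are plain functions standing in for sessions.

def decompose(task):
    # Task decomposition: split a feature into ordered subtasks.
    return [f"{task}: write tests", f"{task}: implement", f"{task}: document"]

def run_agent(subtask):
    # Stand-in for spawning an agent session on one subtask.
    return {"subtask": subtask, "output": f"result of {subtask}"}

def validate(result):
    # Output validation gate; a real gate would run tests or linters.
    return result["output"].startswith("result of")

def run_harness(task, max_retries=2):
    state = {"done": [], "failed": []}        # state management
    for subtask in decompose(task):
        for attempt in range(max_retries + 1):
            result = run_agent(subtask)
            if validate(result):              # gate before the next stage
                state["done"].append(subtask)
                break
        else:
            state["failed"].append(subtask)   # feedback loop exhausted
    return state

print(run_harness("add invoice export"))
```

Everything outside `run_agent` is ordinary deterministic code, which is exactly where a harness earns its keep.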

This is significantly more complex than writing good prompts. It requires thinking about agents as components in a distributed system — with all the architecture decisions that implies.


Why This Is Different From Multi-Agent Frameworks

You might be thinking: isn’t this just multi-agent AI? Frameworks like LangChain, AutoGen, and CrewAI have supported multi-agent systems for a while. What’s different about harness engineering?

The distinction is mostly about scope and specificity. Multi-agent frameworks are general-purpose tools for building agent networks. Harness engineering is specifically about applying that orchestration logic to software engineering workflows — with all the constraints and patterns that entails.

Coding agents have particular requirements that differ from general AI agents:

  • They work with files, not just text
  • Their outputs need to pass tests, lint checks, and code review
  • They operate on repositories with complex dependency graphs
  • Mistakes compound — a bad commit in step 3 breaks everything in step 7
  • Parallel work can create merge conflicts

Harness engineering addresses these specifics. It’s not just “connect agents together.” It’s “connect agents together in a way that produces working software.”


How Stripe and Other Teams Are Doing It

The Stripe example is instructive. The company has been public about its aggressive adoption of AI coding tools and has reported generating over 1,300 AI pull requests per week. That volume implies a system — not individual engineers sitting with Cursor or Claude Code, but an orchestrated pipeline where agents can pick up tasks, generate code, run tests, and submit PRs with minimal human intervention at each step.

Other companies have made similar moves. According to Anthropic’s reporting on Claude Code usage, teams are increasingly running Claude Code in “agentic” mode — where the model takes multi-step actions autonomously rather than waiting for turn-by-turn input. In headless or non-interactive mode, Claude Code can be integrated into CI/CD pipelines, scheduled jobs, or triggered by issue trackers.

Google has reported that AI now generates more than 25% of new code at the company. Amazon has cited similar numbers. At this scale, you’re not writing individual prompts. You’re engineering systems that run agents.

The pattern that’s emerging across these teams looks roughly like this:

  1. A task comes in (from a GitHub issue, Jira ticket, or internal system)
  2. A harness breaks it into subtasks
  3. Individual agents (often specialized — one for writing tests, one for implementation, one for documentation) execute each subtask
  4. Validation checks run automatically
  5. If a subtask fails, the harness either retries with different instructions or routes to a human review queue
  6. Passing outputs are assembled into a final deliverable (PR, build artifact, updated docs)

This is harness engineering in operation.


The Core Skills Harness Engineers Need

Harness engineering sits at the intersection of software architecture and AI systems design. The people doing it well tend to have a combination of skills that wasn’t previously bundled into one role.

System Design for AI Workloads

You need to think about agents the way you’d think about microservices. Each agent should have a clear input/output contract. Agents should be stateless where possible. Side effects should be explicit and tracked. Failure modes need to be designed for, not just hoped away.

Task Decomposition

This is arguably the hardest skill. Breaking down complex engineering work into subtasks that an AI agent can reliably execute — without losing the coherence of the whole — requires deep understanding of both the problem domain and agent capabilities.

Tasks that are too large overwhelm agents. Tasks that are too granular create coordination overhead. Finding the right decomposition is part science, part craft.

Evaluation and Validation Design

How do you know when an agent’s output is good? For code, you can run tests. But what about code that needs tests written? What about documentation? Configuration changes? Harness engineers need to think carefully about validation criteria for every stage of the pipeline.

Prompt and Context Engineering (Still Required)

Harness engineering doesn’t replace the earlier skills — it builds on them. Each individual agent in a harness still needs good prompts and well-managed context. Harness engineering adds the orchestration layer on top.


Building a Basic Harness: What It Looks Like

You don’t need to be working at Stripe to experiment with harness engineering. Here’s what a basic harness looks like in practice.

Define the Workflow

Start by mapping out the end-to-end task you want to automate. For a simple feature implementation, this might look like:

  1. Parse the feature request and extract requirements
  2. Generate a test file based on the requirements
  3. Implement code to pass the tests
  4. Run the tests and validate output
  5. Write documentation for the new feature
  6. Generate a PR description and open a draft PR

Each step is handled by an agent (or the same agent in a fresh session with specific instructions).

Build the Scaffolding

The scaffolding is the code or workflow system that:

  • Passes outputs from one step to the next
  • Catches failures and decides what to do with them
  • Tracks state across the whole run
  • Logs what happened for debugging

This can be custom code in Python or JavaScript, or it can be built with a workflow automation tool. The key is that the scaffolding is deterministic — the AI is doing the reasoning at each step, but the routing and state management is handled by code you control.
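The bullets above can be sketched as a deterministic runner in which only each step’s body would call a model. The step names and handlers here are hypothetical placeholders:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("harness")

# Deterministic scaffolding: routing, state tracking, and logging are
# plain code; only the body of each step would invoke an AI agent.

def run_pipeline(steps, initial):
    state = {"input": initial, "completed": []}   # tracked across the run
    data = initial
    for name, step in steps:
        log.info("running step: %s", name)
        try:
            data = step(data)                     # pass output to next step
        except Exception as exc:
            log.info("step %s failed: %s", name, exc)
            state["failed_at"] = name             # catch failure, stop routing
            return state
        state["completed"].append(name)
    state["result"] = data
    return state

steps = [
    ("parse", lambda req: {"requirements": req}),
    ("tests", lambda d: {**d, "tests": "test_file.py"}),
    ("implement", lambda d: {**d, "code": "impl.py"}),
]
final = run_pipeline(steps, "add CSV export")
print(final["completed"])
```

Because the runner is ordinary code, a failed run leaves behind a log and a `failed_at` marker you can debug, which is much harder to get from a single long agent conversation.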

Define Validation Gates

After each agent step, define what “success” looks like. For a test generation step, success might mean the test file is syntactically valid Python and references the correct function signatures. For an implementation step, success means the tests pass.

Validation gates are what prevent errors from cascading downstream. Without them, a subtle mistake in step 2 can cause a catastrophic failure in step 6 that’s nearly impossible to debug.
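For the test-generation gate described above, a syntax check plus a name check is one cheap, concrete criterion. The expected function name (`export_csv`) is an assumption for illustration:

```python
import ast

# Validation gate for a generated test file: it must parse as Python
# and reference the function under test. Names are illustrative.

def gate_test_file(source: str, required_names: set) -> bool:
    try:
        tree = ast.parse(source)        # syntactically valid Python?
    except SyntaxError:
        return False
    seen = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return required_names <= seen       # references the right signatures?

good = "def test_export():\n    assert export_csv([]) == ''\n"
bad = "def test_export(:\n    pass\n"   # deliberate syntax error
print(gate_test_file(good, {"export_csv"}), gate_test_file(bad, {"export_csv"}))
```

A gate like this cannot prove the tests are meaningful, but it rejects the most common failure modes before they propagate to the implementation step.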

Handle Failure Gracefully

Agents fail. They misunderstand instructions, produce incomplete outputs, or get stuck in loops. A well-designed harness expects this and handles it gracefully — whether that means retrying with a modified prompt, escalating to a human, or skipping to a known-good state.
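A retry-then-escalate policy like the one described can be a small wrapper around the agent call. The simulated agent, the prompt modification, and the human queue below are all placeholders:

```python
# Retry with a modified prompt, then escalate to a human queue.
# run_agent and the prompt tweak are stand-ins for real calls.

human_queue = []

def run_agent(prompt):
    # Simulated flaky agent: succeeds only once the prompt is clarified.
    return "ok" if "step by step" in prompt else None

def run_with_retries(prompt, max_retries=2):
    for attempt in range(max_retries + 1):
        result = run_agent(prompt)
        if result is not None:
            return result
        # Modify the prompt before retrying (placeholder strategy).
        prompt += " Think step by step and list your changes."
    human_queue.append(prompt)   # escalate after retries are exhausted
    return None

print(run_with_retries("Implement the CSV exporter."))
```

The important design choice is that escalation is explicit: work that the agents cannot finish lands in a queue a person will see, rather than silently disappearing.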


Where MindStudio Fits In

One challenge with harness engineering is that it typically requires non-trivial infrastructure work — usually in Python or Node.js — before you can start experimenting with agent orchestration. That’s a meaningful barrier if you’re trying to prototype a multi-agent workflow without standing up a full backend first.

This is where MindStudio’s multi-agent workflow builder becomes useful. MindStudio lets you build the orchestration layer visually — connecting AI agents in sequence or parallel, defining what gets passed between steps, setting up conditional routing, and integrating with external tools like GitHub, Jira, or Slack without writing infrastructure code.

For harness engineering use cases, the relevant capability is MindStudio’s workflow automation system, which supports chaining AI steps together with validation logic, branching on outputs, and connecting to webhooks or scheduled triggers. You can define what each “agent” in your harness does, what it receives as input, and what it produces — then connect them into a working pipeline.

If you’re a developer who wants more control, MindStudio’s Agent Skills Plugin (@mindstudio-ai/agent) lets coding agents like Claude Code call MindStudio capabilities directly as method calls — so your agents can trigger workflows, send notifications, or invoke other agents as part of their execution. This is a natural fit for harness engineering patterns where you want agents to coordinate rather than operate in isolation.

You can try MindStudio free at mindstudio.ai.


The Organizational Implications

Harness engineering isn’t just a technical shift — it has real implications for how engineering teams are structured and what skills are valued.

New Specialization Emerging

At companies where harness engineering is mature, you’re starting to see dedicated roles for it. Engineers who specialize in AI workflow orchestration, agent infrastructure, and evaluation design. This is similar to how DevOps or platform engineering emerged as specializations within software development.

Changing Leverage

In a world with harness engineering, individual engineers can have dramatically more leverage. A single engineer with a well-designed harness can oversee the work of dozens of parallel agent sessions. The bottleneck shifts from writing code to reviewing, validating, and steering agent output.

Code Review Evolves

When agents are generating code at scale, code review becomes a different kind of work. Reviewers need to develop intuitions for what AI-generated code looks like, where it tends to go wrong, and how to efficiently evaluate whether a harness-generated PR is trustworthy.


Frequently Asked Questions

What is harness engineering?

Harness engineering is the discipline of building and managing the infrastructure that orchestrates multiple AI coding agent sessions. Instead of using a single AI assistant interactively, a harness connects many agents — each handling a specific task — into a coordinated pipeline. The harness manages task assignment, output validation, failure handling, and state tracking across the whole workflow.

How is harness engineering different from prompt engineering?

Prompt engineering focuses on crafting better inputs for a single AI interaction. Harness engineering operates at a higher level — it’s about the system that runs many agents, routes their outputs, and assembles results into working software. You still need prompt engineering skills within a harness, but the harness layer itself is about architecture, not phrasing.

What is context engineering and where does it fit?

Context engineering is the practice of managing what information goes into an AI model’s context window — which files, documents, or data the model can see when it reasons. It’s more advanced than basic prompt engineering and is essential for working with large codebases. Harness engineering builds on context engineering by managing context at a system level — ensuring each agent in the pipeline has exactly the context it needs for its specific job.

Do I need to code to build a harness?

Not necessarily. Harnesses can be built with no-code workflow tools that support AI agent orchestration. However, more complex harnesses — especially those integrating deeply with CI/CD pipelines, repositories, or internal systems — typically require some engineering work. The degree of coding needed depends on the complexity of the workflow you’re automating.

What kinds of tasks are well-suited for harness engineering?

Harness engineering works best for complex, multi-step tasks that can be broken into discrete subtasks with clear inputs and outputs. In software engineering, this includes: feature implementation from spec to PR, automated test generation and bug fixing, large-scale refactoring across a codebase, documentation generation, and dependency upgrades. Tasks that require deep human judgment at every step are less suited to full harness automation.

Is harness engineering the same as agentic AI?

They’re related but not the same. “Agentic AI” broadly refers to AI systems that take multi-step actions autonomously rather than responding to one-off prompts. Harness engineering is a specific methodology for building the orchestration infrastructure that makes agentic AI systems work reliably in software engineering contexts. Not all agentic AI involves a harness, and a harness can coordinate agents that aren’t individually very “agentic.”


Key Takeaways

  • Harness engineering is the third generation of AI engineering skill, following prompt and context engineering — it focuses on orchestrating multiple AI agent sessions into coordinated pipelines.
  • The core of a harness includes task decomposition, agent session management, output validation, failure handling, and state tracking.
  • It’s distinct from general multi-agent frameworks because it’s specifically designed for software engineering workflows, where outputs must pass tests, meet code standards, and integrate into version control.
  • Teams at scale like Stripe are generating thousands of AI pull requests per week using this approach — that volume only works with harness-level infrastructure.
  • The skills required include system design, task decomposition, evaluation design, and traditional prompt/context engineering — not one or the other.

If you want to start experimenting with multi-agent workflows without standing up custom infrastructure, MindStudio gives you the orchestration layer to connect AI agents, manage state between steps, and integrate with your existing tools — free to start, no code required.

Presented by MindStudio
