Stripe Minions vs Shopify Roast: How Enterprise AI Coding Harnesses Actually Work

Stripe and Shopify both built structured AI coding harnesses. Here's how they work, what they have in common, and what you can learn from each.

MindStudio Team

What “AI Coding Harnesses” Actually Are

When engineers at Stripe and Shopify talk about AI-assisted development, they’re not just describing GitHub Copilot suggestions or ChatGPT prompts. They’re describing structured systems — purpose-built scaffolding that wraps AI models in controlled, reproducible workflows designed for production code.

These are AI coding harnesses: frameworks that tell an AI model what it can do, what it can’t touch, what feedback loop to follow, and how to verify its own work before a human ever reviews it. The term “harness” is deliberate. Like a test harness, it constrains behavior. It makes AI output predictable and auditable at scale.

Stripe calls theirs Minions. Shopify built something internally referred to in engineering discussions as Roast. Both have been discussed publicly by engineering leaders, and both represent a meaningful shift in how large engineering organizations are integrating AI into their day-to-day work — not as a chatbot bolted onto an IDE, but as a structured layer of the development pipeline itself.

This article breaks down how each system works, what they have in common, where they differ, and what the patterns they’ve established mean for engineering teams at every scale.


The Problem These Systems Are Solving

Before getting into the specifics of each approach, it’s worth understanding the problem they’re both trying to solve — because it’s not “how do we use AI for coding.” It’s something more specific and harder.

Why off-the-shelf AI tools aren’t enough for enterprise engineering

Large engineering organizations have several properties that make generic AI tools problematic:

  • Massive proprietary codebases with custom abstractions, internal frameworks, and naming conventions no AI model was trained on
  • High compliance and security requirements — you can’t just paste production code into a web-based chat interface
  • Strict review and deployment gates — AI suggestions need to integrate with existing CI/CD pipelines, not bypass them
  • Consistency requirements — a thousand engineers using AI ad hoc will produce wildly inconsistent results
  • Scale — when you have hundreds of engineers using AI daily, even a 5% error rate in AI suggestions creates enormous noise in code review

The challenge isn’t whether AI can write code. It’s whether AI can write your code, in your codebase, in a way that’s consistent, verifiable, and actually helpful rather than creating more cleanup work than it saves.

Both Stripe and Shopify reached the same conclusion: you need a harness. A defined structure that gives the AI enough context to be useful, constrains it from doing things that would cause problems, and produces output in a form humans can quickly evaluate.
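That conclusion can be sketched in a few lines of code: a harness is essentially a loop that assembles context, hands the model a scoped task, and verifies the output before a human sees it. The following is a minimal illustration, not either company's implementation; `call_model` is a stand-in for whatever model API you use, and all names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class HarnessTask:
    """A scoped unit of work the harness hands to the model."""
    instruction: str                                  # what to do
    context: str                                      # codebase context injected into the prompt
    allowed_files: set = field(default_factory=set)   # scope constraint

def run_task(task, call_model, verifiers):
    """Run one task through the harness: prompt -> output -> verification."""
    prompt = (f"{task.context}\n\nTask: {task.instruction}\n"
              f"Only modify: {sorted(task.allowed_files)}")
    output = call_model(prompt)
    # Each verifier is a (name, predicate) pair; failures are surfaced, not hidden
    failures = [name for name, check in verifiers if not check(output)]
    return {"output": output, "passed": not failures, "failures": failures}

# Usage with a stub model and a trivial verifier:
task = HarnessTask("Rename util.foo to util.bar",
                   "Style guide: keyword-only args on public functions.",
                   {"util.py"})
result = run_task(task,
                  call_model=lambda p: "diff --git a/util.py b/util.py ...",
                  verifiers=[("nonempty", lambda o: bool(o.strip()))])
```

The point of the structure is that context, scope, and verification are explicit fields and steps rather than habits a prompter may or may not follow.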

The limits of “just use Copilot”

GitHub Copilot and similar inline tools are genuinely useful for boilerplate, single-function generation, and autocomplete. But they operate at a narrow scope — line-by-line or function-level suggestions.

For larger, more systematic tasks — migrating an API endpoint, updating a library across a service, writing a test suite for a module — inline tools don’t have enough context. They can’t hold the whole task in view or execute a multi-step workflow.

This is the gap that Stripe and Shopify’s harness architectures are filling.


Stripe’s Minions: Parallel AI Agents with Verification

Stripe has been unusually public about the structure of its AI coding infrastructure. Several Stripe engineers and the company itself have shared details in blog posts, conference talks, and interviews about what they call the Minions system.

The core concept: many small agents, not one big one

The name “Minions” hints at the architecture. Rather than having a single large AI agent tackle a complex engineering task, Stripe spins up multiple smaller agents (the “minions”) that work in parallel on decomposed subtasks.

The orchestration layer coordinates these agents — assigning tasks, aggregating outputs, managing conflicts, and running verification. Think of it as a project manager that never sleeps, running a team of junior engineers simultaneously on different parts of the same problem.

This design solves a real issue with single-agent approaches: context window limitations and error compounding. When one AI agent tries to handle a large, complex task end-to-end, errors in early reasoning compound into larger errors later. By splitting the task, each minion operates with a cleaner, smaller context, and errors are isolated.
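The split-and-parallelize pattern can be sketched with standard tooling. This is an illustrative skeleton, not Stripe's code; `run_minion` is a stub where a real harness would call a model with only that subtask's scope in context:

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(task, units):
    """Break one large task into per-unit subtasks with isolated scope."""
    return [{"task": task, "scope": unit} for unit in units]

def run_minion(subtask):
    """Stub agent: a real harness would call a model here, with only the
    subtask's own scope in context so errors stay isolated."""
    return {"scope": subtask["scope"], "status": "done"}

def orchestrate(task, units, max_workers=4):
    """Fan subtasks out to parallel workers and collect results in order."""
    subtasks = decompose(task, units)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_minion, subtasks))

results = orchestrate("migrate deprecated API usage",
                      ["billing", "payouts", "invoicing"])
```

Because each subtask carries its own small scope, a failure in one unit can be retried or discarded without poisoning the others.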

What a Minion task looks like in practice

Stripe’s engineering team has described use cases including:

  • Large-scale refactoring — updating how an internal library is called across dozens of services
  • Migration tasks — systematically updating deprecated API usage across a codebase
  • Test generation — creating test cases for existing functions at scale
  • Documentation passes — generating or updating inline documentation for code modules

For a refactoring task, the flow looks something like this:

  1. A Stripe engineer defines the transformation — what pattern to find, what it should look like after
  2. The orchestrator decomposes the work into scoped subtasks (e.g., individual service or module)
  3. Individual minion agents execute each subtask
  4. Each minion’s output is verified — against tests, against linting, against a defined output spec
  5. Results are aggregated and surfaced for human review in a structured diff

The key phrase is “structured diff.” Stripe engineers aren’t reviewing AI-written code the same way they’d review a PR from a colleague who could explain their reasoning. They’re reviewing a structured output that’s been pre-verified against known criteria. The cognitive load is lower because the harness has already done a first-pass quality check.

Verification is not optional

One of the most important aspects of how Minions works is that verification is built into the loop — not added on top. Each minion output goes through:

  • Automated tests run against the transformed code
  • Linting and type-checking to catch surface-level issues
  • Diff review to confirm the change actually does what was requested
  • Scope checks to verify the minion didn’t touch things it wasn’t supposed to

This last point matters. A common failure mode in AI coding is “scope creep” — the AI decides to fix other things it notices while doing the requested task. That’s fine for a human engineer (sometimes even welcome), but in an automated system, unexpected changes are dangerous. Stripe’s harness explicitly constrains scope.
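A scope check is cheap to implement: compare the set of files the agent actually touched against the set it was allowed to touch. A minimal sketch (file names are illustrative):

```python
def scope_violations(changed_files, allowed_files):
    """Return any files the agent modified outside its assigned scope."""
    return sorted(set(changed_files) - set(allowed_files))

# The agent was asked to touch two files but also edited a third:
violations = scope_violations(
    changed_files=["billing/api.py", "billing/tests.py", "payouts/api.py"],
    allowed_files=["billing/api.py", "billing/tests.py"],
)
# A non-empty list fails verification before any human reviews the diff
```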

The role of context engineering

Stripe has been vocal about the importance of giving AI agents the right context — not just the code files they’re modifying, but:

  • Internal documentation about the codebase architecture
  • Prior decisions (e.g., “we moved away from X library for Y reason”)
  • Style guides and code standards
  • Test patterns already in use in the codebase

This context is injected into each minion’s prompt systematically, not haphazardly. There’s engineering work that goes into deciding what context to include, how to format it, and how to keep it updated.
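Systematic injection can be as simple as assembling named context sections in a fixed order, so every agent sees the same structure and the sections can be maintained independently. A sketch, with section names that are purely illustrative:

```python
def build_context(sections):
    """Assemble prompt context from named sections in a fixed, auditable order."""
    order = ["architecture", "prior_decisions", "style_guide", "test_patterns"]
    parts = [f"## {name}\n{sections[name]}" for name in order if name in sections]
    return "\n\n".join(parts)

context = build_context({
    "style_guide": "Use keyword-only arguments for public functions.",
    "prior_decisions": "We moved off library X because of Y.",
})
# prior_decisions renders before style_guide, following the fixed order
```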

This is sometimes called context engineering — and it’s increasingly seen as a core engineering discipline in AI-forward organizations.


Shopify’s Roast: AI-Assisted Code Review and Developer Feedback

Shopify’s approach has been discussed by engineering leadership including former CTO Jean-Michel Lemieux and more recently in the context of Shopify’s aggressive AI adoption mandate under Tobi Lütke’s leadership.

The term “Roast” in Shopify’s engineering context refers to an AI-powered code review assistant — one that can give detailed, opinionated feedback on pull requests before (or alongside) human reviewers.

What Roast does differently

Where Stripe’s Minions are primarily about generating code at scale, Shopify’s Roast focuses more on reviewing and improving code. The core workflow looks like this:

  1. A developer submits a PR
  2. Roast analyzes the diff, the context around changed code, and relevant parts of the broader codebase
  3. It generates structured feedback — not just “this might be slow” but specific, actionable comments like “this query will cause N+1 issues in the orders table because of the association pattern you’re using here”
  4. Developers can respond to Roast feedback, triggering follow-up analysis
  5. Human reviewers see both the code diff and Roast’s analysis, reducing the time they need to spend on mechanical issues

The “Roast” name reflects a philosophy: the AI is supposed to be direct and critical, not hedging. Shopify’s engineering culture has historically valued blunt, substantive feedback, and the AI system was designed to match that tone.

Shopify’s AI mandate and how Roast fits

In April 2025, Tobi Lütke sent a memo to all Shopify staff stating that AI usage was now a core expectation for employees — not optional. This memo, which was made public, explicitly said that AI usage would be part of performance reviews and that employees who weren’t integrating AI into their work would be at a disadvantage.

This mandate created organizational pressure that made tools like Roast not just useful but expected. Engineers who previously might have used AI ad hoc were now being evaluated on systematic adoption.

Roast became part of the standard code review workflow — not something you opt into, but something that runs on every PR above a certain threshold. This is a meaningful design decision: by making AI review the default, Shopify normalized it and built the feedback loop into the development culture.

The feedback quality problem

One challenge with AI code review that Shopify has worked to address is feedback quality. Early AI code review tools would generate a lot of noise — flagging things that were intentional, suggesting changes that didn’t fit the codebase, or being too generic to be useful.

Shopify addressed this through:

  • Codebase-specific training and context — Roast is not a generic tool; it’s tuned to understand Shopify’s internal frameworks (Rails, their own abstractions, etc.)
  • Feedback filtering — not every suggestion Roast generates surfaces to the developer; there are filtering layers that prioritize higher-confidence, higher-impact suggestions
  • Feedback tracking — Shopify tracks whether developers act on Roast’s suggestions, which feeds back into improving the system

This last point is worth dwelling on. Most AI coding tools have no feedback loop on whether their suggestions were actually useful. Shopify built one. Over time, the system learns which kinds of suggestions are accepted, which are ignored, and which generate discussion — and this improves the quality of future suggestions.
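A simplified version of that filter-and-track loop: each candidate comment carries a confidence and impact score, only those above a threshold surface, and acceptance is tallied per category so the thresholds can be tuned later. All field names here are hypothetical, not Shopify's schema:

```python
def surface(comments, min_confidence=0.7):
    """Keep only high-confidence suggestions, highest impact first."""
    kept = [c for c in comments if c["confidence"] >= min_confidence]
    return sorted(kept, key=lambda c: c["impact"], reverse=True)

def acceptance_rate(history):
    """Fraction of surfaced suggestions the developer acted on, per category."""
    rates = {}
    for category in {h["category"] for h in history}:
        rows = [h for h in history if h["category"] == category]
        rates[category] = sum(h["accepted"] for h in rows) / len(rows)
    return rates

comments = [
    {"category": "perf", "confidence": 0.9, "impact": 3,
     "text": "This query will cause N+1 issues in the orders table"},
    {"category": "style", "confidence": 0.4, "impact": 1,
     "text": "Consider renaming this variable"},
]
shown = surface(comments)  # only the perf comment clears the threshold
```

Categories whose acceptance rate stays low are candidates for raising the threshold or suppressing entirely, which is exactly the noise-reduction loop described above.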

The developer experience side

For the developers using Roast, the experience is designed to feel like a knowledgeable colleague who’s already done a first pass before you hand something off for human review.

The benefits Shopify engineers have described include:

  • Catching mechanical issues (missing tests, potential performance problems, obvious bugs) before a senior engineer spends time on them
  • Faster iteration — getting substantive feedback in seconds rather than waiting for a human reviewer
  • More consistent application of code standards across a large engineering organization
  • Reduced reviewer fatigue — human reviewers can focus on design and judgment rather than style and correctness checking

What Both Systems Have in Common

Despite the different focus areas — Stripe generating code, Shopify reviewing it — these two systems share a set of structural principles that are worth extracting because they likely represent best practices for enterprise AI coding infrastructure.

1. Context is a first-class engineering concern

Both systems invest heavily in context engineering. The AI isn’t working in a vacuum — it has access to internal documentation, codebase conventions, past decisions, and style guides. This context is structured, maintained, and treated as a product in its own right.

This is a departure from the “just prompt it” approach. Engineering teams at this scale recognize that a great prompt with bad context produces mediocre results, while a simple prompt with excellent context produces surprisingly good results.

2. Verification and feedback loops are built in

Neither system is a “generate and ship” pipeline. Both include structured verification — automated tests, linting, quality filters — before human review. And both track what happens to AI-generated or AI-reviewed output, creating feedback loops that improve the system over time.

This is what separates a production-grade AI coding harness from a prototype. The feedback loop is what allows the system to improve.

3. AI is a layer in the existing workflow, not a replacement

Both Stripe and Shopify integrated their AI systems into existing developer workflows — PRs, CI/CD, code review processes — rather than creating parallel AI-specific workflows that engineers would have to context-switch into.

Engineers still submit PRs. They still have human reviewers. They still run tests. The AI is a layer that enhances these existing processes, not a separate system that competes with them.

4. Human judgment is preserved for high-stakes decisions

Both systems have explicit points where human review is required. AI handles the mechanical and pattern-matching work. Humans handle design decisions, trade-offs, and anything where the right answer isn’t deterministic.

This isn’t just about quality — it’s about accountability. In enterprise engineering, someone needs to be responsible for what goes into production. AI-generated code reviewed by a human and merged under that human’s name preserves that accountability chain.

5. The systems are opinionated about scope

Both Minions and Roast have explicit scope constraints. Minions don’t modify things they weren’t asked to modify. Roast doesn’t review what it isn’t asked to review. This constraint-first design makes the systems trustworthy — engineers know what the AI will and won’t do.


Key Differences: Generation vs. Review

While the principles overlap, there are meaningful differences in what these two systems prioritize.

Focus area

|                       | Stripe Minions                          | Shopify Roast                                    |
| --------------------- | --------------------------------------- | ------------------------------------------------ |
| Primary function      | Code generation at scale                | Code review and feedback                         |
| Task type             | Systematic, repeatable transformations  | Evaluation and critique of existing code         |
| Developer interaction | Define task → review structured output  | Write code → receive feedback → iterate          |
| AI agents             | Multiple parallel agents                | Single review agent per PR                       |
| Verification method   | Automated tests + linting + diff review | Feedback quality filtering + acceptance tracking |
| Integration point     | Pre-PR (generates the code)             | At PR submission                                 |

Organizational context

Stripe’s Minions emerged from a need to do large-scale migrations and refactoring across a complex, rapidly evolving platform. The goal was throughput — doing work that would take a team of engineers weeks, in hours.

Shopify’s Roast emerged from a need to maintain code quality at scale as the engineering organization grew. The goal was consistency — making sure every PR gets substantive review, not just the ones that happened to land with an experienced reviewer.

These different origins explain the different architectures. Minions is an agent orchestration system optimized for parallelism and throughput. Roast is a review pipeline optimized for accuracy, feedback quality, and developer experience.

What they tell us about the different challenges

Stripe’s challenge is fundamentally about doing more — accelerating the execution of complex engineering work that they know needs to happen.

Shopify’s challenge is fundamentally about maintaining quality — ensuring that a large, fast-moving engineering organization doesn’t accumulate technical debt or inconsistency as it scales.

Both are real problems at large organizations. And the fact that two well-regarded engineering organizations arrived at different tool shapes tells us something: there isn’t one “right” AI coding harness. The right shape depends on your specific constraint.


What Smaller Teams Can Take From Enterprise AI Harnesses

You don’t need Stripe’s infrastructure to apply the principles behind Minions. You don’t need Shopify’s scale to benefit from structured AI code review. Here’s what smaller teams can actually implement.

Start with a context document

The single highest-leverage action most teams can take is creating a well-structured context document for their codebase — and using it in every AI coding session.

This document should include:

  • The tech stack and major dependencies
  • Internal abstractions or custom frameworks, with brief explanations
  • Code style decisions and their rationale
  • What the codebase should not do (deprecated patterns, anti-patterns)
  • How tests are structured
  • Common workflows (e.g., how to add a new API endpoint)

Drop this into a system prompt, a custom instructions file, or a .cursorrules / .claude config file depending on your tool. A document like this captures much of the benefit of the kind of context engineering Stripe invests in, at a fraction of the cost.
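A skeleton of such a document might look like this. The contents are illustrative placeholders for a hypothetical Python codebase, not recommendations:

```markdown
# Codebase Context

## Stack
Python 3.12, FastAPI, PostgreSQL via SQLAlchemy.

## Internal abstractions
- `app.core.Result` wraps all service-layer return values; never return raw dicts.

## Style decisions
- Keyword-only arguments on public functions (rationale: safer refactors).

## Do not
- No new uses of the legacy `utils.http` client; use `app.clients` instead.

## Tests
- pytest; one test module per source module, shared fixtures in `conftest.py`.
```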

Define the task type before reaching for AI

Stripe and Shopify use their AI tools for specific, defined task types — not “just help me with code.” Before using AI for a coding task, get specific about what you’re asking for:

  • Is this a transformation (existing code → updated code with defined rules)?
  • Is this a generation (new code for a defined spec)?
  • Is this a review (critique existing code against defined standards)?
  • Is this a debugging session (trace through behavior to find an issue)?

Different task types call for different prompts, different context, and different verification approaches. Being explicit about the task type before you start improves results significantly.
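One lightweight way to enforce this discipline is a small lookup that refuses to run until a task type is declared, pairing each type with its own prompt skeleton and verification note. A sketch; the prompt and verification strings are illustrative:

```python
TASK_TYPES = {
    "transformation": {"prompt": "Apply these rules to the existing code:",
                       "verify": "run tests + diff scope check"},
    "generation":     {"prompt": "Write new code for this spec:",
                       "verify": "run tests + lint"},
    "review":         {"prompt": "Critique this code against these standards:",
                       "verify": "spot-check low-context suggestions"},
    "debugging":      {"prompt": "Trace this behavior and locate the issue:",
                       "verify": "reproduce the bug, then confirm the fix"},
}

def start_task(task_type, detail):
    """Build a task prompt, refusing to proceed without a declared type."""
    if task_type not in TASK_TYPES:
        raise ValueError(f"Declare a task type first: {sorted(TASK_TYPES)}")
    spec = TASK_TYPES[task_type]
    return f"{spec['prompt']} {detail}\n(verification: {spec['verify']})"

prompt = start_task("review", "the service layer of the billing module")
```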

Build verification in, not on top

This is the hardest principle to implement but the most important. Whatever AI output you’re generating, have a defined way to verify it before it becomes someone’s problem to untangle.

For code generation:

  • Run your test suite against AI-generated code before reviewing it yourself
  • Add a linting step specifically for AI output
  • Check git diffs to confirm scope — the AI should have changed what you asked, not more

For code review:

  • Don’t accept AI feedback uncritically; have a quick manual check for suggestions that touch areas the AI might not have full context on

Use structured output formats

Both Minions and Roast produce structured output — not just raw text. When getting AI help with code, ask for structured responses:

  • “Review this code and give me feedback in three categories: correctness, performance, and style”
  • “Generate this function and include: the code, a brief explanation of the approach, and the test cases”
  • “Identify potential bugs in this PR diff and list them with: location, issue, suggested fix”

Structured output is easier to parse, prioritize, and act on than a wall of AI text.
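If you go further and ask the model to reply in JSON, it is worth validating the shape before acting on it; a malformed reply is a signal to retry, not something to parse by eye. A minimal check for the three-category review format (an assumption of this sketch, not a standard):

```python
import json

REQUIRED = {"correctness", "performance", "style"}

def parse_review(raw):
    """Parse a model reply expected to be JSON with one list per category.
    Returns the parsed dict, or None if the shape is wrong (signal to retry)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not REQUIRED <= data.keys():
        return None
    return {k: list(data[k]) for k in REQUIRED}

reply = ('{"correctness": ["off-by-one in loop"], '
         '"performance": [], "style": ["long function"]}')
review = parse_review(reply)
```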

Track what you use (and what you don’t)

Shopify’s feedback loop — tracking whether developers act on Roast’s suggestions — is something you can approximate manually. Keep a simple log of AI suggestions you used, ignored, or modified. After a few weeks, look at the patterns.

This tells you where the AI is genuinely useful for your codebase and where it’s generating noise. You can then refine your prompts and context accordingly.


Where MindStudio Fits in the AI Coding Workflow

The principles behind Stripe Minions and Shopify Roast aren’t limited to giant engineering orgs with custom infrastructure. The underlying pattern — structured context, defined task scope, automated verification, and a feedback loop — applies equally to smaller teams.

The challenge for most teams is infrastructure. Building what Stripe built requires engineering time. Most teams don’t have months to invest in building and maintaining a custom AI harness.

This is where MindStudio is relevant. MindStudio lets you build structured AI workflows without custom infrastructure — the same kind of orchestration logic that sits behind Minions or Roast, but accessible without standing up your own system.

Specifically, MindStudio’s Agent Skills Plugin (the @mindstudio-ai/agent npm SDK) allows any AI agent — including Claude Code, LangChain agents, or custom scripts — to call pre-built, typed capabilities as simple method calls. The infrastructure layer (rate limiting, retries, authentication) is handled automatically, so the agent focuses on the reasoning and execution logic.

For a small engineering team looking to build something like a code review assistant or a systematic refactoring workflow, MindStudio’s visual builder can act as the orchestration layer — connecting the AI model, the code input, verification steps, and output formatting — without requiring the team to write all of that scaffolding themselves.

You can build, for example:

  • A PR review workflow that ingests a diff, pulls in relevant codebase context from a Notion or Google Doc, runs it through a defined review prompt, and delivers structured feedback to Slack
  • A batch transformation tool that accepts a defined code pattern, applies it across provided files, and outputs a structured diff
  • A test generation agent that reads a function definition and produces test cases in your team’s established test format

The average MindStudio workflow takes under an hour to build. For teams that want structured AI coding assistance without the overhead of custom infrastructure, it’s a practical starting point.

You can try it free at mindstudio.ai.


FAQ: AI Coding Harnesses, Minions, and Roast

What is an AI coding harness?

An AI coding harness is a structured system that wraps an AI model in defined workflows for software development tasks. Rather than using an AI model ad hoc through a chat interface, a harness provides:

  • A structured input format (context, task definition, scope)
  • Constrained execution (what the AI can and cannot modify)
  • Automated verification (tests, linting, diff review)
  • Structured output (formatted for human review or downstream processing)

The term “harness” comes from software testing — a test harness constrains and structures test execution. An AI coding harness does the same for AI-generated or AI-assisted code.

What is Stripe Minions?

Stripe Minions is Stripe’s internal AI coding system, built around a multi-agent architecture. The system decomposes complex engineering tasks (like large-scale refactoring or migration) into smaller subtasks, assigns each to individual AI agents (“minions”), runs them in parallel, verifies the output of each, and aggregates the results for human review.

The system is designed for throughput — doing systematic engineering work at a scale and speed that would be impractical with purely human effort.

What is Shopify’s AI coding system?

Shopify has built a code review assistant referred to in engineering discussions as Roast. It analyzes pull request diffs against the broader codebase context and generates structured, actionable feedback before (or alongside) human review.

Roast is designed to maintain code quality at scale — ensuring that every PR gets substantive feedback regardless of reviewer availability or bandwidth. Shopify has also made AI usage a company-wide expectation, with Tobi Lütke’s 2025 memo explicitly stating that AI integration is now part of performance evaluation.

How do Stripe Minions and Shopify Roast differ?

The core difference is focus:

  • Stripe Minions focuses on code generation — creating or transforming code at scale through parallel AI agents
  • Shopify Roast focuses on code review — providing structured feedback on existing code before human reviewers engage

Stripe’s system solves the problem of “we have a lot of systematic engineering work to do.” Shopify’s system solves the problem of “we need consistent code quality review across a large, fast-moving team.”

Both are harnesses. Both share principles around context engineering, structured output, and verification. But they’re optimized for different points in the development workflow.

Can small teams build AI coding harnesses?

Yes, and the principles scale down well. The key practices are:

  1. Build a codebase context document and use it in every AI session
  2. Define the task type explicitly before prompting (transformation, generation, review, debugging)
  3. Run automated verification (tests, linting) on AI output before human review
  4. Ask for structured output, not raw narrative responses
  5. Track which AI suggestions you use vs. ignore to improve your prompts over time

You don’t need a multi-agent orchestration system or a custom PR review bot to benefit from these practices. They apply whether you’re using Claude, GPT-4, Cursor, or any other AI coding tool.

What context should you give an AI for coding tasks?

For coding tasks, useful context includes:

  • Stack and dependencies — what language, framework, major libraries
  • Internal abstractions — custom utilities, internal frameworks, naming conventions
  • Code style decisions — with rationale, not just rules
  • Anti-patterns — things the codebase explicitly avoids
  • Test patterns — how tests are structured in the project
  • Scope constraints — what files or areas should not be modified

The more specific and accurate your context, the better the AI’s output. Generic context produces generic (and often wrong) suggestions. Codebase-specific context produces suggestions that fit.

How does AI code review reduce reviewer fatigue?

AI code review reduces reviewer fatigue by handling the mechanical layer of review before human reviewers engage. This includes:

  • Checking for obvious bugs or missing error handling
  • Identifying performance issues (N+1 queries, unnecessary loops, etc.)
  • Verifying that tests cover the changed code
  • Flagging code style or convention violations

When human reviewers arrive at a PR that’s already been through AI review, they can skip these mechanical checks and focus on higher-order concerns: design decisions, architectural trade-offs, business logic correctness, and long-term maintainability.

This division of labor works because mechanical checks and design judgment require different skills and different context. AI is good at the former when given the right codebase context. Humans are necessary for the latter.


Key Takeaways

The Stripe Minions and Shopify Roast systems represent two distinct approaches to the same underlying challenge: integrating AI into enterprise software development in a way that’s structured, verifiable, and actually improves outcomes rather than creating new overhead.

Here’s what matters most from both:

  • Harnesses, not prompts. The difference between useful AI coding assistance and frustrating noise is structure. Context engineering, scope constraints, verification steps, and structured output are what separate a production system from an experiment.
  • Generation and review are different problems. Stripe optimized for throughput (code generation at scale). Shopify optimized for quality assurance (code review at scale). Both are valid targets — but they require different architectures.
  • Feedback loops are what make systems improve. Both organizations track outcomes — whether AI output passes verification, whether feedback suggestions are acted on. Without feedback loops, AI tools don’t get better; they just keep producing the same quality of output.
  • Context is the highest-leverage investment. In both systems, the quality of the AI’s output is directly tied to the quality of the context it receives. This is true at any scale — from Stripe’s engineering organization to a two-person startup.
  • Human judgment stays in the loop. Neither system removes human review from the critical path. AI handles pattern-matching and mechanically verifiable work. Humans handle design and accountability. That division of responsibility is intentional and important.

If you’re thinking about building structured AI workflows for your own team — whether for code review, systematic generation, or any other development task — MindStudio lets you start building that infrastructure today without standing up custom tooling from scratch.