Claude Code's Creator Says Anthropic Has Zero Manually Written Code — Here's How They Did It

Boris Cherny, who built Claude Code, says every line at Anthropic is now AI-generated. Here's the multi-agent setup behind that claim.

MindStudio Team

Boris Cherny Said It Out Loud — Now What Do You Do With It?

Boris Cherny, the engineer who built Claude Code, stood in front of a room at Anthropic’s Code with Claude event and said something that should stop you mid-scroll: “There is literally no manually written code anywhere in the company anymore. Claudes coordinate with each other over Slack, code in loops, and resolve issues across the codebase.”

Not “we use AI to help write code.” Not “our developers are more productive with AI assistance.” Zero manually written code. At the company that builds the model.

If you’re building software professionally, that sentence deserves more than a retweet. It deserves a close read of how they got there — because the architecture behind that claim is specific, and most of it is now available to you.


What Zero Manually Written Code Actually Looks Like in Production

The claim sounds like a headline. The mechanism is the interesting part.

Anthropic’s internal setup isn’t a single Claude instance with a very long prompt. It’s a multi-agent orchestration system where a lead agent breaks work into pieces and delegates each piece to specialist sub-agents — each with its own model selection, its own prompts, and its own tools. Those sub-agents run in parallel on a shared file system. The lead agent can check in on sub-agents mid-workflow, not just at the end.
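
To make the shape of that concrete, here is a minimal sketch of the pattern in Python, assuming the official anthropic SDK and an ANTHROPIC_API_KEY in the environment. The model name, subtasks, and shared-directory layout are illustrative placeholders, not Anthropic's internal setup.

```python
# Sketch: a lead agent fans subtasks out to parallel sub-agents on a shared file system.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import anthropic

client = anthropic.Anthropic()           # reads ANTHROPIC_API_KEY from the environment
WORKSPACE = Path("shared_workspace")     # the shared file system sub-agents write to
WORKSPACE.mkdir(exist_ok=True)

SUBTASKS = {
    "market_data": "Summarize current market data for ACME Corp.",
    "competitors": "Summarize the latest competitor filings for ACME Corp.",
    "narrative":   "Draft the narrative section of a financial analysis of ACME Corp.",
}

def run_subagent(name: str, prompt: str) -> str:
    """One specialist sub-agent: its own prompt, its own model, its own output file."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # illustrative; choose per sub-agent (see Step 2)
        max_tokens=2000,
        system=f"You are the '{name}' specialist in a multi-agent pipeline.",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.content[0].text
    (WORKSPACE / f"{name}.md").write_text(text)   # other agents can read this file
    return text

# The lead agent fans the subtasks out in parallel, then reads everything back.
with ThreadPoolExecutor(max_workers=len(SUBTASKS)) as pool:
    futures = {name: pool.submit(run_subagent, name, p) for name, p in SUBTASKS.items()}
    results = {name: f.result() for name, f in futures.items()}
```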

[Promotional graphic: a vibe-coded app (tangled, half-built, brittle) versus an app managed by Remy, with UI in React + Tailwind, validated API routes, Postgres with auth, and a production-ready deploy. Caption: built like a system, not vibe-coded; Remy manages the project, every layer architected rather than stitched together at the last second.]

The entire execution is auditable inside Claude Console. You can see what each sub-agent did, in what order, and read the reasoning behind each decision. That last part matters more than people give it credit for: auditability is what separates a production system from a demo.

What keeps the output quality high isn’t just the model. It’s the harness around it.

Cherny also pushed back on the term “vibe coding” at the same event — he said it no longer describes what he and most developers actually do. Andrej Karpathy, who coined the term, has suggested “agentic engineering” as a replacement. Cherny isn’t fully sold on that either, but the point stands: what Anthropic is running internally involves copious automated testing and verification loops. It’s not vibes. It’s a system.


The Infrastructure You Need Before This Works

Before you try to replicate any of this, be honest about what you’re starting with.

A model that can reason about code at the task level. Opus-class models are doing the heavy lifting here. The recent API rate limit increases — output tokens went from 8,000 per minute to 80,000 per minute on higher tiers — matter specifically because parallel sub-agents generate a lot of output tokens simultaneously. If you were rate-limited into using Haiku or Sonnet for everything, that constraint has loosened considerably.

A shared file system or state layer. Sub-agents need somewhere to write intermediate outputs that other agents can read. This doesn’t have to be exotic — a shared directory, a database, or even structured files work. The key is that agents aren’t passing everything through the lead agent’s context window.
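
One way to see what that means in practice: a sketch of a file-based state layer where each sub-agent publishes its full output to disk and registers only a short summary, so the lead agent reads pointers instead of full transcripts. The file names and manifest format here are assumptions for illustration.

```python
# Sketch: shared state layer with a manifest of pointers + summaries.
import json
from pathlib import Path

STATE = Path("shared_workspace")
STATE.mkdir(exist_ok=True)
MANIFEST = STATE / "manifest.json"

def publish(agent: str, output: str, summary: str) -> None:
    """Write the full output to the shared directory; register a pointer plus summary."""
    path = STATE / f"{agent}.md"
    path.write_text(output)
    manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    manifest[agent] = {"path": str(path), "summary": summary}
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def lead_agent_view() -> str:
    """What the lead agent loads into context: summaries and pointers, not full outputs."""
    manifest = json.loads(MANIFEST.read_text())
    return "\n".join(
        f"- {agent}: {entry['summary']} (full output: {entry['path']})"
        for agent, entry in manifest.items()
    )
```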

A rubric. This is the part most builders skip. Anthropic’s Outcomes feature — where a separate grading agent scores output against a user-defined rubric — produced an 8.4% improvement in Word document quality and a 10.1% improvement in PowerPoint quality on their internal benchmarks. No model change. Just a grading agent with a rubric. If you don’t have a written definition of what “good” looks like for your task, you can’t automate quality enforcement.

Error recovery and state management. The April launch of Anthropic’s managed agents platform added sandbox environments, state management, and error recovery — plus the ability to run agents on a cloud compute instance rather than a local desktop session. If you’re running agents locally and they fail halfway through a long task, you’re starting over. That’s a development workflow, not a production one.


Building the System: From Single Agent to Coordinated Fleet

Step 1: Define the job decomposition

Start with a task your team does repeatedly. Not a one-off creative project — a recurring workflow with predictable structure. Report generation, code review, data extraction, proposal drafting.

Write out every distinct subtask. A financial analysis might decompose into: pull market data, summarize competitor filings, build the financial model, draft the narrative, format the output. Each of those is a candidate for a specialist sub-agent.

Now you have a task map.
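
For the financial-analysis example above, the task map can be as plain as a list of subtasks with dependencies. The structure below is an assumption, not a required schema; anything that captures the same information works. Subtasks whose dependencies are all complete can run in parallel.

```python
# Sketch: a task map as plain data, with dependencies marking what must finish first.
TASK_MAP = [
    {"name": "market_data", "goal": "Pull market data",             "depends_on": []},
    {"name": "filings",     "goal": "Summarize competitor filings", "depends_on": []},
    {"name": "model",       "goal": "Build the financial model",    "depends_on": ["market_data", "filings"]},
    {"name": "narrative",   "goal": "Draft the narrative",          "depends_on": ["model"]},
    {"name": "formatting",  "goal": "Format the output",            "depends_on": ["narrative"]},
]

def ready(done: set) -> list:
    """Subtasks whose dependencies are all satisfied; these can run in parallel."""
    return [t["name"] for t in TASK_MAP
            if t["name"] not in done and all(d in done for d in t["depends_on"])]
```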

Step 2: Assign models and tools to each sub-agent

[Promotional graphic: plans first, then code. A project with 12 screens and 6 database tables, built by Remy and live at yourapp.msagent.ai. Caption: Remy writes the spec, manages the build, and ships the app.]

Not every sub-agent needs Opus. The sub-agent pulling structured data from an API might run fine on Sonnet. The sub-agent writing the narrative section probably wants Opus. Model selection per sub-agent is one of the main cost levers in a multi-agent system — and it’s one of the things the Anthropic managed agents platform now supports explicitly.

Each sub-agent also gets its own tool access. The market research agent gets web search. The model builder gets code execution. The formatter gets file write access. Scoping tools per agent reduces the blast radius when something goes wrong.
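
A sketch of what that assignment might look like as configuration. The model identifiers and tool labels are illustrative placeholders, not a fixed schema; the point is that model choice and tool scope are set per sub-agent, not globally.

```python
# Sketch: per-sub-agent model and tool assignments (identifiers are illustrative).
AGENT_CONFIG = {
    "market_data": {"model": "claude-sonnet-4-20250514",  "tools": ["web_search"]},
    "filings":     {"model": "claude-sonnet-4-20250514",  "tools": ["web_search"]},
    "model":       {"model": "claude-opus-4-20250514",    "tools": ["code_execution"]},
    "narrative":   {"model": "claude-opus-4-20250514",    "tools": []},
    "formatting":  {"model": "claude-3-5-haiku-20241022", "tools": ["file_write"]},
}
```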

If you’re building this without Anthropic’s managed infrastructure, platforms like MindStudio handle this orchestration layer: 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows without writing the orchestration code yourself.

Step 3: Build the grading agent

This is the step that produces the 10% quality improvement, and it’s the one most builders skip because it feels like overhead.

Write a rubric. Literally a document. For a financial pitch, it might include: Does the executive summary fit on one page? Are all claims sourced? Is the financial model internally consistent? Does the narrative match the numbers? Is the tone appropriate for the audience?

The grading agent reads the output and scores it against each criterion. If the score falls below your threshold, it returns the output to the task agent with specific notes on what failed. The task agent revises. This loop runs until the rubric is satisfied or a maximum iteration count is hit.
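
A minimal sketch of that loop, assuming the anthropic SDK. The rubric, threshold, and JSON response format are all yours to define; this approximates the pattern described here, not Anthropic's Outcomes feature.

```python
# Sketch: grading agent scores a draft against a rubric; the task agent revises until it passes.
import json
import anthropic

client = anthropic.Anthropic()

RUBRIC = """\
1. The executive summary fits on one page.
2. All claims are sourced.
3. The financial model is internally consistent.
4. The narrative matches the numbers.
5. The tone is appropriate for the audience.
"""

def grade(draft: str) -> dict:
    """Grading agent: score the draft against each rubric item, list specific failures."""
    response = client.messages.create(
        model="claude-opus-4-20250514",   # illustrative model name
        max_tokens=1000,
        system=(
            "You are a grading agent. Score the draft against each rubric item and reply "
            'with JSON only: {"score": <0-10 overall>, "failures": ["specific issue", ...]}.'
        ),
        messages=[{"role": "user", "content": f"RUBRIC:\n{RUBRIC}\nDRAFT:\n{draft}"}],
    )
    # A production version should validate or retry if the model returns malformed JSON.
    return json.loads(response.content[0].text)

def revise_until_passing(draft, task_agent, threshold=8.0, max_iters=3):
    """task_agent is your revision function: takes (draft, failure_notes), returns a new draft."""
    for _ in range(max_iters):
        report = grade(draft)
        if report["score"] >= threshold:
            return draft
        draft = task_agent(draft, report["failures"])
    return draft   # best effort after the iteration cap
```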

Every/Spiral, the AI writing tool, uses exactly this pattern — a multi-agent system with an editorial rubric enforcing writing quality before anything reaches a human editor. The rubric encodes their editorial standards and writer voice. The grading agent enforces it automatically.

Now you have a quality gate that doesn’t require a human to sit there reviewing every output.

Step 4: Add the memory layer

Single-session agents forget everything. Anthropic’s Dreaming feature addresses this: a scheduled process that reviews past agent sessions and memory stores, extracts patterns, and curates memories so agents improve over time. It surfaces recurring mistakes, workflows that agents converge on, and preferences shared across a team.

The core mechanic is that agents don’t just deliver completed tasks — they report what they learned while doing the task. That learning gets encoded into orchestration memory and preloaded the next time that sub-agent runs.

If you’re building this yourself rather than using managed agents, the three-layer memory architecture from the Claude Code source leak is a useful reference — it covers how memory.md functions as a pointer index and how persistent cross-session memory gets structured in practice.
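
The simplest version of that loop looks something like the sketch below: each sub-agent appends what it learned to its own memory file after a run, a small index stays cheap to preload, and the next run starts with those notes in context. The file layout (including memory.md as a pointer index) follows the pattern described above, but the details are assumptions.

```python
# Sketch: per-agent memory files plus a small pointer index preloaded each session.
from pathlib import Path

MEMORY_DIR = Path("agent_memory")
MEMORY_DIR.mkdir(exist_ok=True)
INDEX = MEMORY_DIR / "memory.md"   # short pointer index, cheap to preload every session

def record_learning(agent: str, lesson: str) -> None:
    """After a run, the sub-agent reports what it learned; append it to that agent's memory."""
    with (MEMORY_DIR / f"{agent}.md").open("a") as f:
        f.write(f"- {lesson}\n")
    # Keep the index small: one pointer line per agent, not the full memory contents.
    pointers = [f"- {p.stem}: see {p}" for p in sorted(MEMORY_DIR.glob("*.md")) if p != INDEX]
    INDEX.write_text("\n".join(pointers) + "\n")

def preload(agent: str) -> str:
    """Loaded into the sub-agent's system prompt the next time it runs."""
    path = MEMORY_DIR / f"{agent}.md"
    return path.read_text() if path.exists() else ""
```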

Step 5: Make it auditable

Cherny’s system is auditable. That’s not an accident. When Claude Console shows you what each sub-agent did and in what order, with the reasoning behind each decision, you get two things: debugging capability when something goes wrong, and the ability to explain the output to a stakeholder who asks “how did you get this?”

Build logging in from the start. Each sub-agent should write a brief record of what it did, what inputs it received, and what decision it made at any branch point. This is cheap to add and expensive to retrofit.
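
The cheapest version is an append-only log that every sub-agent writes to. The field names below are an assumption, but the shape (who, what inputs, what decision, when) is what makes the run inspectable later.

```python
# Sketch: one auditable JSONL record per sub-agent action.
import json
import time
from pathlib import Path

AUDIT_LOG = Path("shared_workspace/audit.jsonl")
AUDIT_LOG.parent.mkdir(exist_ok=True)

def log_step(agent: str, action: str, inputs: dict, decision: str) -> None:
    record = {
        "ts": time.time(),
        "agent": agent,
        "action": action,       # what the sub-agent did
        "inputs": inputs,       # what it received
        "decision": decision,   # what it chose at this branch point, and why
    }
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
```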

Now you have a system that can be inspected, not just run.


Where This Actually Breaks

[Promotional graphic: how Remy works. You ask for a sales CRM with a pipeline view and email integration; Remy scopes the project, wires up auth, database, and API, builds the pipeline UI and email integration, runs QA tests, and ships it live at yourapp.msagent.ai.]

The rubric problem. The grading agent is only as good as the rubric. If your rubric is vague (“the output should be high quality”), the grading agent will pass things it shouldn’t. Writing a good rubric takes domain expertise and iteration. Plan for the rubric to be wrong the first few times and build in a process for updating it based on what the grading agent passes that humans later reject.

Context window management in parallel execution. When sub-agents run in parallel and write to a shared file system, the lead agent’s context fills up with summaries and status updates. If you’re not actively managing what gets loaded into the lead agent’s context, you’ll hit limits on long-running tasks. The five Claude Code workflow patterns post covers compaction and context management strategies that apply directly here.
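
One compaction strategy, sketched under the assumption that full sub-agent outputs live on disk (as in the state layer above): anything over a size budget gets replaced in the lead agent's context by a small-model summary plus a pointer to the full file. This illustrates the idea; it is not the strategy Claude Code itself uses.

```python
# Sketch: compact long sub-agent reports before they enter the lead agent's context.
import anthropic

client = anthropic.Anthropic()
MAX_CHARS = 1500   # illustrative per-report budget for the lead agent's context

def compact(agent: str, output: str, path: str) -> str:
    """Replace long output with a small-model summary plus a pointer to the full file."""
    if len(output) <= MAX_CHARS:
        return output
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",   # a small model is fine for summarization
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Summarize this report in at most 10 bullet points:\n\n{output}",
        }],
    )
    return f"[{agent} summary; full output at {path}]\n{response.content[0].text}"
```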

Model routing costs. Running Opus on every sub-agent for a high-volume workflow gets expensive fast. The cost optimization Every/Spiral uses — tapping different Anthropic models for different sub-tasks — is worth building in early. Audit which sub-agents actually need frontier-model reasoning and which are doing structured extraction that a smaller model handles fine.

Error recovery in long chains. A five-agent pipeline where agent three fails leaves you with partial outputs and a decision about whether to restart from scratch or resume from the checkpoint. Managed agents handle this with built-in state management. If you’re rolling your own, you need explicit checkpointing logic. Without it, a transient API error in a 20-minute pipeline means starting over.
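
If you are rolling your own, the minimum viable version is a checkpoint file that records completed stages so a restart skips them. The sketch below assumes a linear pipeline, JSON-serializable stage outputs, and an illustrative file path.

```python
# Sketch: explicit checkpointing so a restart resumes instead of re-running everything.
import json
from pathlib import Path
from typing import Callable

CHECKPOINT = Path("shared_workspace/checkpoint.json")

def run_pipeline(stages: dict[str, Callable]) -> dict:
    """Run stages in order, skipping any that a previous (failed) run already completed."""
    done = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}
    for name, stage in stages.items():
        if name in done:
            continue                              # completed on a previous run
        done[name] = stage(done)                  # each stage sees prior results
        CHECKPOINT.write_text(json.dumps(done))   # persist after every stage
    return done
```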

The Slack coordination piece. Churnney mentioned that Anthropic’s agents coordinate over Slack. That’s not just a cute detail — it means the system has a communication layer that’s human-readable and interruptible. Humans can see what’s happening, inject context, and course-correct without stopping the whole pipeline. If your multi-agent system has no human-readable communication layer, you’re flying blind. The Claude Code Dispatch setup is one pattern for building that kind of remote visibility and control into an agentic workflow.
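
The lightest-weight version of that layer is agents posting status into a channel humans already watch. The sketch below assumes a Slack incoming webhook (the URL is a placeholder); any human-readable channel does the job.

```python
# Sketch: sub-agents post progress to a channel humans can read and reply in.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder

def post_status(agent: str, status: str) -> None:
    requests.post(SLACK_WEBHOOK_URL, json={"text": f"[{agent}] {status}"}, timeout=10)
```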


The Abstraction Shift Cherny Is Describing

Here’s the opinion: what Cherny is describing isn’t just a productivity improvement. It’s a change in what “writing software” means.

The history of programming is a history of raising the abstraction level. Assembly over machine code. C over assembly. High-level languages over C. Each step meant writing less of the lower layer and specifying more of the intent. The code still existed — it just wasn’t the thing you authored directly.

What Anthropic is running internally is the next step in that sequence. The source of truth is the specification: the rubric, the task decomposition, the orchestration logic, the memory schema. The code is derived output. Tools like Remy make this concrete in a different domain — you write an annotated markdown spec, and a complete TypeScript backend, database, auth layer, and deployment get compiled from it. The spec is what you maintain; the code is what gets generated. That’s the same abstraction shift, applied to full-stack app development.

Cherny’s discomfort with “vibe coding” makes sense in this frame. Vibe coding implies looseness, approximation, hoping it works. What Anthropic is running is the opposite: highly specified intent, automated quality enforcement, auditable execution, persistent memory. The process is rigorous. The artifact that’s gone is manually typed code.


What to Try This Week

If you’re building with Claude Code today, the multi-agent workflow patterns post is the right starting point for structuring parallel agent execution. If you want to understand how Claude Code’s own memory architecture handles persistence across sessions, the source leak analysis is worth reading before you design your own memory layer.

The Claude Finance cookbook Anthropic released alongside the 10 pre-built finance agents — pitch builder, meeting preparer, market researcher, evaluation reviewer, month-end closer — is also worth pulling apart even if you’re not in financial services. The cookbook shows how Anthropic structures agent prompts, tool assignments, and orchestration logic for production use cases. That’s a template you can adapt.

The specific thing worth doing today: write a rubric for one recurring task your team does. One page. Specific criteria. Then build the simplest possible grading agent that scores against it. You don’t need managed agents or Dreaming or multi-agent orchestration to start. You need a rubric and a grading step.

That’s where the 10% improvement comes from. Not from a better model. From a second agent that checks the first one’s work.

Cherny’s company has zero manually written code. That didn’t happen because the model got smarter. It happened because they built the harness.

Presented by MindStudio
