
A Google Principal Engineer Said Claude Code Beat a Year of Her Team's Work in 1 Hour — Here's How to Use That to Get Approved

Janna Dogen's 9M-view post is your best argument for getting Claude approved over Copilot. Here's how to frame it for your manager.

MindStudio Team

A Google Principal Engineer’s Post Got 9 Million Views — Here’s How to Use It in Your Next IT Meeting

Janna Dogen, a principal engineer at Google working on the Gemini API team, posted that Claude Code produced something close to what her team had built over a year — in roughly one hour. That post got approximately 9 million views. If you’re trying to get Claude approved at a company that defaulted to Copilot, that number is your opening argument.

But you can’t just paste the LinkedIn post into a Slack message and expect procurement to move. The reason Dogen’s post went viral is the same reason it won’t work as a standalone argument inside your company: it sounds like a preference claim. “Claude is better” is not a sentence that travels through an org. “For this specific job, the default costs us four extra hours a week compared with a specialist, and I can prove it” is.

This post is about how to build that proof, and how to use Dogen’s story as the frame rather than the argument.


What Dogen Actually Said (And What It Means for Your Case)


The precise version matters here. Dogen didn’t say Claude shipped Google’s production system in an hour. She gave Claude Code a description of a distributed agent orchestrator problem her team had been working through, and Claude produced a prototype close to what they’d built. The assessment came from someone who already understood the problem deeply enough to judge the output instantly.

That last part is the point. An expert could see the delta right away because she knew what good looked like. She wasn’t measuring tokens per dollar or output length. She was measuring whether the thing worked.

That’s the measurement you need inside your company. Not a benchmark. Not a vendor demo. A person who knows the work, running the same job through the default and through the specialist, and reporting what they see.

The 9 million views tell you something else: this resonated because it wasn’t an isolated experience. Engineers everywhere recognized it. The workaround culture — Cursor on a personal account, Perplexity on an expense report nobody wants to file, ChatGPT Plus running alongside the approved tool — exists because the gap is real and widely felt. Dogen just said it out loud with enough specificity that people couldn’t dismiss it.


Why Your Current Argument Isn’t Landing

The company hears “Claude is better than Copilot” as a preference. And from far away, the tools do look equivalent: chat interface, enterprise plan, security review, model underneath. The category is broad enough that “AI tool” doesn’t tell you whether a given tool can do a meaningful job, in the same way that “a place to put numbers” doesn’t tell you whether you need Excel or a data warehouse.

There’s also a structural problem. Microsoft Copilot has around 20 million paid enterprise seats. Office 365 has roughly 320 million paid seats. That’s about 6% penetration into the addressable base — which means most Copilot deployments are happening because it came bundled, not because anyone ran a shootout. The company picked the default because it was already in the stack, and that was a defensible decision. Vendor consolidation is real. Volume discounts are real. Compliance review is real.

Your argument fails not because you’re wrong but because you’re making the wrong kind of claim. “The tool is bad” is a complaint. “For this job, the default costs us X hours a week and I can prove it” is a business case. Those travel through organizations very differently.

The hidden tax of a bad AI default is paid in 30-minute chunks — the cleanup pass, the second draft, the manual check after the code review misses something. It never shows up as a line item. Procurement doesn’t see it. Your manager might not see it. The only person who sees it is you, which is why the measurement has to start with you.


The Wealthsimple Model: How a Real Company Did This

If you need a corporate precedent beyond Dogen’s post, the Wealthsimple story is useful. Gergely Orosz at the Pragmatic Engineer reported on how Wealthsimple CTO Dedric Vanlier approached AI developer tools across a team of roughly 600 engineers.


The coding tool decision wasn’t made by gut feel. They ran a structured shootout for code review tooling, and they backed the broader coding tool choice with behavioral usage data from Jellyfish — specifically, which tools engineers were actually using versus abandoning. That distinction matters. Adoption data shows what people reach for when they have a choice. It’s harder to dismiss than a survey.

Nobody has perfectly solved AI productivity measurement. Lines of AI-generated code can be a vanity metric. Velocity can move for reasons unrelated to AI. But that doesn’t mean you don’t measure — it means you measure closer to the work. What did the agent produce? How much rework did it need? Would the person doing the job use the output?

That informal measurement, done at the individual contributor level, creates the shape of an answer. A CTO like Vanlier can then commission a formal measurement on top of it. But the useful signal starts with the people who know what good looks like.


The Sales Ops Example: What Measurement Actually Looks Like

Here’s a concrete version of this. A sales ops lead at a company that defaulted to Copilot runs a pipeline hygiene report every Monday: deals without next steps, close dates that slipped more than twice, risk summaries, a brief for the revenue leadership Slack channel.

Under Copilot, she spends about 90 minutes getting the report to a standard she’s willing to send. The model writes fine sentences, but it struggles with deal history structure and keeps surfacing the wrong slip dates. Quality score: roughly 2–3 out of 5. Would she send it without cleanup? Usually no.

She runs the same job through a specialist agent wired to the same sources. First week: 20 minutes of cleanup. Second week: 10 minutes, because she’s getting better at working with it. Copilot still averages 90 minutes. The specialist averages 15. Quality score: 4 out of 5. The “would you send it” column flips from no to yes on most runs.

That’s the artifact. Not a complaint. A log with a job class, a delta, and a quality score. She multiplies it across the org, counts how many people run similar jobs, and sends in the ask.
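
The multiplication itself is simple, but writing it down is what turns a complaint into an ask. Here’s a minimal sketch using the numbers from the example above; the headcount and run frequency are hypothetical, not measured data.

```python
# Minimal sketch of the roll-up, using the numbers from the example above.
# The headcount and run frequency are hypothetical, not measured data.
runs_per_week = 1              # the report goes out every Monday
default_minutes = 90           # average time to a sendable report with the default
specialist_minutes = 15        # average time with the specialist
people_with_similar_jobs = 12  # hypothetical count across the org

minutes_saved_per_person = (default_minutes - specialist_minutes) * runs_per_week
hours_saved_per_week = minutes_saved_per_person * people_with_similar_jobs / 60

print(f"{minutes_saved_per_person} min/person/week, "
      f"{hours_saved_per_week:.0f} hours/week across the org")
# With these numbers: 75 minutes per person, 15 hours across 12 people.
```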

The success criterion is always the same: did the agent do the job well enough to substitute for the work you were going to do anyway? Not whether the output looks polished. Whether you’d actually use it.

If you’re building the kind of specialist agent that would replace that Copilot workflow, platforms like MindStudio handle the orchestration layer — 200+ models, 1,000+ integrations including Salesforce and HubSpot, and a visual builder for chaining the data sources and model calls that a pipeline hygiene report actually requires.


How to Frame the Ask Without Attacking the Default

Most people make the ask too big. “Let’s replace Copilot” loses almost every time, because the company has legitimate reasons for picking the default. Don’t ask them to admit the whole decision was wrong.

Ask a smaller, sharper question: within our commitment to the default, what specific subset of work is the default doing worse than a specialist? That keeps the prior commitment intact. It gives everyone a way to say yes without reversing themselves.

Then the second question: what would it cost to add the specialist only for that subset?


If your team does seven things and the default handles five of them adequately, don’t switch everything. Keep the default for the five. Add a specialist for the two where the work demands it. The correct answer in the agent layer is almost never one tool for everything — it’s routing. That’s not a violation of standardization. It’s a better standardization policy.
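
If “routing” sounds abstract, it isn’t: the policy is just a mapping from job classes to tools. Here’s a hypothetical sketch; every job class and assignment in it is a placeholder, not a recommendation.

```python
# Hypothetical routing policy: job classes mapped to the tool that owns them.
# Every job class and assignment below is a placeholder for illustration only.
ROUTING_POLICY = {
    "email drafts":            "default",
    "meeting summaries":       "default",
    "slide outlines":          "default",
    "status updates":          "default",
    "doc Q&A":                 "default",
    "pipeline hygiene report": "specialist",  # measured delta justifies the exception
    "multi-file code changes": "specialist",
}

def tool_for(job_class: str) -> str:
    # Anything not yet measured stays on the default until someone measures it.
    return ROUTING_POLICY.get(job_class, "default")

print(tool_for("pipeline hygiene report"))  # -> specialist
```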

The altitude of the ask changes depending on who’s hearing it. At the IC-to-manager level, keep it small: here’s the log, here’s the delta, can I get an approved license for this one job? A lot of managers will engage with that. If they say no, they’ll usually give you a specific blocker — procurement, security review, budget timing — and a specific no is just the next problem to solve.

At the manager-to-director level, the ask becomes a pilot: three people ran this measurement, two show the same pattern, I want to pilot the specialist for this job class for a quarter and report back.

At the director-to-exec level, you’re not asking for a tool. You’re asking the company to commission measurement. The question is: how would we know if our AI default is costing us? The honest answer is: we’d only find out when our best people quietly leave for companies that give them better tools.


The Four Objections You’ll Get

“We already paid for it.” The license fee is a sunk cost. The question is whether an incremental specialist license for a bounded job returns more in reclaimed time than it costs. If it’s four hours a week per person, you can multiply that out.

“This is shadow IT.” Shadow IT is adopting a tool without disclosure and without review. What you’re doing is the opposite — you’re putting it in front of the company and asking for a formal process. That’s the definition of responsible disclosure.

“We need to standardize.” Companies already know that standardization doesn’t mean one tool for every job. They use Excel and Tableau and Looker for different analytics jobs. The agent layer is the same. Standardize on a default where it wins, use specialists where the work demands them, and measure the boundary.

“We won’t approve another vendor.” Sometimes that’s a real constraint. Sometimes it’s a reflex. The way to tell is to push on the specific blocker: is it data residency? Admin controls? A contract minimum? The only version that’s unworkable is “no because no.” If that’s the answer, there’s probably a retention problem coming.

On retention: people are leaving companies specifically because of inadequate AI tooling. This isn’t theoretical. Engineers who can’t get the tools they need are moving to companies that give them better defaults. Talent is concentrating in places where AI-native tooling is available, and that’s already a 2026 theme. If your company can’t make this case internally, that’s your company’s loss.


What to Actually Do This Week

Pick one job. Not the job that makes the best philosophical point — the job you can measure, that runs weekly, takes at least 30 minutes, where you can judge the output instantly, and where the output has a real audience. A team channel, a customer, a manager. That last criterion matters: if nobody sees the output, the company can dismiss it as a personal workflow preference.

Run that job through Copilot and through Claude. Measure time spent, rework required, quality score, and whether you’d actually send the result. Do it for a week. By the end, you have somewhere between five and fifteen data points — more real evidence about your work than anything produced during the original procurement decision.
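
The log doesn’t need tooling; a spreadsheet column per field is fine. If you’d rather script the summary, here’s a minimal sketch with illustrative rows and field names (they’re suggestions, not a standard format).

```python
# Illustrative log rows; the field names and values are suggestions, not data.
# (day, job, tool, minutes, rework_minutes, quality 1-5, would_send)
log = [
    ("Mon", "deal risk brief", "default",    80, 30, 2, False),
    ("Mon", "deal risk brief", "specialist", 20,  5, 4, True),
    ("Wed", "deal risk brief", "default",    90, 35, 3, False),
    ("Wed", "deal risk brief", "specialist", 15,  0, 4, True),
    ("Fri", "deal risk brief", "default",    85, 25, 2, False),
    ("Fri", "deal risk brief", "specialist", 10,  0, 5, True),
]

def summarize(rows, tool):
    picked = [r for r in rows if r[2] == tool]
    avg_minutes = sum(r[3] for r in picked) / len(picked)
    send_rate = sum(r[6] for r in picked) / len(picked)
    return avg_minutes, send_rate

for tool in ("default", "specialist"):
    minutes, send_rate = summarize(log, tool)
    print(f"{tool}: {minutes:.0f} min avg, sent without rework {send_rate:.0%} of runs")
```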

For engineering-specific work, the Claude Code source code leak revealed 8 practical features worth knowing before you run your test — some of them directly affect how Claude Code handles the kind of complex reasoning tasks Dogen was describing. And if you’re thinking about how to structure the agent that would replace your current workflow, the GStack framework from Gary Tan is a useful reference for how solo developers are already structuring Claude Code projects at production scale.

If the job involves building something from scratch rather than running a recurring workflow, the abstraction question becomes relevant. Tools like Remy take a different approach: you write a spec — annotated markdown — and the full-stack app gets compiled from it. Backend, database, auth, deployment, all of it. The spec is the source of truth; the generated TypeScript is derived output. That’s a different kind of productivity argument than “Claude is faster at code review,” but it’s the same underlying logic: measure what the tool actually produces against what you’d have built otherwise.

The Dogen post got 9 million views because it was specific. A principal engineer on the Gemini API team, running a real problem she understood deeply, got a result she could judge instantly. That specificity is what made it credible. Your measurement needs the same quality: a real job, a real success criterion, a real audience for the output.

The emotionally satisfying version of this is to pound the table and say the AI revolution is proceeding without you. The version that actually works is to bring the log, state the delta, and ask for exactly what the data supports.

If you want to go deeper on how Claude Code handles parallel workstreams — which is relevant if you’re trying to demonstrate throughput advantages — Claude Code’s Git worktree support for parallel feature branches is worth understanding before you run your comparison. The gap between what Copilot and Claude can do on complex, multi-context tasks is where the measurement tends to be most legible.
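
The setup itself is plain Git rather than anything Claude-specific. Here’s a minimal sketch of what staging two parallel branches looks like; the paths and branch names are hypothetical.

```python
# Minimal sketch: create two git worktrees so separate Claude Code sessions can
# work on parallel feature branches without touching each other's files.
# Run from inside an existing repository; paths and branch names are hypothetical.
import subprocess

def add_worktree(path: str, branch: str) -> None:
    # `git worktree add -b <branch> <path>` creates the branch and checks it out
    # in its own directory, sharing history with the main checkout.
    subprocess.run(["git", "worktree", "add", "-b", branch, path], check=True)

add_worktree("../repo-auth", "feature/auth-refactor")
add_worktree("../repo-reporting", "feature/pipeline-report")
# Then start one Claude Code session in each directory and let them run in parallel.
```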
