
Goldman Sachs Says AI Inference Is Approaching 10% of Payroll — 5 Steps to Audit Your Exposure Now

Goldman Sachs reports inference costs nearing 10% of headcount. Abacus AI says their AI bill beats payroll in 6 months. Here's your cost audit playbook.

MindStudio Team

Goldman Sachs Just Told You Something Your CFO Hasn’t Processed Yet

Goldman Sachs reported that AI inference costs are approaching 10% of total headcount costs at companies running serious AI workloads. That number landed quietly in a research note. Most finance teams haven’t modeled it. Most engineering teams don’t know it exists.

Abacus AI made it more concrete: “Our AI bill will overtake payroll in 6 months.” They’re not a small shop experimenting with chatbots. They’re an AI-native company that builds on top of these models professionally. If they’re hitting that threshold, you should assume you’re closer to it than your current dashboards suggest.

This is the audit problem. Not the cost problem — the audit problem. Most companies don’t know what they’re spending on inference, broken down by use case, model, or team. They have a line item. They don’t have a map.

The map is what you need before the spike, not after.


Why the Number Feels Abstract Until It Isn’t

Ten percent of headcount costs sounds like a lot. It is a lot. But the reason it sneaks up on organizations is structural, not accidental.


For the past two to three years, the major AI labs subsidized usage. Not metaphorically — literally. Microsoft’s GitHub Copilot was absorbing inference costs behind the scenes, pricing on requests rather than tokens, and offering flat-fee subscriptions that made heavy usage feel free. When GitHub announced its shift to consumption-based fees in May 2026, the new multiplier table made the subsidy visible for the first time: Claude Opus 4.7 jumped from a 7.5x multiplier to 27x. Gemini 3.1 Pro and GPT-5.3 Codex went from 1x to 6x. Microsoft had been eating roughly a 3.6x subsidy on every Opus token. That’s not a rounding error. That’s a structural transfer payment from Microsoft to its enterprise customers, and it just ended.
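
The arithmetic behind that subsidy claim is worth making explicit. A back-of-envelope sketch: the multipliers are the ones from GitHub’s table reported above, but the base request price and monthly volume are hypothetical, purely for illustration:

```python
# Back-of-envelope on the multiplier change. The multipliers come from
# GitHub's new table; the base price and request volume are hypothetical.
old_multiplier = 7.5    # Claude Opus 4.7, before the May 2026 change
new_multiplier = 27.0   # Claude Opus 4.7, after

subsidy_ratio = new_multiplier / old_multiplier
print(f"Implied subsidy: {subsidy_ratio:.1f}x")  # 3.6x

base_price = 0.04          # USD per premium request (hypothetical)
monthly_requests = 50_000  # a heavy engineering org (hypothetical)

old_bill = monthly_requests * base_price * old_multiplier
new_bill = monthly_requests * base_price * new_multiplier
print(f"Monthly bill: ${old_bill:,.0f} -> ${new_bill:,.0f}")  # $15,000 -> $54,000
```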

GitHub’s CPO Mario Rodriguez explained the logic plainly: “Today, a quick chat question and a multi-hour autonomous coding session can cost the user the same amount. GitHub has absorbed much of the escalating inference cost behind that usage, but the current premium request model is no longer sustainable.”

The same dynamic played out at Anthropic. Boris Cherny, an Anthropic researcher, wrote: “Our subscriptions weren’t built for the usage patterns of these third-party tools.” Replit moved to usage-based pricing in summer 2026 and absorbed significant backlash for being early. Now everyone is following.

The subsidy era is over. The question is whether your cost model knows that yet.


What the Evidence Actually Shows

The token consumption numbers are the part that makes this real.

One power user — the narrator of the AI Daily Brief — consumed approximately one billion tokens in a single month. That’s roughly 7,500 books’ worth of words. He’s an outlier, but outliers reveal the shape of the distribution. When agentic coding tools run multi-hour autonomous sessions across entire repositories, token consumption doesn’t scale linearly with the number of users. It scales with the number of tasks those users delegate to agents. And that number is going up fast.
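
The book conversion is easy to reproduce. A quick sketch, assuming roughly 0.75 English words per token and about 100,000 words per book; both are common rules of thumb, not figures from the episode:

```python
# How much text is a billion tokens?
tokens = 1_000_000_000
words_per_token = 0.75    # rough rule of thumb for English prose
words_per_book = 100_000  # a typical trade book (assumption)

words = tokens * words_per_token
books = words / words_per_book
print(f"{tokens:,} tokens ~ {words:,.0f} words ~ {books:,.0f} books")  # ~7,500 books
```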

OpenAI’s Codex user base grew from 200,000 on January 1st to 4 million the week before GPT-5.5 launched — a 20x increase in roughly four months. That isn’t ordinary user growth; that’s a demand curve changing shape. Each of those users isn’t sending one message a day. They’re running agents.

SemiAnalysis called Claude Code “the inflection point for AI agents” in February, predicting it would drive exceptional revenue growth for Anthropic. They were right about the usage. What they may have underestimated was how fast the compute constraints would bite back. Anthropic began metering compute during peak hours. Users complained they were hitting limits far too quickly. Ben Thompson at Stratechery suggested Anthropic’s reluctance to release their Mythos model widely might be less about security concerns and more about a simple compute shortage. Technology writer Tae Kim put it bluntly: “Anthropic vastly underestimated compute growth needs.”

Meanwhile, Meta’s headcount is down 10%, Microsoft’s is down 7%, and both companies’ AI CapEx is up 400%. The opex-to-capex transition is real, and it’s accelerating.

The Goldman Sachs number — inference approaching 10% of headcount costs — is a snapshot of where early movers already are. It’s a leading indicator for everyone else.


The Audit Most Companies Haven’t Run

Here’s the uncomfortable part: most companies don’t have the data to know where they stand relative to that 10% threshold.

They have a cloud bill. They have an API key. They might have a Slack channel where engineers post when something breaks. What they don’t have is a use-case-level breakdown of what each AI workflow costs per unit of output, which models are doing which tasks, and where premium models are doing work that cheaper models could handle.

That audit has five components, and most organizations have done none of them.


1. Find the spending leaks first. This means a use-case and task inventory — not a model inventory. The question isn’t “what models do we use?” It’s “what tasks are we running, and what model is handling each one?” The default behavior when building agentic systems is to reach for the most capable model to make the prototype work. That’s reasonable during development. It becomes expensive when it persists into production. Claude Opus 4.7 at $5 per million input tokens and $25 per million output tokens is the right tool for some tasks. It’s expensive overkill for document classification, routing decisions, or structured data extraction. If you’re running those tasks at Opus prices, you have a leak.
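
Putting per-run numbers side by side is the fastest way to see a leak. A minimal sketch using the Opus 4.7 prices above; the cheaper model’s prices and the per-task token counts are illustrative assumptions:

```python
# Cost per run = (input_tokens * input_price + output_tokens * output_price) / 1M.
# Opus 4.7 prices are from the text; the cheap model's prices and the
# token counts are illustrative assumptions.
PRICES = {                        # (input, output) in USD per million tokens
    "claude-opus-4.7": (5.00, 25.00),
    "cheap-model": (0.25, 1.25),  # hypothetical small model
}

def cost_per_run(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A document-classification task: large input, tiny structured output.
for model in PRICES:
    c = cost_per_run(model, input_tokens=12_000, output_tokens=200)
    print(f"{model}: ${c:.4f} per classification")  # ~20x apart
```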

2. Run a cheap model bake-off. The goal here is to build a task-specific performance map for lower-cost models. DeepSeek V4, for instance, is priced at $1.74 per million input tokens and $3.48 per million output tokens — compared to GPT-5.5 at $5 input and $30 output. For tasks where the quality delta is small, that’s a significant cost reduction. The bake-off isn’t about finding the cheapest model. It’s about finding the cheapest model that clears your quality threshold for each specific task type. Those are different questions, and the second one requires actual testing.
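
The harness for this can be small. A sketch of the shape, assuming you supply a call_model(model, prompt) wrapper, a task set, and a task-appropriate score function; those helpers, the model list, and the threshold are all assumptions:

```python
# Bake-off sketch: find the cheapest model that clears the quality bar
# for a given task type. call_model, score, and tasks are placeholders
# you would supply; CANDIDATES is ordered cheapest-first.
from statistics import mean

CANDIDATES = ["deepseek-v4", "gpt-5.5", "claude-opus-4.7"]
QUALITY_THRESHOLD = 0.95  # e.g., agreement with a frontier-model baseline

def bake_off(tasks, call_model, score):
    scores = {}
    for model in CANDIDATES:
        scores[model] = mean(score(t, call_model(model, t.prompt)) for t in tasks)
        if scores[model] >= QUALITY_THRESHOLD:
            return model, scores   # cheapest model that clears the bar
    return CANDIDATES[-1], scores  # nothing cleared it; keep the frontier model
```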

3. Assign ownership. A bake-off is a snapshot, and the model landscape changes every few weeks. New open-weight models appear. Pricing shifts. A model that was expensive last quarter might be cheap this quarter, or vice versa. Someone needs to own this continuously. The role is essentially competitive intelligence applied to model economics — tracking price changes, new releases, and quality benchmarks, and translating that into routing recommendations. If no one owns it, the decisions calcify around whatever was true when the system was first built.

4. Design for model switching. Most agentic systems are built with a single model hardcoded throughout. That’s an architectural choice that makes cost optimization nearly impossible later. The alternative is to build with model routing as a first-class concern — cheap models handle routine work, with escalation paths to more capable models when confidence is low, stakes are high, or the task is genuinely complex. This isn’t just a cost optimization. It’s a more honest architecture, because it acknowledges that not all tasks require the same capability level. Platforms like MindStudio make this kind of multi-model orchestration concrete: 200+ models available, with a visual builder for chaining agents and routing logic, so you’re not rewriting infrastructure every time you want to swap a model.
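
Stripped to its core, that routing logic is small. A generic sketch, not MindStudio’s implementation; the stakes flag, confidence field, threshold, and model names are all illustrative assumptions:

```python
# Escalation routing sketch: cheap model by default, frontier model when
# stakes are high or the cheap model is unsure. All names are illustrative.
CHEAP, FRONTIER = "deepseek-v4", "claude-opus-4.7"
CONFIDENCE_FLOOR = 0.8

def route(task, call_model):
    if task.high_stakes:                      # customer-facing, legal, irreversible
        return call_model(FRONTIER, task.prompt)
    draft = call_model(CHEAP, task.prompt)
    if draft.confidence >= CONFIDENCE_FLOOR:  # cheap model is confident; ship it
        return draft
    return call_model(FRONTIER, task.prompt)  # escalate the hard cases
```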

5. Make the costs visible. The last step is instrumentation. Build a cost scoreboard that tracks inference spend by use case, model, and team. Integrate it with quality metrics — escalation rate, correction rate, human review rate. The goal is to make the tradeoffs visible to the people making them. Engineers who can see that a particular workflow costs $0.40 per run at Opus and $0.03 per run at a smaller model, with equivalent output quality, will make different decisions than engineers who can’t see that data at all.
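
The scoreboard can start as one record per run plus a group-by. A minimal sketch; the field names are assumptions about what each run already knows about itself:

```python
# Scoreboard sketch: log one record per agent run, aggregate by
# (use case, model, team), and surface cost next to quality signals.
from collections import defaultdict

runs = []  # call record_run() wherever an agent run completes

def record_run(use_case, model, team, cost_usd, escalated, corrected):
    runs.append((use_case, model, team, cost_usd, escalated, corrected))

def scoreboard():
    agg = defaultdict(lambda: [0, 0.0, 0, 0])  # runs, cost, escalations, corrections
    for use_case, model, team, cost, esc, corr in runs:
        row = agg[(use_case, model, team)]
        row[0] += 1
        row[1] += cost
        row[2] += esc
        row[3] += corr
    for key, (n, cost, esc, corr) in sorted(agg.items(), key=lambda kv: -kv[1][1]):
        print(key, f"${cost:.2f} total, ${cost / n:.4f}/run, "
                   f"escalated {esc / n:.0%}, corrected {corr / n:.0%}")
```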


The Deeper Problem This Reveals

There’s a more interesting question underneath the cost audit question, and it’s worth sitting with.

In three consecutive monthly pulse surveys — January, February, and March — “cost savings” didn’t appear anywhere on the list of primary AI benefits. The top-ranked benefit was “new capabilities,” which grew from 21.9% to 29.3% as the primary benefit over that period. “Time savings” actually declined, from 19.7% to 12.7%.


This matters for how you think about the audit. If you’re running the audit purely as a cost-cutting exercise, you’re optimizing for the wrong thing. The companies that will get this right are the ones that use the audit to understand what they’re actually buying with each dollar of inference spend — which tasks are generating new capabilities, which are just automating existing workflows, and which are doing neither.

The Hermes.md billing incident at Anthropic is a useful illustration of what happens when cost controls are implemented without that understanding. A user on a $200/month Claude Max plan was charged an additional $200 in API fees because the string “hermes.md” appeared in their git commit history. Anthropic’s system detected what it thought was a third-party harness and routed the session to the API. The user hadn’t changed anything. The bug was in Anthropic’s detection logic. Anthropic eventually issued a refund and a month of credits, but only after the incident went viral. The lesson isn’t that Anthropic is malicious. The lesson is that when billing logic is opaque and cost controls are implemented at the infrastructure level without user visibility, you get surprises. The audit is partly about building the visibility that prevents those surprises from being surprises.

For teams building production systems on top of these models, the token management layer deserves the same engineering attention as any other infrastructure concern. There are concrete techniques for this — Claude Code token management approaches that can meaningfully extend session efficiency, and strategies for routing through lower-cost models without sacrificing output quality on tasks that don’t require frontier capability.
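
One of the simplest techniques in that layer is a hard token budget on conversation history. A generic sketch, not Claude Code’s actual mechanism, using a crude four-characters-per-token estimate in place of a real tokenizer:

```python
# Context-budget sketch: keep the system prompt plus the most recent turns
# under a token budget. Uses a rough 4-chars-per-token estimate; a real
# implementation would use the provider's tokenizer.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def trim_history(system_prompt, turns, budget_tokens=100_000):
    kept, used = [], estimate_tokens(system_prompt)
    for turn in reversed(turns):     # walk newest-first
        cost = estimate_tokens(turn)
        if used + cost > budget_tokens:
            break                    # older turns get dropped
        kept.append(turn)
        used += cost
    return [system_prompt] + list(reversed(kept))
```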

That escape-hatch architecture (cheap models by default, with escalation paths to more capable ones) connects to something broader about how production AI systems should be built. If you’re writing specs for what an agent should do — the kind of structured, annotated markdown that captures intent precisely — tools like Remy treat that spec as the source of truth and compile it into a complete TypeScript stack, backend and all. The spec-first approach forces the kind of task-level clarity that makes cost routing decisions tractable later, because you’ve already articulated what each component is supposed to do.


What Changes When You Have the Map

The 110-person agriculture company that got banned from Anthropic with no explanation — appeal via Google Form, no response — is a cautionary tale about platform dependency. They didn’t know why they were banned. They didn’t have an alternative ready. Their daily workflows stopped.

That’s the other reason to run the audit now, before the spike. Not just to reduce costs, but to understand your exposure. Which workflows are locked to a single provider? Which could run on an open-weight model if a subscription got terminated or a price doubled? The effort level settings in Claude Code and local model alternatives aren’t just cost tools — they’re optionality. They give you somewhere to go.
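
That optionality can be encoded rather than improvised under pressure. A sketch of a fallback chain, assuming each provider sits behind a common call_model signature; the provider names and ordering are illustrative:

```python
# Fallback sketch: try providers in order, so a ban, outage, or price
# shock degrades a workflow instead of stopping it. All names are
# illustrative; the local open-weight model is the floor.
PROVIDERS = ["anthropic/claude-opus-4.7", "openai/gpt-5.5", "local/open-weights"]

def call_with_fallback(prompt, call_model):
    last_error = None
    for provider in PROVIDERS:
        try:
            return call_model(provider, prompt)
        except Exception as exc:   # account banned, rate limited, provider down
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```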

The companies that will handle the end of the subsidy era well are the ones that have already mapped their inference spend to business outcomes. They know what each workflow costs, what it produces, and what the alternatives are. The companies that will struggle are the ones that treated AI as a flat-fee utility and are now discovering it’s a variable-cost infrastructure with real economics.

Goldman Sachs is telling you the number. Abacus AI is telling you the timeline. The audit is how you find out where you actually stand.

Presented by MindStudio
