
Why Anthropic's 70% Inference Margins Matter for Your API Costs — And What to Expect Next

Anthropic's inference margins jumped from 38% to 70% in a year. Here's what that signals about future API pricing and model availability.

MindStudio Team

A year ago, Anthropic was keeping roughly 38 cents of every dollar it made from inference. Today, according to SemiAnalysis — considered extremely well-sourced on infrastructure economics — that number is 70 cents. Inference margins of 70%, up from 38% a year ago, are the single most important pricing signal in the API market right now, and most builders are reading it backwards.

The instinct is to see high margins and brace for price hikes. That’s the wrong model. What 70% margins actually tell you is something more interesting: the cost of running these models is falling faster than Anthropic is passing savings to customers. That gap is where your future pricing lives.

This post is about how to read that signal, what it implies for your architecture decisions over the next 12 months, and where the real cost risks are hiding.


What 70% Margins Actually Signal

Start with the mechanics. Inference margin is revenue minus the cost of compute to serve that inference, divided by revenue. When SemiAnalysis reported Anthropic’s margins at 38% last year, it meant compute was eating 62 cents of every dollar. At 70%, compute costs have dropped to 30 cents on the dollar.
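To make the mechanics concrete, here is the arithmetic in a few lines of TypeScript, using the reported figures normalized to one dollar of revenue:

```typescript
// Inference margin = (revenue - compute cost) / revenue
const inferenceMargin = (revenue: number, compute: number): number =>
  (revenue - compute) / revenue;

console.log(inferenceMargin(1.0, 0.62)); // ~0.38, last year: compute ate 62 cents per dollar
console.log(inferenceMargin(1.0, 0.3)); // ~0.70, today: compute is down to 30 cents
```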

That’s not a small shift. That’s a near-halving of the cost-to-serve in roughly twelve months.

The drivers are well understood: better hardware utilization, custom silicon (Anthropic has been working with AWS Trainium and Google TPUs), improved batching, and model distillation that lets smaller, cheaper models handle more of the inference load. The models themselves are also getting more efficient at the architecture level — you get more capability per FLOP than you did a year ago.

The implication for pricing is that Anthropic has room to cut API prices significantly without touching profitability. They’ve already done this several times. Claude 3 Haiku launched at a price point that would have been considered aggressive for a model of its capability class eighteen months prior. The pattern is consistent: margins expand, then prices drop with a lag.


Why the Revenue Numbers Make This More Complicated

Here’s where it gets interesting for builders. Anthropic’s ARR reportedly exploded from $9B to over $44B in 2026, doubling roughly every six weeks according to SemiAnalysis. Analyst Ming Li calculated that Anthropic is adding approximately $96M in ARR per day.

To put that in context: AWS took 13 years to reach $35B in annual revenue. Salesforce took over 20 years to pass $20B. Anthropic is doing this in months.

That growth rate changes the margin math. When you’re doubling revenue every six weeks, you’re also doubling your compute spend — even at 70% margins, the absolute cost numbers are enormous. Anthropic committed to spending $200 billion with Google Cloud over five years. That’s not a rounding error.

So the picture is: margins are healthy and improving, but the company is simultaneously in a race to provision enough compute to serve demand that’s growing faster than almost any business in recorded history. The constraint isn’t profitability. The constraint is supply.

If you’ve noticed Claude rate limits tightening, this is why. The demand for tokens from enterprise and API users is outpacing Anthropic’s ability to provision capacity, even as they spend aggressively on infrastructure.


What This Means for Your API Cost Architecture Right Now

Given this backdrop, here’s how to think about your API cost exposure over the next year.

The price floor is falling, but unevenly. Frontier models — Opus-class — will stay expensive relative to their capability because demand is inelastic. Enterprise customers building on Claude for high-stakes workflows will pay whatever it costs. The price compression will happen fastest in the mid-tier and small-model categories, where Anthropic can use margin headroom to undercut competitors and drive adoption.

If your application can tolerate a Haiku-class model for most tasks, you’re in the best position. The cost-per-token for capable small models will continue dropping. If you’re locked into Opus for everything, you’re exposed to a different dynamic.

Token consumption is the new seat count. The shift from subscription seats to consumption-based pricing isn’t just a business model change — it’s a risk profile change for builders. A user who runs a heavy agentic workflow can generate 10x or 100x the token consumption of a casual user. If your pricing to customers is flat and your costs are consumption-based, you have a margin squeeze problem that no amount of Anthropic price cuts will fix.

This is worth modeling explicitly. Take your p95 user by token consumption, multiply by current API rates, and ask whether your pricing covers that case. Most builders who haven’t done this math are surprised by the answer.
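A minimal sketch of that check, with placeholder rates and usage numbers — substitute your actual telemetry and the current price sheet:

```typescript
// Hypothetical per-million-token rates for a Sonnet-class model.
const INPUT_PER_MTOK = 3.0; // $ per million input tokens
const OUTPUT_PER_MTOK = 15.0; // $ per million output tokens

const monthlyCost = (inputTokens: number, outputTokens: number): number =>
  (inputTokens / 1e6) * INPUT_PER_MTOK + (outputTokens / 1e6) * OUTPUT_PER_MTOK;

// p95 user from your usage logs: 40M input / 8M output tokens per month.
const p95Cost = monthlyCost(40_000_000, 8_000_000); // $240
const flatMonthlyPrice = 49; // what that user actually pays you

console.log(p95Cost > flatMonthlyPrice ? "margin squeeze" : "covered");
```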


The enterprise JV changes the competitive landscape. Anthropic’s new joint venture — backed by Blackstone (the world’s largest alternative asset manager), Hellman & Friedman, and Goldman Sachs, with additional backing from Apollo Global Management, General Atlantic, GIC, Leonard Green, and Suko Capital — is valued at $1.5 billion with a $300M commitment from the founding partners. This isn’t just a deployment play. It’s a signal that Anthropic is building a two-tier market: enterprise customers who get embedded engineering support and custom deployments, and API customers who get the standard pricing tiers.

The forward-deployed engineer model — where Anthropic embeds engineers inside client companies to ship real integration code — is explicitly borrowed from Palantir’s playbook. Palantir was trading around $19 in 2021, dropped to $6 in 2022, then delivered a 640% return over five years, largely on the strength of this model. The stickiness of deeply integrated AI systems is the whole point. Enterprise customers who get the FDE treatment will have custom harnesses, custom fine-tuning, and deeply embedded workflows. They will not be price-sensitive in the same way API developers are.

For you, as an API builder, this means the standard pricing tiers are probably not where Anthropic’s margin pressure will come from. The enterprise tier is where they’ll extract maximum value. The API tier is where they’ll compete on price to drive volume and developer adoption.


The Real Cost Risks Hiding in Your Stack

The 70% margin story is mostly good news for API costs over time. But there are three cost risks that the margin improvement doesn’t address.

Context window costs compound fast. Longer context windows are one of the biggest drivers of unexpected API bills. A 200K context window sounds like a feature, but if your application is naively stuffing full conversation history into every call, you’re paying for tokens that aren’t doing useful work. The cost scales linearly with context length, and most applications don’t need the full window most of the time.

The fix is explicit context management: summarization at conversation boundaries, retrieval-augmented generation for long documents rather than full-context stuffing, and aggressive pruning of system prompts. The Claude Code effort levels documentation is a good reference for understanding how model reasoning depth interacts with cost — the same principle applies to context management.
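Linear scaling makes the stakes easy to quantify: at Sonnet-class input rates (about $3 per million tokens), a naively stuffed 200K-token context costs roughly $0.60 per call before the model produces a single output token. Here is a minimal sketch of the summarize-at-boundaries approach, where `summarize` stands in for a cheap Haiku-class call:

```typescript
type Turn = { role: "user" | "assistant"; content: string };

const RECENT_TURNS = 8; // keep this many turns verbatim; tune per application

// Collapse older history into a running summary instead of resending it all.
async function buildContext(
  history: Turn[],
  summarize: (turns: Turn[]) => Promise<string>, // hypothetical helper
): Promise<{ summary: string; turns: Turn[] }> {
  if (history.length <= RECENT_TURNS) return { summary: "", turns: history };
  const older = history.slice(0, -RECENT_TURNS);
  const recent = history.slice(-RECENT_TURNS);
  // One cheap summarization call replaces thousands of replayed tokens.
  return { summary: await summarize(older), turns: recent };
}
```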

Model routing is underused. Most applications use one model for everything. This is almost always wrong from a cost perspective. Classification, extraction, and simple Q&A tasks don’t need Opus. Routing those to Haiku and reserving Sonnet or Opus for complex reasoning tasks can cut costs by 80-90% on the right workload mix.

If you’re building on MindStudio, this kind of multi-model routing is handled at the platform level — you get access to 200+ models and can set routing rules without writing orchestration code. For teams building custom stacks, the routing logic itself is straightforward but requires explicit instrumentation to know which tasks are actually going to which models.
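For the custom-stack case, the core of the router can be as small as this — the task taxonomy and model aliases are illustrative, and the hard part is the instrumentation that tells you which tasks you actually have:

```typescript
type Task = "classify" | "extract" | "simple-qa" | "complex-reasoning";

function pickModel(task: Task): string {
  switch (task) {
    case "classify":
    case "extract":
    case "simple-qa":
      return "claude-3-5-haiku-latest"; // cheap tier handles the bulk
    case "complex-reasoning":
      return "claude-3-5-sonnet-latest"; // reserve the expensive tier
  }
}
```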

Caching is still underimplemented. Anthropic’s prompt caching feature lets you cache the prefix of a prompt and pay a fraction of the cost on subsequent calls that share that prefix. For applications with long system prompts or shared context — which is most production applications — this is a significant lever. The discount is substantial: cache reads cost roughly 10% of the normal input-token price on Claude 3.5 Sonnet, with a modest one-time premium to write the cache.


The implementation requires structuring your prompts so the stable prefix comes first and the variable content comes last. It’s a small architectural change with a large cost impact. If you haven’t implemented this yet, it’s probably the highest-ROI optimization available right now. The OpenRouter free model routing guide covers some of the same cost-reduction thinking from a different angle.
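Here is roughly what that restructuring looks like with Anthropic’s TypeScript SDK — the prompt contents and model alias are placeholders:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const LONG_STABLE_SYSTEM_PROMPT = "..."; // your multi-thousand-token stable prefix

async function cachedCall(userQuery: string) {
  return client.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 1024,
    // Stable prefix first, marked cacheable; variable content comes last.
    system: [
      {
        type: "text",
        text: LONG_STABLE_SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: userQuery }],
  });
}
```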


How to Think About Pricing Stability

One question I get asked a lot: should I lock in pricing commitments now, or wait for prices to fall?

The honest answer is that the trajectory is clearly downward for standard API tiers, but the timing is unpredictable. Anthropic has cut prices multiple times in the past two years, and the 70% margin figure suggests there’s room to cut again. But the supply constraint is real — if demand continues doubling every six weeks, Anthropic has less incentive to cut prices aggressively because they’re capacity-constrained, not demand-constrained.

My read is that the next 6-12 months will see continued price cuts on the smaller models (Haiku-class) and more modest movement on frontier models. The enterprise tier will stay expensive because it’s not really competing on price — it’s competing on deployment quality and integration depth.

For your architecture decisions: build for model portability. Don’t hard-code assumptions about which specific model you’ll use for a given task. Abstract the model selection layer so you can swap in cheaper alternatives as they become available. The Anthropic vs OpenAI vs Google agent strategy comparison is useful context here — the competitive dynamics between labs are a significant driver of price compression, and that competition is intensifying.
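A sketch of that abstraction layer, with hypothetical capability names and model aliases — the point is that callers ask for a capability, so remapping to a cheaper model is a one-line config change:

```typescript
type Capability = "cheap-fast" | "balanced" | "frontier";

// Swap entries here as cheaper alternatives arrive; no caller changes needed.
const MODEL_MAP: Record<Capability, string> = {
  "cheap-fast": "claude-3-5-haiku-latest",
  "balanced": "claude-3-5-sonnet-latest",
  "frontier": "claude-3-opus-latest",
};

const resolveModel = (capability: Capability): string => MODEL_MAP[capability];
```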


What the Enterprise Push Means for API Developers

There’s a version of this story where Anthropic’s enterprise JV is bad news for API developers: the company prioritizes its high-margin enterprise customers, API capacity gets squeezed, and the developer tier becomes a second-class experience.

I don’t think that’s the right read, but it’s worth taking seriously. The counter-argument is that Anthropic’s enterprise deployments are built on the same models and infrastructure as the API. Making the models better for enterprise customers makes them better for everyone. The forward-deployed engineers building custom harnesses for Blackstone are generating real-world feedback that improves the models.

The more relevant concern is capacity allocation. When tokens are scarce, who gets priority? Enterprise contracts with committed spend will almost certainly get better SLAs than pay-as-you-go API users. If you’re building a production application on the standard API tier, you should have a fallback model strategy for when Claude is rate-limited. The local model cost reduction guide covers one approach — running open-source models locally for tasks that don’t require frontier capability.
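A minimal sketch of that fallback chain using the TypeScript SDK — the model list is a placeholder, and the last entry could just as easily be a locally hosted endpoint:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Try models in order; on a 429 (rate limit), fall through to the next one.
async function completeWithFallback(prompt: string) {
  const chain = ["claude-3-5-sonnet-latest", "claude-3-5-haiku-latest"];
  for (const model of chain) {
    try {
      return await client.messages.create({
        model,
        max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
      });
    } catch (err) {
      if (err instanceof Anthropic.APIError && err.status === 429) continue;
      throw err; // anything other than a rate limit is a real failure
    }
  }
  throw new Error("all models in the fallback chain are rate-limited");
}
```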

Tools like Remy take a different approach to the build layer entirely: you write an annotated markdown spec and it compiles into a complete TypeScript backend, database, auth, and deployment. The spec is the source of truth; the generated code is derived output. When the underlying model costs change, you’re not rewriting application logic — you’re recompiling from a spec that hasn’t changed.


The Number to Watch

SemiAnalysis reported 70% inference margins. That number will keep moving.


If it reaches 80%, expect another round of meaningful price cuts within 6-9 months. If it stalls or drops — which could happen if Anthropic has to provision expensive new capacity faster than efficiency gains accumulate — prices stay flat and the capacity constraint gets worse before it gets better.

The OpenAI development company comparison is instructive here. OpenAI is raising $4B from 19 investors against a $10B valuation for their own enterprise deployment vehicle, with zero investor overlap with Anthropic’s JV. Two parallel enterprise deployment machines, both capacity-constrained, both targeting different segments of the financial and enterprise market. The competition between them is the best structural guarantee that API prices will continue falling over time.

Watch the margin number. Watch the capacity announcements. And build your cost architecture assuming the models you’re using today will be 30-50% cheaper in 18 months — because if the last two years are any guide, they will be.

Presented by MindStudio
