
Anthropic's Compute Shortage: Why Claude Limits Are Getting Worse

Anthropic underinvested in compute and now can't serve demand. Here's why Claude quotas are tightening, what it means for developers, and what comes next.

MindStudio Team

The Compute Crisis Nobody Wanted to Talk About

If you’ve been using Claude regularly in 2026, you’ve probably noticed something: the quotas are tighter than they were a year ago. Pro plan users are hitting walls faster. API rate limits feel more restrictive. And if you’re running agentic workflows or heavy Claude Code sessions, you’re likely burning through your allowance in hours rather than days.

This isn’t a billing glitch or a policy change for its own sake. Anthropic has a genuine compute shortage — and the decisions that created it were made years ago, when the company was still small enough that “we’ll figure out infrastructure later” seemed reasonable.

Here’s what’s actually happening, why it happened, and what it means if Claude is part of your development stack.


How Anthropic Got Into This Position

Anthropic was founded in 2021 by former OpenAI researchers. From the start, its positioning was different from OpenAI and Google: safety-focused, research-first, cautious about scaling. That philosophy shaped everything — including how aggressively they invested in compute infrastructure.

OpenAI had a $1 billion check from Microsoft in 2019 and a deep compute partnership that gave them access to Azure’s global data center footprint. Google built its own TPU infrastructure over years. Both companies treated compute as a strategic asset and invested accordingly.

Anthropic’s approach was different. They raised capital in tranches, scaled more carefully, and relied heavily on cloud providers for inference rather than building out owned infrastructure at the same pace. That worked fine when Claude was a research model with limited public access. It became a problem when Claude 3 went viral.

The compute shortfall isn’t a secret. Anthropic’s CEO Dario Amodei has publicly described the company as compute-constrained and has pointed to it as a primary bottleneck on model development and deployment velocity. The company has raised billions since — including a massive funding round in late 2024 — but data center capacity doesn’t appear overnight. Ordering GPUs, signing colocation deals, and provisioning infrastructure takes 12 to 24 months. Capital from that late-2024 round starts becoming usable capacity in late 2026 or 2027; money raised today lands even later.

In the meantime, demand has kept growing. And the shape of that demand has changed in ways that make the problem significantly worse.


Why Agentic Usage Is the Real Multiplier

A year ago, most Claude interactions looked like this: a user sends a prompt, Claude responds, done. Maybe a few back-and-forth turns. Total token consumption: a few thousand tokens per session, at most.

Agentic workloads look nothing like that.

Claude Code sessions can consume tens or hundreds of thousands of tokens in a single sitting. The model reads files, generates code, checks output, iterates, calls tools, maintains context across long sequences of steps. Each cycle burns tokens. Multiply that by thousands of active users running agent loops simultaneously, and the infrastructure math changes completely.

The 1M token context window that Claude now supports compounds this further. Longer context means more compute per request — not just proportionally, but super-linearly, because attention cost grows roughly with the square of context length. As a back-of-the-envelope illustration: a 500K-token context needs on the order of 10,000 times the attention compute of a 5K-token one, so a single long-context request can consume more compute than thousands of short-form requests.

Anthropic underestimated how fast this shift would happen. When they planned their infrastructure in 2022 and 2023, they modeled against typical LLM usage patterns. The agentic use case — where Claude isn’t just answering questions but actively executing multi-step tasks — wasn’t the primary scenario. Now it is, for a growing chunk of their highest-value users.

Inference costs have emerged as a genuine wall across the AI industry, and Anthropic is hitting that wall more acutely than most because it has less owned infrastructure to absorb demand spikes.


What’s Actually Getting Restricted

The tightening shows up in several places, depending on how you use Claude.

Claude.ai Pro and Max Plans

Consumer-facing plans have seen usage limits tighten over the past year. Pro users have reported that their daily message allowances with Opus models are lower than they used to be. Max plan users, who pay significantly more for expanded access, are still hitting limits — just at a higher threshold.

Anthropic has been transparent that limits exist but has been less specific about the mechanics. The general pattern is that heavier models (Opus) are more constrained than lighter ones (Haiku, Sonnet), and complex multi-turn sessions eat quotas faster than simple queries.

How Anthropic’s prompt caching affects your subscription limits is worth understanding: caching can meaningfully stretch your allowance, but it only helps on repeated context, not novel requests.

API Rate Limits

On the API side, rate limits have been adjusted by tier. The effect is that developers building production systems need to either pay more for higher tiers or architect around the constraints more carefully.

Token-based pricing means that at the API level, you’re theoretically limited only by budget. But in practice, there are throughput limits (requests per minute, tokens per minute) that matter for real-time applications. Those have been squeezed as Anthropic tries to spread available capacity more evenly.

Claude Code Specifically

Claude Code has attracted a large, highly engaged user base of developers running continuous, token-heavy sessions. The Claude Code Ultra plan was in part a response to this — a higher tier for users who need more throughput. But even Ultra users have reported that sustained, heavy sessions eventually run into limits.

Managing token budgets in agentic workflows isn’t just a nice optimization anymore. For developers who depend on Claude Code for production workflows, it’s a core operational concern.

The OpenClaw Ban as a Signal

The Anthropic OpenClaw ban — where Anthropic blocked third-party harnesses from using OAuth to access Claude subscriptions — is worth reading in this light. Officially it was about terms of service compliance. But the practical effect was to cut off a class of high-volume, automated usage that was consuming disproportionate compute on consumer-tier pricing. It’s one of several signals that Anthropic is actively managing compute scarcity, not just reacting to it.


The Competitive Gap This Creates

Anthropic’s compute shortage creates a real competitive problem.

OpenAI has Microsoft’s infrastructure. When OpenAI needs more capacity, it can lean on Azure at scale. Google has its own TPUs, its own data centers, and decades of experience running global infrastructure for billions of users. Both companies can absorb demand spikes more readily.

This isn’t just about who can serve more requests. It’s about which company can iterate faster. Training new models requires compute. Running safety evaluations requires compute. The lab that’s compute-constrained is the one that’s forced to make harder tradeoffs between research velocity and serving existing demand.

Anthropic has been open about this publicly. Dario Amodei has said the company needs more compute than it currently has and that scaling up infrastructure is a top priority. The fundraising reflects that: the company has been raising at a pace that suggests they understand the infrastructure gap and are trying to close it.

But closing it takes time. And in the meantime, users feel the constraint.


Why Limits Will Likely Get Worse Before They Get Better

Here’s the uncomfortable part: the next 12 to 18 months are probably the worst period for Claude availability, not the best.

Several factors converge to make this true.

Demand is still growing. Agentic workflows are becoming mainstream, not niche. Every enterprise team that adds an AI agent to their stack adds another source of continuous, high-volume token consumption. Anthropic’s user base is growing, and the average compute consumption per user is growing faster.

New capabilities require more compute. Each generation of Claude has been more capable — and more compute-hungry per request. The tipping point in capability that recent models represent came with a corresponding jump in inference costs. Better models are more expensive to run.

Data center buildout is slow. Even with aggressive investment, new capacity won’t come online until late 2026 at the earliest. The data center infrastructure constraints affecting the whole industry — energy, land, regulatory approvals — are real. Anthropic can write checks faster than data centers can be built.

The sub-agent era multiplies everything. The shift toward architectures where smaller, faster models handle sub-tasks while larger models handle reasoning means total token consumption per workflow is going up, not down. Even if individual requests get cheaper, the total compute consumed per user outcome keeps rising.

The tightening isn’t a temporary blip that goes away when Anthropic raises another round. It’s a structural condition that persists until owned infrastructure catches up with demand.


What This Means for Developers Building on Claude

If you’re building production systems that depend on Claude, this situation has direct implications for your architecture decisions.

Don’t Assume Unlimited Capacity

Any production system that treats Claude API access as essentially unlimited is making a bet that may not hold. Quotas change. Rate limits adjust. Systems built assuming generous throughput suddenly fail when limits tighten.

Build rate limit handling into your architecture from the start. Implement retry logic with exponential backoff. Monitor your consumption and set alerts before you hit walls, not after.
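Here’s a minimal sketch of that pattern using the Anthropic Python SDK (the model id, retry budget, and backoff schedule are placeholders; tune them to your tier and latency tolerance):

```python
import random
import time

import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def create_with_backoff(max_retries: int = 5, **request):
    """Call messages.create, retrying rate-limit errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.messages.create(**request)
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Back off 1s, 2s, 4s, 8s... plus jitter so retries don't synchronize.
            time.sleep(2 ** attempt + random.uniform(0, 1))

response = create_with_backoff(
    model="claude-sonnet-4-5",  # placeholder id; use whatever your tier offers
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this deploy log: ..."}],
)
```

A wrapper like this is also a natural place to hang consumption metrics and alerting, so you see the wall coming instead of discovering it in production.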

Optimize Token Usage Aggressively

Optimizing token costs with multi-model routing is no longer optional if you’re running at scale. Sending every task to Opus when Haiku or Sonnet would do the job is both expensive and quota-intensive. Route by task complexity.
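A routing layer can start as a simple lookup keyed on a complexity heuristic. A sketch, with placeholder model ids and a deliberately crude heuristic; real routing would be tuned to your workload:

```python
# Placeholder model ids; substitute the current ids for your account.
MODELS = {
    "simple":  "claude-haiku-4-5",
    "default": "claude-sonnet-4-5",
    "complex": "claude-opus-4-5",
}

def pick_model(task: str, needs_deep_reasoning: bool = False) -> str:
    """Crude complexity routing: escalate only when the task demands it."""
    if needs_deep_reasoning:
        return MODELS["complex"]
    if len(task) < 500:  # short, self-contained tasks rarely need a frontier model
        return MODELS["simple"]
    return MODELS["default"]
```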

The Anthropic advisor strategy — using Opus as a high-level planner with Haiku or Sonnet handling execution — is a practical architecture that reduces per-workflow compute consumption without meaningfully sacrificing quality.
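A sketch of that advisor shape, again with placeholder model ids: the expensive model is called once to plan, and the cheap model once per step:

```python
import anthropic

client = anthropic.Anthropic()

def plan_then_execute(goal: str) -> list[str]:
    """One expensive planning call, then cheap per-step execution calls."""
    plan = client.messages.create(
        model="claude-opus-4-5",  # placeholder id: the planner/advisor
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": f"Break this goal into short, numbered steps:\n{goal}"}],
    ).content[0].text

    results = []
    for step in plan.splitlines():
        if not step.strip():
            continue
        result = client.messages.create(
            model="claude-haiku-4-5",  # placeholder id: the executor
            max_tokens=1024,
            messages=[{"role": "user", "content": step}],
        )
        results.append(result.content[0].text)
    return results
```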

Build for Model Portability

The deeper lesson is that betting your entire stack on a single model provider creates fragility. If Claude limits tighten or pricing shifts, you want the ability to swap in an alternative without rebuilding your application from scratch.

Multi-LLM flexibility in your agent infrastructure isn’t just about cost — it’s about resilience. When Anthropic is compute-constrained, having the ability to route some workloads to Gemini or GPT-4o means your system keeps running.


How Remy Handles Model Dependency

This is where the underlying infrastructure matters.

Remy is model-agnostic by design. It uses the best model available for each task — today that includes Claude Opus for core agent reasoning, Sonnet for specialist tasks, and other models where they’re the right fit. But the application spec you write doesn’t lock you to any specific model.

When model availability or pricing changes — and with Anthropic’s compute constraints, both will keep changing — Remy adapts at the infrastructure level without requiring you to modify your application. Your spec stays the same. The compiled output gets better or more cost-efficient as the model landscape evolves.

This matters specifically in the context of Claude limits. If you’re building a full-stack application through Remy and Anthropic tightens capacity again, you’re not stuck scrambling to refactor your entire codebase to work with a different provider. The abstraction layer handles it.

You can try Remy at mindstudio.ai/remy.


What Anthropic Is Doing About It

To be fair to Anthropic, they’re not ignoring the problem.

The company has committed to major infrastructure investment and has been building out compute partnerships beyond its original arrangements. They’ve signed deals for access to a larger pool of GPU capacity, and the recent fundraising rounds have included explicit compute build-out as a stated use of funds.

On the product side, they’ve introduced pricing changes designed to better align compute costs with usage. Flat-rate long-context pricing was one such move — it changed the economics for users with large context needs in a way that made long-context usage more predictable for both Anthropic and developers.

Prompt caching is another piece of the efficiency puzzle. By caching repeated context and serving it without recomputation, Anthropic reduces the per-token compute cost for sessions with stable system prompts or repeated documents. This helps stretch capacity without building new hardware.
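In the Messages API, caching is opt-in: you mark a stable prefix (a system prompt or a large reference document) with a cache_control block. A minimal sketch, with a placeholder model id and file name:

```python
import anthropic

client = anthropic.Anthropic()

with open("runbook.md") as f:
    reference_doc = f.read()  # stable context reused across many calls

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder id
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": reference_doc,
        # Mark the stable prefix as cacheable; subsequent calls that reuse
        # the same prefix are billed at the cheaper cache-read rate.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "What's the rollback procedure?"}],
)
```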

And the broader Anthropic platform strategy — building Claude Code, Co-Work, and related tools as first-party surfaces — is in part about capturing revenue from high-value users in a way that can fund the infrastructure investment required to serve them sustainably.

None of this resolves the shortage in the short term. But it suggests Anthropic understands the problem and is working on it with real urgency.


Frequently Asked Questions

Why are Claude limits getting tighter if Anthropic is raising more money?

Fundraising and capacity are on different timelines. Capital raised today takes 12 to 24 months to translate into deployed compute infrastructure — you need to order hardware, build or lease data center space, provision power, and deploy the systems. Demand is growing faster than the infrastructure that capital will eventually fund. The squeeze is real in the near term even if the trajectory is improving.

Does this affect the API the same way as Claude.ai subscriptions?

Both are affected, but differently. Claude.ai subscription limits are about fairness across a shared consumer pool — how many messages or tokens per day a given tier allows. API limits are about throughput (requests per minute, tokens per minute) by tier. API users generally have more flexibility through higher-tier access, but both ultimately reflect the same underlying constraint: total available inference capacity.

Will Claude 4 or newer models make the shortage worse?

Probably yes, in the short term. Each new model generation tends to require more compute per request than its predecessor, because capability improvements come partly from larger architectures and longer thinking chains. Claude 4.5, for example, is differentiated partly by reasoning improvements that are computationally expensive. Over time, efficiency improvements reduce per-token costs — but at launch, new flagship models typically increase per-request compute, not decrease it.

Should I migrate off Claude entirely because of this?

That’s probably the wrong frame. Claude is still one of the most capable models available, and Anthropic’s compute situation is improving. The better question is: should you architect so that you could migrate or route to alternatives if needed? Yes. Comparing Claude, ChatGPT, and Gemini for your specific use case is worth doing, not to replace Claude wholesale, but to understand your options if Anthropic’s constraints affect your workloads.

How can I get more out of my existing Claude quota?

A few practical approaches: use prompt caching for sessions with repeated context, route simpler tasks to Haiku or Sonnet rather than Opus, implement session management to avoid redundant context in long conversations, and manage your Claude session limits proactively rather than reactively. Token optimization should be treated as a first-class engineering concern, not an afterthought.
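Session management is the least discussed of those. One simple version is trimming history to a budget before each call; a sketch that uses a rough character budget as a stand-in for real token counting:

```python
def trim_history(messages: list[dict], max_chars: int = 40_000) -> list[dict]:
    """Keep the most recent turns that fit a rough character budget.

    Characters are a crude proxy for tokens; swap in a real token counter
    (or the API's token-counting endpoint) if you need precision.
    """
    kept, total = [], 0
    for message in reversed(messages):
        total += len(message["content"])
        if total > max_chars:
            break
        kept.append(message)
    return list(reversed(kept))
```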

Is Anthropic’s compute shortage unique, or is this an industry-wide problem?

It’s both. The whole industry is dealing with inference cost pressure and infrastructure constraints. But Anthropic’s position is more acute because they have a smaller owned infrastructure footprint than OpenAI or Google. The compute paradox affecting small AI builders applies to Anthropic itself — they’re a large company by most measures, but still dependent on third-party compute in ways that Google and Microsoft are not.


Key Takeaways

  • Anthropic underinvested in compute infrastructure relative to the demand its models would generate, and the gap is now visible in tightening quotas and rate limits.
  • Agentic use cases — Claude Code, multi-step agent loops, long-context workloads — consume dramatically more compute per user than traditional chat usage, multiplying the pressure.
  • New infrastructure investment will take 12 to 24 months to translate into available capacity, meaning limits will likely stay tight or get tighter through 2026.
  • Developers building on Claude should build rate limit handling into their systems, optimize token usage aggressively, and architect for model portability.
  • Anthropic is actively working on the problem — prompt caching, tiered pricing, infrastructure investment — but none of it resolves the shortage quickly.
  • The most durable response is to stop treating any single model provider as unlimited infrastructure, and build systems that can adapt as the compute landscape shifts.

If you want to build applications that aren’t hostage to any single model provider’s capacity constraints, try Remy — spec-driven development with model-agnostic infrastructure underneath.

Presented by MindStudio
