What Is the AI Token Cost Crisis? Why Uber Burned Its Entire AI Budget in 4 Months
Uber's engineers spent $500–$2,000/month each on AI tokens. Learn why token costs are exploding in agentic workflows and how to manage them.
The Hidden Tax Eating Enterprise AI Budgets
When Uber’s engineering leadership reviewed their AI spending last year, they found something alarming: individual engineers were burning through $500 to $2,000 per month each on AI tokens — just in development and testing workflows. Some teams had consumed their entire quarterly AI budget within four months of the year starting.
This wasn’t reckless spending. These were engineers doing their jobs — testing prompts, building internal tools, running automated workflows. The cost spiral happened because nobody had designed the systems with token consumption in mind.
The AI token cost crisis is a real problem hitting enterprises right now. It’s not a hypothetical future risk. Companies are watching their AI budgets evaporate at 3–10x the rate they projected, and most teams don’t fully understand why it’s happening or what to do about it.
This article explains what’s driving the explosion in enterprise AI token costs, why agentic workflows are the primary culprit, and what you can actually do to get costs under control.
What Tokens Are and Why They Add Up Fast
Before getting into the crisis itself, it helps to understand the basic economics.
When you send text to an LLM like GPT-4 or Claude, everything (your instructions, your input, the model’s response) gets converted into tokens. A token is roughly 0.75 words in English. The provider charges you for both input tokens (what you send) and output tokens (what the model generates back).
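For a single, simple API call, the cost is almost nothing. GPT-4o costs around $2.50 per million input tokens. A typical 500-word query with a 500-word response costs about $0.003 at that rate, and still under a cent once output tokens, which are priced higher than input tokens, are counted. That’s fractions of a cent.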
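The arithmetic behind that estimate can be sketched in a few lines. This is a rough illustration, not a billing tool: the 0.75 words-per-token ratio and the $2.50 input price come from the text, while the function names and the $10.00 output price are assumptions added here to show how a higher output rate changes the total.

```python
WORDS_PER_TOKEN = 0.75  # rough English average quoted in the text


def estimate_tokens(word_count: int) -> float:
    """Convert a word count to an approximate token count."""
    return word_count / WORDS_PER_TOKEN


def call_cost(input_words: int, output_words: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one call, given per-million-token prices."""
    input_tokens = estimate_tokens(input_words)
    output_tokens = estimate_tokens(output_words)
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Both directions priced at the $2.50 input rate from the text.
flat_rate = call_cost(500, 500, 2.50, 2.50)
# Output priced at an assumed 4x the input rate.
pricier_output = call_cost(500, 500, 2.50, 10.00)

print(f"${flat_rate:.4f}")       # $0.0033
print(f"${pricier_output:.4f}")  # $0.0083
```

Either way, a single call stays well under one cent.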
But that math breaks down completely the moment you move beyond one-off queries.
Where the Real Costs Live
The issue isn’t individual API calls — it’s what happens at scale and in complex workflows:
- Repeated context injection. Every time you call an LLM, you typically send the full system prompt, any relevant documents, and conversation history along with it. If your system prompt is 2,000 tokens, those 2,000 tokens get charged on every single call.
- Document-heavy prompts. RAG (retrieval-augmented generation) systems pull in chunks of documents with each query. A single query might inject 10,000–50,000 tokens of context.
- High-volume automation. If an agent runs 1,000 times a day — processing emails, routing tickets, summarizing reports — costs multiply accordingly.
- Premium model usage. Not all models are priced equally. GPT-4o is roughly 15x more expensive per token than GPT-4o-mini. Using frontier models for tasks that don’t require them is one of the fastest ways to overspend.
None of these are problems in isolation. The crisis happens when they combine — which is exactly what happens in agentic workflows.
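To put a number on the first item in the list, here is a minimal sketch of what re-sending a system prompt costs over a month. It combines the 2,000-token prompt and the 1,000 runs per day mentioned above with the $2.50 per million input price quoted earlier. Combining those three figures, the 30-day month, and the function name are assumptions made for illustration.

```python
def monthly_prompt_overhead(system_prompt_tokens: int, calls_per_day: int,
                            input_price_per_m: float, days: int = 30) -> float:
    """Dollars per month spent re-sending the same system prompt."""
    tokens_per_month = system_prompt_tokens * calls_per_day * days
    return tokens_per_month * input_price_per_m / 1_000_000


overhead = monthly_prompt_overhead(2_000, 1_000, 2.50)
print(f"${overhead:.2f} per month")  # $150.00 per month
```

That is the cost of the prompt alone, before any documents, history, or output tokens are added.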
Why Agentic Workflows Break the Cost Model
Here’s the thing that catches most teams off guard: an agent doesn’t make one LLM call. It makes many.
A simple AI agent that helps a support rep handle a customer complaint might:
- Call the LLM to understand the issue (1 call)
- Query a knowledge base and re-call the LLM with retrieved context (1 call)
- Draft a response (1 call)
- Check if the response meets quality criteria (1 call)
- If not, revise and check again (2+ more calls)
That’s 5–7 LLM calls for a single task. Each call carries the full context: system prompt, documents, conversation history.
Now imagine that agent handles 500 support tickets a day. What looked like a simple automation is actually 2,500–3,500 LLM calls daily, each potentially carrying 10,000+ tokens. The math gets ugly fast.
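A small sketch makes the multiplication explicit. The 500 tickets, 5–7 calls per task, and 10,000 tokens per call are the figures above. Pricing them at the $2.50 per million input rate quoted earlier is an assumption, and output tokens are left out.

```python
def daily_agent_load(tasks_per_day: int, calls_per_task: int,
                     tokens_per_call: int, input_price_per_m: float):
    """Return (calls, input tokens, input cost in dollars) for one day."""
    calls = tasks_per_day * calls_per_task
    tokens = calls * tokens_per_call
    cost = tokens * input_price_per_m / 1_000_000
    return calls, tokens, cost


for calls_per_task in (5, 7):
    calls, tokens, cost = daily_agent_load(500, calls_per_task, 10_000, 2.50)
    print(f"{calls_per_task} calls/task: {calls:,} calls, "
          f"{tokens:,} tokens, ${cost:.2f}/day")
# 5 calls/task: 2,500 calls, 25,000,000 tokens, $62.50/day
# 7 calls/task: 3,500 calls, 35,000,000 tokens, $87.50/day
```

Every term is a multiplier, so doubling any one of them (tickets, calls per task, or context size) doubles the bill.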
Multi-Agent Systems Multiply the Problem
Modern enterprise AI architectures often use multiple specialized agents working together — an orchestrator agent that coordinates several sub-agents, each with their own system prompts and context.
Each handoff between an orchestrator and a sub-agent typically moves:
- The orchestrator’s own understanding of the task (tokens)
- Relevant context from previous steps (tokens)
- Instructions for the sub-agent (tokens)
- The sub-agent’s output, passed back to the orchestrator (tokens)
Each handoff between agents is a new LLM call with a full context payload. A multi-agent pipeline that looks elegant on a whiteboard can become an extremely expensive operation in production.
This is a big part of what happened at Uber. Engineers building with multi-agent architectures didn’t realize how quickly the call count — and the cost — would compound.
The Context Window Trap
As LLM context windows have expanded (some models now support 1–2 million tokens), teams have started using them more aggressively. The temptation is real: why bother with complex retrieval when you can just stuff the entire codebase or document library into the context?
The problem is that longer contexts don’t just cost more per call — many providers charge disproportionately more for very long contexts. Anthropic’s Claude, for example, uses a tiered pricing structure where tokens beyond a certain threshold cost more. Sending a 500,000-token context on every call is economically ruinous at any meaningful volume.
The Six Patterns That Cause Token Cost Explosions
Looking across enterprise AI deployments, there are six recurring patterns that cause costs to spiral:
1. Over-using frontier models. Teams default to GPT-4o or Claude Opus for everything — including tasks where a smaller, cheaper model would perform just as well. The 15–40x price difference between frontier and mid-tier models means this choice alone can 10x your costs.
2. Unoptimized system prompts. System prompts often balloon over time as developers keep adding instructions. A 5,000-token system prompt that fires on every call adds about $12,500 per million calls at $2.50 per million input tokens (roughly $12.50 per thousand calls) for the system prompt alone, before you count any actual content.
3. Missing caching. Most leading API providers offer prompt caching — if you send the same prefix repeatedly (like a static system prompt), they cache it and charge significantly less for cached tokens. Many teams don’t set this up.
4. No token budget per workflow. Without explicit limits, agents will use as many tokens as they need. An agent that’s slightly misconfigured can enter a loop, making dozens of calls on a single task with no circuit breaker to stop it.
5. Development and production on the same billing. Engineers testing workflows locally or in staging against production-tier models burn real budget. Uber’s problem was partly this — development usage wasn’t separated from production budgets.
6. No cost attribution. If you can’t see which workflows, teams, or use cases are driving spend, you can’t fix anything. Many teams operate blind because they’re looking at a single API bill with no breakdown.
What the Numbers Actually Look Like
To make this concrete, here’s a rough cost model for a common enterprise use case: an AI agent that processes incoming sales emails, extracts key information, and creates CRM entries.
Assume:
- 500 emails processed per day
- 3 LLM calls per email (parse, extract, validate)
- Average 4,000 tokens per call (system prompt + email content + instructions)
- Using GPT-4o at $2.50/million input tokens
- Daily token count: 500 × 3 × 4,000 = 6,000,000 tokens
- Daily cost: $15
- Monthly cost: ~$450
That’s manageable. Now add a validation loop that occasionally fires (say, 20% of emails need a second pass), use Claude Opus instead of GPT-4o, and let the system prompt grow to 8,000 tokens:
The same workflow now costs $3,000–$4,000/month.
Scale that across a dozen different automation workflows running across a 500-person engineering org, and you’re looking at serious budget exposure — exactly the situation Uber found itself in.
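The baseline estimate above can be reproduced with a small cost model, which also makes it easy to plug in your own numbers. This is a sketch under the same simplifications as the text: it counts input tokens only and uses a 30-day month. The class and field names are illustrative, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class WorkflowCostModel:
    items_per_day: int
    calls_per_item: float      # may be fractional if some items need retries
    tokens_per_call: int
    input_price_per_m: float   # dollars per million input tokens

    def daily_tokens(self) -> float:
        return self.items_per_day * self.calls_per_item * self.tokens_per_call

    def daily_cost(self) -> float:
        return self.daily_tokens() * self.input_price_per_m / 1_000_000

    def monthly_cost(self, days: int = 30) -> float:
        return self.daily_cost() * days


# Baseline from the text: 500 emails, 3 calls each, 4,000 tokens, $2.50/M.
baseline = WorkflowCostModel(500, 3, 4_000, 2.50)
print(f"{baseline.daily_tokens():,.0f} tokens/day")  # 6,000,000 tokens/day
print(f"${baseline.daily_cost():.2f}/day")           # $15.00/day
print(f"${baseline.monthly_cost():.2f}/month")       # $450.00/month
```

To explore a heavier scenario, raise `calls_per_item` for retry loops, raise `tokens_per_call` for a longer system prompt, and set `input_price_per_m` to your provider's current rate for the model you actually use.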
How to Actually Control Token Costs
The good news is that token cost optimization is an engineering problem with real solutions. Here’s what actually works.
Right-Size Your Models
Not every task needs a frontier model. A rough hierarchy:
- Complex reasoning, nuanced writing, code generation → Frontier models (GPT-4o, Claude Sonnet, Gemini 1.5 Pro)
- Classification, extraction, summarization → Mid-tier models (GPT-4o-mini, Claude Haiku, Gemini Flash)
- Simple routing, formatting, validation → Small/fast models or even rule-based logic
Running classification tasks on GPT-4o-mini instead of GPT-4o typically cuts costs by 15–30x with negligible quality difference for straightforward tasks.
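One way to encode the hierarchy above is a lookup table that maps task types to model tiers. This is a minimal sketch: the task labels and tier names are illustrative, and which concrete model sits behind each tier is left to your own configuration.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"
    MID = "mid-tier"
    SMALL = "small-or-rules"


# Task categories follow the hierarchy in the text; the labels are illustrative.
TASK_TIERS = {
    "reasoning": Tier.FRONTIER,
    "writing": Tier.FRONTIER,
    "code_generation": Tier.FRONTIER,
    "classification": Tier.MID,
    "extraction": Tier.MID,
    "summarization": Tier.MID,
    "routing": Tier.SMALL,
    "formatting": Tier.SMALL,
    "validation": Tier.SMALL,
}


def pick_tier(task_type: str) -> Tier:
    """Map a task type to a model tier; unknown tasks fall back to FRONTIER."""
    return TASK_TIERS.get(task_type, Tier.FRONTIER)


print(pick_tier("classification").value)  # mid-tier
print(pick_tier("something_new").value)   # frontier
```

Falling back to the frontier tier for unknown tasks favors quality over cost. A cost-first team might default to the mid tier and escalate only on failure.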
Implement Prompt Caching
OpenAI, Anthropic, and Google all offer some form of prompt caching. When you structure your prompts so the static parts (system prompt, background context, instructions) come first and the variable parts (the actual input) come last, the provider can cache the static portion and charge you a fraction of the normal price for it on repeated calls.
This single optimization can cut costs by 60–80% for workflows with long, stable system prompts.
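A rough model shows where savings of that size can come from. The sketch below assumes a 5,000-token static prefix, 500 variable tokens per call, and cached tokens billed at 25% of the normal price. All three numbers and the function name are hypothetical. Actual discounts, cache lifetimes, and minimum prefix lengths differ by provider, so check their documentation.

```python
def input_cost_with_cache(static_tokens: int, variable_tokens: int,
                          calls: int, price_per_m: float,
                          cached_price_fraction: float) -> float:
    """Input cost in dollars when the static prefix is cached after call one.

    Simplification: the first call pays full price for everything, and every
    later call pays `cached_price_fraction` of the normal price for the
    static prefix. Cache expiry and cache-write fees are ignored.
    """
    full = (static_tokens + variable_tokens) * price_per_m / 1_000_000
    cached = (static_tokens * cached_price_fraction
              + variable_tokens) * price_per_m / 1_000_000
    return full + cached * (calls - 1)


no_cache = input_cost_with_cache(5_000, 500, 1_000, 2.50, 1.0)
with_cache = input_cost_with_cache(5_000, 500, 1_000, 2.50, 0.25)
savings = 1 - with_cache / no_cache

print(f"${no_cache:.2f} without caching")  # $13.75 without caching
print(f"${with_cache:.2f} with caching")   # $4.38 with caching
print(f"{savings:.0%} saved")              # 68% saved
```

The savings grow with the share of each call that is static, which is why long, stable system prompts benefit most.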
Compress and Summarize Context
Instead of carrying full conversation history forward, summarize it. Instead of injecting entire documents, use retrieval to pull only relevant chunks. A well-tuned RAG system that injects 2,000 tokens of relevant context will often outperform one that injects 20,000 tokens of a full document — and at a tenth of the cost.
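As an illustration of the history side of this, the sketch below keeps a short summary plus only the most recent messages that fit a token budget. The token estimate reuses the rough 0.75 words-per-token ratio. The function names are hypothetical, and producing the summary itself (for example with a cheaper model) is outside the sketch.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: about 0.75 words per token."""
    return round(len(text.split()) / 0.75)


def trim_history(messages: list[str], summary: str, budget: int) -> list[str]:
    """Keep the newest messages that fit in `budget` tokens, after a summary.

    The summary is always included and stands in for the older turns.
    """
    remaining = budget - estimate_tokens(summary)
    kept: list[str] = []
    for message in reversed(messages):      # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    return [summary] + kept[::-1]           # restore chronological order


history = [("one " * 30).strip(), ("two " * 30).strip(), ("three " * 30).strip()]
context = trim_history(history, "earlier turns summarized", budget=90)
print(len(context))  # 3: the summary plus the two newest messages
```

A real tokenizer would give exact counts. The word-based estimate only serves to keep the example dependency-free.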
Set Hard Token Budgets
Every workflow should have a maximum token budget that triggers an alert or stops execution if exceeded. This prevents runaway loops and surfaces issues in testing before they become production problems.
Most orchestration frameworks support this at the agent level. If yours doesn’t, it’s worth building a lightweight wrapper that tracks token usage per run and kills the task if it exceeds a threshold.
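A lightweight wrapper of that kind might look like the following. This is a minimal sketch, not tied to any framework: the class names, the 80% alert threshold, and the token counts in the simulated loop are all assumptions. In practice you would feed `record` the usage numbers your provider returns with each response.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a run spends more tokens than it was allotted."""


class TokenBudget:
    def __init__(self, max_tokens: int, alert_fraction: float = 0.8):
        self.max_tokens = max_tokens
        self.alert_at = max_tokens * alert_fraction
        self.used = 0
        self.alerted = False

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one call's usage; warn near the limit, stop once past it."""
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(
                f"used {self.used} of {self.max_tokens} tokens")
        if not self.alerted and self.used >= self.alert_at:
            self.alerted = True
            print(f"warning: {self.used}/{self.max_tokens} tokens used")


budget = TokenBudget(max_tokens=50_000)
calls_made = 0
try:
    while True:                      # stand-in for a misbehaving agent loop
        budget.record(input_tokens=10_000, output_tokens=500)
        calls_made += 1
except TokenBudgetExceeded as err:
    print(f"stopped after {calls_made} calls: {err}")
# warning: 42000/50000 tokens used
# stopped after 4 calls: used 52500 of 50000 tokens
```

Because usage is recorded after each call returns, the budget can be overshot by one call. Checking an estimate before sending the request would tighten that at the cost of some accuracy.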
Separate Dev, Staging, and Production
Use smaller, cheaper models in development. Test with synthetic data of realistic volume. Only run full-scale tests against production-tier models when you’re validating something specific. This alone can cut engineering team spend significantly.
Build a Cost Dashboard
You can’t manage what you can’t measure. Set up token usage tracking per workflow, per team, and per use case. Most API providers offer usage export APIs. A simple dashboard showing daily cost trends by category will surface problems before they become crises.
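The core of such a dashboard is a group-by over tagged usage records. The sketch below uses made-up records and a single flat price of $2.50 per million tokens for simplicity. The field names, workflow names, and token counts are hypothetical, and real data would come from a provider export or from logging each call with workflow and team tags.

```python
from collections import defaultdict

# Hypothetical usage records for illustration only.
usage_log = [
    {"workflow": "email-triage", "team": "sales", "tokens": 6_000_000},
    {"workflow": "ticket-router", "team": "support", "tokens": 1_500_000},
    {"workflow": "email-triage", "team": "sales", "tokens": 5_500_000},
    {"workflow": "report-summary", "team": "finance", "tokens": 500_000},
]


def cost_by(records, key, price_per_m):
    """Sum dollar cost per value of `key`, most expensive first."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["tokens"] * price_per_m / 1_000_000
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for workflow, dollars in cost_by(usage_log, "workflow", 2.50):
    print(f"{workflow:<16} ${dollars:>7.2f}")
```

Passing `"team"` instead of `"workflow"` gives the per-team view from the same records.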
How MindStudio Approaches Token Cost Management
One of the structural challenges with token cost management is that most teams are stitching together LLM calls across multiple frameworks, models, and tools — which makes tracking and optimization difficult.
MindStudio’s visual workflow builder takes a different approach. Because every workflow is defined declaratively — not written in ad hoc code — the platform has visibility into every model call happening in a workflow. That means you can see token usage per step, identify which parts of a workflow are expensive, and swap models without rewriting anything.
The model-switching feature is particularly useful for cost management. You can have a workflow that uses Claude Sonnet for a complex reasoning step and routes a simpler classification step to Claude Haiku — all in the same workflow, with no extra code. Since MindStudio gives you access to 200+ models without separate API keys, you’re not locked into one provider’s pricing.
For teams running high-volume automation workflows, this kind of granular control makes a significant difference. You can build and deploy a cost-conscious agentic workflow in well under an hour, starting from MindStudio’s free tier, without needing to manage your own prompt caching or cost tracking infrastructure.
Frequently Asked Questions
What are AI tokens and why do they cost money?
Tokens are the units LLMs use to process text. One token is roughly 0.75 words in English. AI providers charge per token for both input (what you send to the model) and output (what the model generates). Costs are typically quoted per million tokens. The charges seem small in isolation but add up quickly at scale or in multi-step workflows.
Why did Uber spend so much on AI tokens?
Uber’s engineers were building and testing complex, multi-step AI workflows — the kind that make dozens of LLM calls per task. With hundreds of engineers working on AI tooling simultaneously, and without strong cost attribution or per-engineer limits, usage accumulated rapidly. Individual engineers were spending $500–$2,000/month each, and teams burned through quarterly budgets in months.
Are agentic AI workflows really that much more expensive than simple API calls?
Yes, significantly. A single-turn API call makes one LLM request. An agentic workflow might make 5–20 LLM calls to complete a single task, each carrying the full context window (system prompt, history, documents). At any meaningful volume, this compounds dramatically — a 10-step agent handling 1,000 tasks/day is making 10,000+ LLM calls daily.
What’s the most effective way to reduce AI token costs?
The highest-impact actions are: (1) right-sizing models — use smaller, cheaper models for simpler subtasks; (2) implementing prompt caching to avoid being charged full price for repeated static context; and (3) compressing context windows by summarizing history and using retrieval instead of full document injection. Together, these can reduce costs by 60–80% without meaningful quality loss.
How do I know which workflows are costing the most?
You need cost attribution at the workflow level. Most API providers offer usage APIs or CSV exports. Build or buy a simple dashboard that breaks down token usage by workflow, team, and time period. Without this visibility, it’s nearly impossible to prioritize optimization work. Many teams discover that 20% of their workflows are responsible for 80% of their costs.
Is token cost optimization worth the engineering effort?
At low volumes, no — the economics don’t justify it. But once you’re running production AI workflows at scale (millions of tokens per day), even a 50% reduction in costs can save thousands of dollars per month. The optimizations covered here — model routing, caching, context compression — are also generally good engineering practices that improve latency and reliability, not just cost.
Key Takeaways
- Token costs are negligible for single queries but compound dramatically in agentic workflows where each task triggers 5–20 LLM calls, each carrying full context
- Uber’s cost crisis was driven by multi-agent architectures, unoptimized system prompts, no per-team budgets, and mixing development/production spending
- The six main cost drivers are: over-using frontier models, bloated system prompts, missing caching, no token budgets, combined dev/prod billing, and no cost attribution
- The most impactful fixes are model right-sizing (15–30x cost difference between tiers), prompt caching (60–80% savings on repeated context), and context compression
- Cost management requires visibility — you can’t fix what you can’t measure
If you’re building AI workflows and want cost-efficient model routing built into the platform from the start, try MindStudio free at mindstudio.ai.
