Speculative Decoding Explained: How Draft Models Make AI Agents Faster

Why AI Agents Feel Slow (And What’s Actually Fixing It)

If you’ve ever watched an AI agent work through a multi-step task and wondered why it takes so long, the answer usually comes down to one thing: token generation. Every token an LLM produces requires a full forward pass through a massive neural network. Do that hundreds of times per response, across dozens of steps in an agentic workflow, and the latency adds up fast.

Speculative decoding is one of the most effective techniques researchers and engineers have found to address this problem — and it’s increasingly baked into the infrastructure powering AI agents at scale. This article explains how it works, why it matters for agentic systems specifically, and what it means for teams building with large language models today.

The Core Bottleneck: How LLMs Generate Text

To understand speculative decoding, you need a clear picture of what’s slow in the first place.

Large language models generate text one token at a time. A token is roughly a word or word fragment — “running” might be one token, “spectacularly” might be two. Each token requires the model to:

1. Take the entire context (prompt + all previously generated tokens) as input
2. Run that input through every layer of the neural network
3. Produce a probability distribution over possible next tokens
4. Sample from that distribution to select the next token

This process is called autoregressive generation. It’s sequential by design — you can’t generate token 10 until you have token 9.
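
The loop can be sketched in a few lines. This is a toy illustration rather than a real LLM: `toy_next_token_probs` is a made-up stand-in for a model’s forward pass, and the 5-token vocabulary is arbitrary.

```python
import random
from typing import List

VOCAB_SIZE = 5  # arbitrary toy vocabulary: tokens are the integers 0..4


def toy_next_token_probs(context: List[int]) -> List[float]:
    """Hypothetical stand-in for a forward pass over the whole context.

    Returns a probability distribution over the next token. A real model
    would run the context through every layer of the network here.
    """
    # Favor the token that follows the last one, cyclically.
    favored = (context[-1] + 1) % VOCAB_SIZE
    return [0.6 if t == favored else 0.1 for t in range(VOCAB_SIZE)]


def generate(prompt: List[int], n_new: int, seed: int = 0) -> List[int]:
    """Autoregressive generation: one 'forward pass' per new token."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_new):
        # Context in, probability distribution out.
        probs = toy_next_token_probs(tokens)
        # Sample the next token from that distribution.
        next_token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        # The new token becomes part of the context for the next step.
        tokens.append(next_token)
    return tokens


print(generate([0], n_new=10))
```

Note that each iteration needs the token produced by the previous one, which is why the loop cannot simply be run in parallel.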

Why Parallelism Doesn’t Solve This

Modern GPUs are built for parallel computation. Training parallelizes well: the full text is known in advance, so every position in a sequence, and every sequence in a batch, can be processed at once. But inference is different.

During generation, each new token depends on the previous one. That dependency chain means you can’t simply throw more GPU cores at the problem the same way you can during training. Each step is also typically limited by memory bandwidth rather than arithmetic: the GPU has to read the model’s weights on every step just to produce a single token, so much of its compute capacity goes unused.

This is the fundamental constraint speculative decoding works around — without breaking that dependency.

What Speculative Decoding Actually Does

Speculative decoding was formalized in research from Google and DeepMind in 2023, though variations of the idea appeared earlier. The core insight is elegant: instead of having the large model generate every token itself, use a smaller, faster model to guess what the large model would say, then let the large model verify multiple guesses at once.

Here’s the sequence:

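1. A draft model (small and fast) generates a sequence of candidate tokens, say 4–8 tokens ahead
2. The target model (large and accurate) evaluates all those candidate tokens in a single forward pass
3. The target model accepts or rejects each candidate in order. With greedy decoding, a candidate is accepted if it matches the token the target would have picked; with sampling, it is accepted with a probability that depends on how the two models’ probabilities for that token compare
4. Accepted tokens are kept. The first rejected token is replaced with a corrected sample from the target model, any later candidates are discarded, and the process restarts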

The key is step 2. The large model can evaluate multiple tokens in parallel during a single forward pass, because at verification time, all the draft tokens are already known — they’re just being checked, not generated from scratch. This means you can potentially get 4–8 tokens for roughly the cost of 1–2 large model forward passes.
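
The draft-and-verify cycle can be sketched with two toy “models”. Everything here is an assumption made for illustration: `target_next` and `draft_next` are hypothetical deterministic functions, decoding is greedy, and the verification that a real system performs in one batched forward pass is simulated by calling the toy target once per position.

```python
from typing import List, Tuple


def target_next(ctx: List[int]) -> int:
    """Toy 'large' model: greedy next token from an arbitrary rule."""
    return (ctx[-1] * 3 + len(ctx)) % 11


def draft_next(ctx: List[int]) -> int:
    """Toy 'small' model: agrees with the target except at every 4th position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 11


def plain_greedy(prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_greedy(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    out, target_passes = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        # Draft k candidate tokens with the small model.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Verify: the target's own choice at each draft position, plus one
        # position beyond. Counted as a single pass, as in a batched system.
        target_passes += 1
        verified = [target_next(out + draft[:i]) for i in range(k + 1)]
        # Accept the longest prefix where draft and target agree.
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        # Keep accepted tokens, then the target's token at the first
        # mismatch (or a bonus token if every draft token was accepted).
        out.extend(draft[:n_ok])
        out.append(verified[n_ok])
    return out[: len(prompt) + n_new], target_passes


tokens, passes = speculative_greedy([1, 2], n_new=20)
print(tokens == plain_greedy([1, 2], n_new=20), passes)
```

The output sequence matches plain greedy decoding with the target, but it is reached in fewer target passes than the 20 that token-by-token generation would need.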

The Acceptance Rate Problem

This only works well if the draft model is actually good at predicting what the large model would say. If the draft model produces tokens the large model constantly rejects, you’re doing extra work for no gain.

The acceptance rate — the fraction of draft tokens the large model approves — is the key performance variable. A well-matched draft model on the right type of content can achieve acceptance rates above 80%, leading to 2–3x speedups. A poorly matched draft model might barely beat naive generation.
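
To get a feel for how strongly the acceptance rate drives the payoff, here is a small simulation. It rests on a simplifying assumption that real workloads only approximate: every draft token is accepted independently with the same probability. The function name and the chosen rates are illustrative.

```python
import random


def simulate_tokens_per_cycle(accept_prob: float, draft_len: int,
                              cycles: int = 20_000, seed: int = 0) -> float:
    """Average tokens produced per target-model pass in a simplified simulation."""
    rng = random.Random(seed)
    total = 0
    for _ in range(cycles):
        accepted = 0
        # Acceptance stops at the first rejected draft token.
        while accepted < draft_len and rng.random() < accept_prob:
            accepted += 1
        # +1: the target always supplies one token itself in each cycle.
        total += accepted + 1
    return total / cycles


for rate in (0.3, 0.5, 0.8, 0.9):
    avg = simulate_tokens_per_cycle(rate, draft_len=5)
    print(f"acceptance {rate:.0%}: ~{avg:.2f} tokens per target pass")
```

At low acceptance rates the target produces barely more than one token per pass, so most of the drafting work is wasted.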

This is why speculative decoding works especially well in domains where output is somewhat predictable: code generation, structured data extraction, templated text, and repeated patterns. These are also, notably, very common tasks in enterprise AI agents.

What Happens When a Token Is Rejected

Rejection isn’t catastrophic — it’s handled gracefully. When the target model rejects a draft token at position k, it:

Discards all draft tokens from position k onward
Samples its own token at position k using its full probability distribution
Restarts the draft process from that new token

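With the standard acceptance rule, the final output follows exactly the same probability distribution as the target model’s own output, and under greedy decoding it is the same token sequence. This is the critical property: speculative decoding is lossless. You get the same output quality, just faster, at least when the implementation applies the acceptance rule exactly rather than relaxing it for extra speed.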
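
The acceptance test behind that guarantee, as described in the speculative sampling literature, can be sketched as follows. A draft token `x` is accepted with probability min(1, p[x] / q[x]); on rejection, the replacement is drawn from the renormalized difference max(0, p − q). The distributions `p` and `q` below are made-up numbers, and the helper names are illustrative.

```python
import random
from typing import List


def verify_draft_token(x: int, p: List[float], q: List[float],
                       rng: random.Random) -> int:
    """Accept draft token x (sampled from q) or resample from the target.

    p: target model's distribution at this position
    q: draft model's distribution at this position
    """
    # Accept with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: sample from the renormalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual, k=1)[0]


def exact_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Distribution of the emitted token, computed analytically."""
    # Probability that x is drafted and accepted: q[x] * min(1, p[x]/q[x]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p == q: nothing is ever rejected
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


p = [0.6, 0.3, 0.1]  # made-up target distribution
q = [0.3, 0.5, 0.2]  # made-up draft distribution
print(exact_output_distribution(p, q))
```

The computed distribution matches `p` up to floating-point rounding, even though every candidate token came from `q`. That is the sense in which the method is lossless.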

The Math Behind the Speedup

You don’t need to work through all the derivations, but a rough intuition helps.

Let’s say the large model takes 100ms per forward pass. A 7B draft model might take 5ms per token. Suppose you draft 5 tokens and each one has a 75% chance of being accepted. Because everything after the first rejection is discarded, you don’t get 5 × 0.75 = 3.75 accepted tokens; the expected count is 0.75 + 0.75^2 + ... + 0.75^5 ≈ 2.3. The verification pass always contributes one more token (the correction, or a bonus token if every draft token was accepted), for roughly 3.3 tokens per cycle.

Draft cost: 5 tokens × 5ms = 25ms
Verification cost: 1 forward pass of large model = 100ms
Total: 125ms for ~3.3 tokens
Effective cost per token: ~38ms vs. original 100ms

That’s roughly a 2.6x speedup in this simplified example. Real-world numbers vary significantly based on hardware, model sizes, acceptance rates, and implementation details, but 2–3x is a commonly cited range for well-matched model pairs on typical tasks.
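
The same arithmetic can be written as a small function. This sketch assumes each draft token is accepted independently with the same probability, and that verifying a draft costs the same as one ordinary forward pass; both are simplifications. It accounts for acceptance stopping at the first rejection, which makes it more conservative than multiplying draft length by acceptance rate. The function names are illustrative.

```python
from typing import Tuple


def expected_tokens_per_cycle(accept_prob: float, draft_len: int) -> float:
    """Expected tokens gained per draft-and-verify cycle.

    Draft token i survives only if tokens 1..i were all accepted, so the
    expected accepted count is accept_prob + accept_prob**2 + ...; the
    verification pass adds one token of its own (the i = 0 term).
    """
    return sum(accept_prob ** i for i in range(draft_len + 1))


def estimate_speedup(accept_prob: float, draft_len: int,
                     draft_ms: float, target_ms: float) -> Tuple[float, float, float]:
    """Return (tokens per cycle, ms per token, speedup over plain decoding)."""
    # One cycle = draft_len draft steps followed by one verification pass.
    cycle_ms = draft_len * draft_ms + target_ms
    tokens = expected_tokens_per_cycle(accept_prob, draft_len)
    ms_per_token = cycle_ms / tokens
    return tokens, ms_per_token, target_ms / ms_per_token


tokens, ms_per_token, speedup = estimate_speedup(0.75, 5, draft_ms=5, target_ms=100)
print(f"{tokens:.2f} tokens per cycle, {ms_per_token:.0f} ms per token, {speedup:.1f}x speedup")
```

With the inputs from the example (100ms target, 5ms draft, 5 drafted tokens, 75% acceptance) this prints about 3.29 tokens per cycle, 38 ms per token, and a 2.6x speedup.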

For agents running dozens of LLM calls per workflow, that 2–3x multiplier applies to every call, so the savings add up. If token generation dominates the runtime, a workflow that took 90 seconds might complete in 30–45 seconds. That’s the difference between a tool that feels responsive and one that feels like waiting.

Draft Models: What They Are and How They’re Chosen

The draft model is the component that does the speculative guessing. Choosing or training the right draft model is where most of the engineering challenge lives.

Off-the-Shelf Draft Models

Some model families lend themselves to speculative decoding because they ship at multiple sizes. Google’s Gemini family includes models at multiple scales, and Meta released smaller Llama models that work as drafters for larger Llama variants. Closed providers such as Anthropic don’t publish their serving internals, so whether a hosted model like Claude runs with an internal draft model can’t be confirmed from the outside.

The general principle: the draft model should be from the same “family” as the target model, trained on similar data distributions. A CodeLlama 7B draft model will be better at predicting CodeLlama 70B outputs than a general-purpose small model, because their output styles are more aligned. In the standard setup the two models also need to share a tokenizer, since the target verifies the draft’s token IDs directly.

Self-Speculative Decoding

A variation called self-speculative decoding uses the target model itself to draft — specifically by skipping certain layers during the draft phase and running the full model for verification. This avoids the need for a separate draft model entirely, at the cost of some complexity in implementation.

Medusa and Multi-Head Approaches

Another approach, popularized by the Medusa research, adds extra prediction heads to the target model. Each head predicts tokens at different positions ahead simultaneously. This is similar in spirit to speculative decoding but doesn’t require a separate model — the extra heads are lightweight additions trained on top of the base model.

These variants matter because they affect what’s possible on different deployment setups. Not every team running AI agents has the infrastructure to manage two separate model serving endpoints. Self-speculative and Medusa-style approaches lower that barrier.

Why This Matters Specifically for AI Agents

Single-shot chatbot interactions benefit from speculative decoding, but the compounding effect on agentic workflows is where it really counts.

Latency Compounds Across Steps

An AI agent doesn’t make one LLM call; it makes many. A research agent might: interpret a query, generate search terms, evaluate results, summarize findings, draft a response, and review that draft. Each step involves one or more model calls. If each call is 2–3x faster and model calls dominate the runtime, the pipeline as a whole gets close to 2–3x faster. Time spent in tools, searches, and network requests is unaffected.

For synchronous workflows where a human is waiting on the result, this is the difference between a 30-second wait and a 10-second wait. For asynchronous background agents, it means more work completed per unit of compute.
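
How much of that gain reaches the end user depends on how much of the workflow is actually spent generating tokens. A rough sketch, where the 30-second baseline and 3x figure echo the example above and the generation fractions are made-up inputs:

```python
def workflow_time(total_s: float, decode_fraction: float, decode_speedup: float) -> float:
    """End-to-end time when only the token-generation share gets faster."""
    decode_s = total_s * decode_fraction
    other_s = total_s - decode_s  # tool calls, network, prompt processing
    return other_s + decode_s / decode_speedup


for fraction in (1.0, 0.8, 0.5):
    t = workflow_time(30.0, fraction, decode_speedup=3.0)
    print(f"{fraction:.0%} of time in generation: 30s -> {t:.0f}s")
```

Under these assumptions a 30-second workflow drops to 10 seconds only if all of its time is generation. At 80% it lands at 14 seconds, and at 50% at 20 seconds.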

Tool Use and Reasoning Loops

Modern agents spend a lot of time generating structured output: function call arguments, JSON payloads, reasoning traces. These outputs tend to be highly predictable — a function that always takes the same argument structure will produce tokens that a well-trained draft model can anticipate reliably.

This is why agents tend to see better acceptance rates than general text generation. The structured, repetitive nature of agent outputs is exactly the condition where draft models shine.

Cost Efficiency

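Faster generation doesn’t just mean better user experience; it can also mean more efficient use of compute. Speculative decoding puts otherwise unused GPU capacity to work, so at low to moderate load the same hardware can complete more requests per hour, which reduces cost per call. The gain is smaller on servers that are already saturated with large batches, because verification work then competes with other requests. For teams running high-volume agentic workloads, it is worth measuring throughput as well as latency before counting on infrastructure savings.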

Speculative Decoding in Production: The Current State

Speculative decoding has moved from research paper to production deployment across several major inference providers.

Where It’s Deployed

Anthropic: Does not publicly document how Claude models are served, so whether speculative decoding (for example, Claude Haiku drafting for Sonnet/Opus) contributes to their latency is unconfirmed.
Google DeepMind: Has published research on speculative decoding and integrates it in Gemini’s serving infrastructure.
vLLM and TGI: Both popular open-source inference servers support speculative decoding, making it accessible to teams self-hosting models.
Groq: Their custom LPU hardware achieves extremely low latency in part through techniques related to efficient token generation.

Limitations to Know

Speculative decoding isn’t universally applicable. A few scenarios where it underperforms:

Highly creative or unpredictable generation: When output is hard to predict (high temperature, unusual prompts), draft acceptance rates drop and the benefit shrinks.
Very short responses: The overhead of drafting adds cost that only pays off over longer sequences.
Mismatched model families: A draft model trained on different data than the target will have low acceptance rates.
Memory-constrained environments: Running two models simultaneously requires more GPU memory than running one.

For most enterprise agent workloads — structured tasks, consistent prompt patterns, moderate-length outputs — these limitations rarely apply.
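
The memory point is easy to estimate for weights alone. This back-of-the-envelope sketch assumes 16-bit weights (2 bytes per parameter) and ignores KV caches and activations; the 70B and 7B sizes are hypothetical.

```python
def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


target_gb = weights_gb(70)  # hypothetical 70B target model
draft_gb = weights_gb(7)    # hypothetical 7B draft model
print(f"target: {target_gb:.0f} GB, draft: {draft_gb:.0f} GB "
      f"(+{draft_gb / target_gb:.0%} for weights, before KV caches)")
```

A draft that is a tenth of the target’s size adds about 10% to weight memory, which is modest on a large server but can matter on a single GPU that is already full.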

How MindStudio Handles Model Speed and Selection

When you’re building AI agents, you generally shouldn’t have to think about speculative decoding at all. The infrastructure handling it should be invisible.

That’s the approach MindStudio takes. The platform gives you access to over 200 AI models — including Claude, GPT-4o, Gemini, and others — through a single interface, without needing separate API accounts or managing inference infrastructure yourself. The speed optimizations (including speculative decoding where providers implement it) are handled at the provider level.

What this means practically: if you’re building an agent that needs to be fast, you can experiment with different model configurations directly in MindStudio’s visual builder. You might run Claude Haiku for quick classification steps and Sonnet for deeper reasoning steps within the same workflow. This is model routing rather than speculative decoding, since each step’s output comes from whichever model you picked, but it applies the same intuition: use a small model wherever it is good enough.

For teams running high-volume agent workflows, MindStudio also supports background agents that run on a schedule, webhook-triggered agents, and agentic pipelines that chain multiple models and tools together. Latency improvements from techniques like speculative decoding directly improve how quickly these workflows complete and how much compute they consume.

You can start building on MindStudio for free at mindstudio.ai — no API keys required to get started.

Frequently Asked Questions

What is speculative decoding in simple terms?

Speculative decoding is a technique that speeds up large language model output by using a small, fast model to guess what the large model would say, then having the large model verify those guesses in one step. Correct guesses are kept; incorrect ones are replaced. The result is faster output with no loss in quality.

Does speculative decoding change the quality of AI outputs?

No. When implemented correctly, speculative decoding is mathematically equivalent to having the large (target) model generate every token directly. The verification step ensures that any draft token the target model wouldn’t have produced gets corrected. You get the same output quality, just generated faster.

How much faster does speculative decoding make AI models?

Typical speedups are in the 2–3x range for well-matched model pairs on appropriate tasks. Structured outputs like code and JSON tend to see higher speedups (sometimes 3x or more). Creative or unpredictable text generation sees lower gains. Real-world numbers depend heavily on the specific models, hardware, and task type.

What’s the difference between speculative decoding and just using a smaller model?

Using a small model directly gives you fast output but lower quality — you’re accepting the small model’s capabilities, not the large model’s. Speculative decoding uses the small model only as a draft generator. The large model still controls the final output and corrects any mistakes. You get the large model’s quality at a speed closer to the small model’s.

Do I need to set up speculative decoding myself to benefit from it?

Usually not. Hosted model providers apply inference optimizations like this at the infrastructure level, where they are transparent to API users, though most do not disclose exactly which techniques they run. If you’re self-hosting models with frameworks like vLLM, you can configure speculative decoding directly. For most teams using hosted model APIs, there is nothing to set up.

Why is speculative decoding especially useful for AI agents?

AI agents make many sequential LLM calls — each step in a workflow is a separate inference request. Faster generation at each step compounds across the whole workflow, making the overall agent significantly more responsive. Agents also tend to generate structured, predictable output (JSON, function calls, templated text), which is exactly the condition where draft models achieve high acceptance rates and maximum speedup.

Key Takeaways

Speculative decoding solves the sequential token generation bottleneck in LLMs by letting a small draft model predict tokens that a large model then verifies in parallel.
The output quality is identical to running the large model alone — only the speed changes.
Typical speedups are 2–3x, with higher gains on structured or predictable outputs.
AI agents benefit more than single-turn chat because latency compounds across multi-step workflows.
Many inference providers and open-source serving frameworks already support this technique, and hosted APIs apply such optimizations transparently, so you may already be benefiting from it.
When building agents, choosing the right model for each workflow step — balancing speed and capability — has a direct effect on how fast and cost-efficient your agents run.

If you’re building AI agents and want a platform that handles model selection, infrastructure, and workflow orchestration without requiring deep ML engineering knowledge, MindStudio is worth exploring. The average agent build takes under an hour, and all major models are available out of the box.