What Is Mercury 2? The Diffusion-Based Language Model That Runs 5x Faster
Mercury 2 from Inception Labs uses a diffusion process instead of autoregressive token generation, claiming 5x faster speeds than Claude Haiku.
Speed Has Always Been the Hidden Tax on AI
Every second a model takes to respond costs something — user patience, compute budget, or both. Autoregressive language models generate one token at a time, left to right, sequentially. That’s worked well enough, but it’s also a fundamental ceiling on how fast they can go.
Mercury 2 from Inception Labs takes a different approach entirely. Instead of sequential token generation, it uses a diffusion process — the same class of technique that powers image generators like Stable Diffusion — applied to language. The result is a model that Inception Labs claims runs roughly 5x faster than Claude Haiku 3.5 at comparable quality.
This article breaks down what Mercury 2 actually is, how diffusion language models work under the hood, how Mercury 2 performs against autoregressive alternatives, and what kinds of workloads actually benefit from this architecture.
What Is Mercury 2?
Mercury 2 is a large language model developed by Inception Labs, a company founded by researchers from institutions including Stanford and Carnegie Mellon. It’s the second generation of their Mercury model family, which also includes Mercury Coder 2 — a variant optimized specifically for code generation tasks.
Hire a contractor. Not another power tool.
Cursor, Bolt, Lovable, v0 are tools. You still run the project.
With Remy, the project runs itself.
The core differentiator is the architecture. Mercury 2 is a diffusion language model (often abbreviated dLLM), which means it doesn’t generate tokens one by one in a fixed left-to-right order. Instead, it starts with a “noisy” or masked sequence and iteratively refines it toward a coherent output — all tokens being updated in parallel across multiple passes.
Inception Labs positions Mercury 2 as a high-throughput model designed for applications where latency and speed are critical constraints: real-time coding assistants, high-volume document processing, agentic pipelines that chain multiple model calls, and any scenario where waiting on tokens is the bottleneck.
How Diffusion Language Models Actually Work
To understand Mercury 2, it helps to first understand why most LLMs are slow — and what diffusion does differently.
The Autoregressive Bottleneck
Standard language models like GPT-4o, Claude, or Llama are autoregressive. They generate text by predicting one token at a time, where each new token depends on all the tokens before it. This is sequential by design: you can’t generate token 50 until you’ve generated tokens 1 through 49.
That sequential dependency is both a feature (it makes outputs coherent) and a constraint (it means generation speed is limited by the number of tokens in the output, not just model size or hardware).
Diffusion: The Alternative
Diffusion models work differently. In image generation, diffusion starts with random noise and iteratively denoises it into a final image. Many updates happen in parallel across the entire canvas.
Diffusion language models apply a similar idea to text. Instead of noise in the pixel sense, text diffusion typically uses masking: the model starts with a sequence of masked or randomly replaced tokens, then progressively “fills in” the correct tokens over multiple denoising steps.
Because the model operates on the entire sequence at once — rather than left to right — many tokens can be resolved simultaneously. This is where the speed gains come from.
The Masking Process in Mercury 2
Mercury 2 uses a variant called masked diffusion (sometimes called absorbing diffusion). Here’s the simplified version of how it works:
- Start with a fully masked sequence of the target length.
- Run the model forward, which predicts confidence scores for each masked token.
- Unmask the tokens the model is most confident about.
- Repeat for the remaining masked tokens over several steps.
- After enough denoising steps, the full sequence is filled in.
This is fundamentally different from autoregression. The model doesn’t have to wait for the first half of the sentence to be “done” before working on the second half. Everything is being refined in parallel.
The tradeoff is that diffusion models typically require multiple forward passes (denoising steps) to produce a final output, whereas an autoregressive model only makes one pass through the network (though it does this once per token). The question is which is faster overall — and for Mercury 2, the parallel nature of diffusion wins decisively on throughput.
Mercury 2 vs. Autoregressive Models: The Speed Claims
Inception Labs claims Mercury 2 achieves around 1,000+ tokens per second — compared to roughly 200–250 tokens per second for Claude Haiku 3.5 under typical API conditions. That’s the basis for the “5x faster” headline.
Speed comparisons in LLMs are always context-dependent. Here are the main dimensions worth thinking through:
Throughput vs. Latency
- ✕a coding agent
- ✕no-code
- ✕vibe coding
- ✕a faster Cursor
The one that tells the coding agents what to build.
Mercury 2’s advantage is primarily in throughput — total tokens produced per second at scale. In batch processing scenarios, agentic pipelines, or high-volume API use, this matters a lot.
Latency to first token (time before any output appears) is a different metric. Autoregressive models can start streaming output almost immediately. Diffusion models, depending on implementation, may need to run several denoising passes before producing any visible output. Inception Labs has made progress on streaming Mercury 2 outputs, but it’s worth checking current API behavior if time-to-first-token matters for your use case.
Benchmark Comparisons
On coding benchmarks — where Mercury Coder 2 has been most prominently evaluated — Inception Labs reports competitive performance with models like Claude Haiku 3.5 and GPT-4o mini on tasks like HumanEval and SWE-bench lite variants. The model isn’t matching the largest frontier models (GPT-4o, Claude Sonnet 3.7), but it holds its own in the Haiku/mini tier while running significantly faster.
Independent evaluations have generally confirmed that Mercury 2 is genuinely fast and competitive at its tier, though benchmark performance varies by task type. General reasoning and instruction-following quality can lag behind top autoregressive models at similar capability levels.
Cost Efficiency
Because Mercury 2 generates more tokens per second per unit of compute, it tends to be cheaper to run at scale. This makes it attractive for high-volume applications — think processing thousands of documents, running automated test generation, or powering a coding assistant with many concurrent users.
Where Mercury 2 Shines (and Where It Doesn’t)
Like any model architecture, diffusion language models have real strengths and real limitations. Here’s an honest breakdown.
Strengths
High-throughput workloads. If you’re running hundreds or thousands of model calls in a pipeline, the speed advantage compounds quickly. Tasks that might take hours with a standard model can complete in a fraction of the time.
Code generation. Mercury Coder 2 has been specifically tuned for code tasks and performs well on completion, generation, and fill-in-the-middle tasks. The parallel nature of diffusion also suits code well — code often has structure across multiple positions that can be inferred simultaneously.
Cost-sensitive applications. For teams running AI features at scale, lower cost-per-token matters. Mercury 2’s throughput advantage translates directly into lower API bills at volume.
Agentic pipelines. In multi-step AI workflows where one model call triggers another, speed bottlenecks accumulate. A 5x faster model at each step doesn’t just save time — it changes what kinds of pipelines are practical to build.
Limitations
Quality ceiling. Mercury 2 is in the Haiku/mini tier, not the frontier tier. For complex reasoning, nuanced instruction-following, or tasks requiring deep world knowledge, GPT-4o or Claude Sonnet will produce better results.
Output coherence on long generations. Diffusion models can occasionally produce outputs with subtle incoherence — especially in long-form generation — because the parallel nature means some tokens are resolved without full context from later parts of the sequence. This is a known challenge for dLLMs that Inception Labs continues to work on.
Streaming behavior. Depending on your use case, the denoising-then-stream model may feel different from the token-by-token streaming of autoregressive models. For chat interfaces where users watch text appear in real time, this can affect perceived responsiveness.
Built like a system. Not vibe-coded.
Remy manages the project — every layer architected, not stitched together at the last second.
Ecosystem maturity. Diffusion language models are newer than autoregressive LLMs. The tooling, fine-tuning support, and community resources around Mercury 2 are thinner than what you’d find for GPT-4o or Claude.
Why Diffusion for Language? The Research Behind It
The idea of applying diffusion to language isn’t new — researchers have explored it since at least 2022 — but getting it to work well at the quality level of modern autoregressive models has been challenging.
The core insight from Inception Labs and other groups is that text diffusion can work with discrete tokens (as opposed to continuous embeddings) using masking as the “noise” process. This is sometimes called discrete diffusion or absorbing diffusion, and it sidesteps some of the issues with continuous diffusion applied to language.
Research from groups like Google Brain and MIT has demonstrated that masked diffusion LMs can achieve competitive perplexity scores on standard language modeling benchmarks, establishing the theoretical grounding that models like Mercury 2 build on.
Inception Labs’ contribution is scaling this approach — training larger diffusion LMs with better quality, faster inference, and practical usability through an API.
How to Access Mercury 2
Mercury 2 is available through Inception Labs’ API. Access options include:
- API access for developers building applications directly on the model.
- Mercury Coder 2 specifically for code-focused use cases, available through the same API.
- Integration into platforms that support custom or third-party model endpoints.
The API follows standard REST conventions and supports OpenAI-compatible formatting, making it relatively straightforward to swap in for existing code that calls autoregressive models.
Building Faster AI Workflows With MindStudio
Speed gains from a model like Mercury 2 matter most in context — specifically, in multi-step workflows where model calls are chained together. A single call being 5x faster is useful; an entire agent pipeline being 5x faster is significant.
MindStudio is a no-code platform that lets you build exactly those kinds of pipelines — AI agents that make multiple model calls, process data, trigger integrations, and act across tools like HubSpot, Slack, Google Workspace, and more. And because MindStudio gives you access to 200+ AI models out of the box, you can combine Mercury 2 (or any fast model) for the high-throughput steps in your workflow, while routing more complex reasoning tasks to GPT-4o or Claude Sonnet where quality matters more.
This kind of model routing — using different models for different tasks based on speed, cost, and quality requirements — is one of the most practical ways to get value from a model like Mercury 2. You don’t have to commit to running everything through one model. You can use Mercury 2 where it’s fast and cheap, and pull in a frontier model only when you need it.
Building that kind of workflow in MindStudio takes about 15 minutes to an hour using a visual, no-code editor — no API keys to juggle, no infrastructure to manage. You can try it free at mindstudio.ai.
Other agents ship a demo. Remy ships an app.
Real backend. Real database. Real auth. Real plumbing. Remy has it all.
If you’re a developer building more custom agentic systems, MindStudio’s Agent Skills Plugin (@mindstudio-ai/agent) lets you call 120+ typed capabilities from any agent framework — LangChain, CrewAI, Claude Code — as simple method calls, handling rate limiting and auth automatically.
Frequently Asked Questions
What is a diffusion language model?
A diffusion language model (dLLM) generates text by starting with a masked or noisy sequence and progressively filling in tokens over multiple passes, rather than generating tokens one at a time from left to right. This parallel approach allows the model to produce text faster, since many tokens can be resolved simultaneously instead of sequentially.
How does Mercury 2 compare to Claude Haiku 3.5?
Mercury 2 claims approximately 5x faster throughput than Claude Haiku 3.5 — roughly 1,000+ tokens per second versus 200–250 for Haiku. On quality benchmarks, particularly coding tasks, Mercury 2 is competitive with Haiku 3.5. However, for complex reasoning or nuanced instruction-following, Claude Haiku and similar autoregressive models may still produce better results. The choice depends on whether speed or output quality is the higher priority for your use case.
Is Mercury 2 good for code generation?
Mercury Coder 2, the coding-optimized variant, performs well on code generation benchmarks and is one of Mercury 2’s strongest use cases. The parallel token generation in diffusion models suits code tasks well because code often has structural patterns across multiple positions that can be inferred without strict left-to-right ordering.
What is the difference between Mercury and Mercury Coder 2?
Mercury 2 is the general-purpose language model. Mercury Coder 2 is a fine-tuned variant optimized for code-related tasks — generation, completion, fill-in-the-middle, and similar. If your primary use case is code, Mercury Coder 2 is the better choice. For general text tasks, Mercury 2 is the appropriate variant.
Does Mercury 2 support streaming output?
Inception Labs has implemented streaming support for Mercury 2, though the experience is slightly different from autoregressive models because of the denoising process. Tokens may appear in chunks rather than one at a time. For most API use cases this is fine, but for chat interfaces where users watch output appear character by character, it’s worth testing the streaming behavior against your UX requirements.
When should I not use Mercury 2?
Mercury 2 is not the right choice when you need frontier-level reasoning (complex analysis, nuanced argument construction, advanced math), when output quality is non-negotiable and speed is secondary, or when you’re working with tasks that require very long coherent outputs where diffusion models can occasionally introduce subtle inconsistencies. For those scenarios, models like Claude Sonnet, GPT-4o, or Gemini Ultra are better suited.
Key Takeaways
- Mercury 2 from Inception Labs is a diffusion language model that generates tokens in parallel rather than sequentially, producing a claimed 5x speed improvement over Claude Haiku 3.5.
- The diffusion approach uses masked token prediction across multiple denoising steps — fundamentally different from autoregressive generation.
- Mercury 2 is most valuable in high-throughput workloads: agentic pipelines, batch processing, code generation, and cost-sensitive API applications.
- Quality is competitive at the Haiku/mini tier but doesn’t match frontier models for complex reasoning tasks.
- The practical unlock is using Mercury 2 as part of a mixed-model strategy — routing speed-sensitive tasks to Mercury 2 while reserving more capable models for where quality matters most.
If you want to build that kind of multi-model workflow without managing infrastructure, MindStudio is worth trying. You can connect 200+ models, chain them in a visual workflow, and ship a working AI agent faster than it takes to read about the alternatives.