What Is Mercury 2? The Diffusion-Based Language Model That Runs 5x Faster Than Claude Haiku

A Different Approach to Language Generation

Most language models generate text one token at a time, left to right, and each token must be finished before the next can begin. It works, but it puts a hard ceiling on speed. Mercury 2 from Inception Labs drops that approach entirely. The result is a model that produces text in parallel, at a throughput reported to be around five times that of Claude Haiku.

That’s a meaningful difference. Mercury 2 is a diffusion-based language model, a category that applies ideas from image generation directly to text. If you’ve used Stable Diffusion or Midjourney, you’ve seen diffusion models at work in images. Mercury 2 is what happens when that same core mechanism gets applied to language — and it changes what’s possible for latency-sensitive applications.

This article breaks down what Mercury 2 is, how its architecture actually works, where it outperforms traditional autoregressive models, and where the tradeoffs still exist.

What Is Inception Labs, and Why Build a Diffusion LLM?

Inception Labs is the company behind Mercury. Founded by researchers with roots at Stanford and elsewhere in the machine learning community, the company was built around a specific thesis: autoregressive generation is a bottleneck, and the standard token-by-token architecture has fundamental speed limits that can’t be engineered away.

Their answer was to build a family of large language models from scratch using diffusion as the core generation mechanism — not as a bolt-on, but as the foundational architecture.

Mercury is that model family. Mercury 2 is their second-generation release, building on the original Mercury Coder models with improved quality, broader capability, and better benchmark performance. The family currently includes coding-focused variants and general-purpose models at different parameter scales.

The core argument Inception Labs makes is straightforward: if you need high throughput — many requests, low latency, fast output — diffusion-based generation offers structural advantages that autoregressive models simply can’t match at equivalent quality levels.

How Diffusion Models Work for Text

To understand Mercury 2, it helps to understand what diffusion actually means in the context of language.

The Image Diffusion Background

In image generation, diffusion works through a two-phase process. During training, noise is progressively added to images until they become pure noise. The model learns to reverse that process — given a noisy image, predict the less noisy version. At inference, you start from random noise and run the denoising process repeatedly until a coherent image emerges.

The key feature: the model processes the entire image at once in each denoising step. It doesn’t generate pixel by pixel from left to right.

Masked Diffusion for Language

Text is discrete, not continuous like pixel values, so you can’t add Gaussian noise to a sentence the same way you’d corrupt an image. Instead, diffusion language models typically use masked diffusion — a process where tokens are randomly masked (replaced with a special [MASK] token) during training, and the model learns to predict all masked positions simultaneously.

At inference time, Mercury 2 starts with a sequence that’s entirely masked and iteratively refines it. In each pass, some tokens get “revealed” — committed to specific values — while others remain uncertain and continue to be refined. After a set number of denoising steps, the full sequence is complete.

The critical difference from autoregressive generation: Mercury 2 can predict multiple tokens in parallel in each step. It isn’t blocked waiting for token N before it can predict token N+1.
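
To make the loop described above concrete, here is a toy sketch in Python. It is not Mercury 2's implementation: `toy_denoiser` is a hypothetical stand-in for the neural network (it simply looks up a fixed target sentence and attaches a random confidence), and committing the most confident predictions on each pass is one strategy from masked diffusion research, assumed here for illustration.

```python
import math
import random

MASK = "[MASK]"
TARGET = "the quick brown fox jumps over the lazy dog".split()


def toy_denoiser(sequence, rng):
    """Hypothetical stand-in for the model: for every masked position,
    return a (predicted_token, confidence) pair. A real model would run
    one forward pass over the whole sequence here."""
    return {
        i: (TARGET[i], rng.random())
        for i, tok in enumerate(sequence)
        if tok == MASK
    }


def diffusion_decode(length, steps, seed=0):
    rng = random.Random(seed)
    sequence = [MASK] * length            # start fully masked
    per_step = math.ceil(length / steps)  # positions to commit per pass
    passes = 0
    while MASK in sequence:
        predictions = toy_denoiser(sequence, rng)  # all masked positions at once
        passes += 1
        # Commit the most confident predictions; the rest stay masked.
        ranked = sorted(predictions, key=lambda i: predictions[i][1], reverse=True)
        for i in ranked[:per_step]:
            sequence[i] = predictions[i][0]
    return sequence, passes


tokens, passes = diffusion_decode(length=len(TARGET), steps=3)
print(" ".join(tokens), f"({passes} passes for {len(tokens)} tokens)")
```

With nine tokens and three steps, the loop finishes in three passes, where a token-by-token decoder would need nine. The order in which positions get committed depends on the confidences, not on left-to-right position.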

Why This Enables Higher Throughput

Autoregressive models have an inherent sequential dependency. Claude, GPT-4, and Llama all generate one token, then use that token as part of the context to generate the next. You can’t parallelize this at the token level — it’s a serial chain by design.

Diffusion models break that chain. Because they refine the entire output simultaneously, they map well onto GPUs, which are built for parallel computation. The result is significantly higher tokens-per-second throughput for the same amount of compute.

Mercury 2 Architecture: What’s New in the Second Generation

Mercury 2 improves on the original Mercury Coder release in several ways.

Improved Denoising Steps

One of the tuning challenges with diffusion language models is determining how many denoising steps to use. Too few and quality suffers — not enough refinement passes. Too many and you lose the speed advantage.

Mercury 2 uses a more efficient denoising schedule that achieves high-quality outputs in fewer iterations than the first generation. This is part of how it maintains competitive quality while keeping throughput high.

Better Calibration on Natural Language Tasks

The original Mercury models were primarily positioned as coding models. Mercury 2 expands capability across general instruction following, reasoning tasks, and natural language generation — not just code completion. The quality gap between Mercury 2 and leading autoregressive models narrows on these broader task categories compared to the first generation.

Scalable Architecture

Mercury 2 is available at multiple scales, allowing teams to choose between faster/lighter variants and larger models with higher reasoning quality. The architecture was designed to scale consistently, so the diffusion approach holds its speed advantages even at larger parameter counts.

Speed Benchmarks: Mercury 2 vs. Claude Haiku and Others

The headline claim is that Mercury 2 runs approximately five times faster than Claude Haiku in terms of throughput. This needs some context.

What “5x Faster” Actually Means

The speed comparison is measured in output tokens per second — how many tokens the model can generate per second at a given level of compute. This is the throughput metric that matters most for production applications like code completion, document generation, or real-time chat.
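
If you want to check a tokens-per-second figure yourself, a minimal timing harness might look like the sketch below. Everything here is an assumption for illustration: `generate` is a hypothetical callable that returns a list of output tokens, and `stub_generate` is a placeholder rather than a real model client. Note that this measures end-to-end throughput, so any time spent before the first token is included.

```python
import time


def tokens_per_second(generate, prompt, clock=time.perf_counter):
    """Time one call to `generate` and return output tokens per second.
    `generate` is a hypothetical callable returning a list of tokens;
    swap in a real client call and its token count."""
    start = clock()
    tokens = generate(prompt)
    elapsed = clock() - start
    return len(tokens) / elapsed


def stub_generate(prompt):
    # Placeholder model: pretends to work for a moment, returns 200 tokens.
    time.sleep(0.05)
    return ["tok"] * 200


rate = tokens_per_second(stub_generate, "Write a haiku about GPUs.")
print(f"{rate:.0f} tokens/sec")
```

For a fair comparison between two models, run the same prompts against both, request similar output lengths, and average over several runs.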

Claude Haiku is itself one of Anthropic’s faster models, optimized for speed within their lineup. Comparing to Haiku rather than Claude 3.5 Sonnet or Opus is a meaningful choice — Haiku already represents a speed-quality tradeoff optimized toward the speed end.

Mercury 2 claiming 5x over Haiku on tokens-per-second puts it in a genuinely different performance category for raw generation speed.

The Latency vs. Throughput Distinction

It’s worth being precise: latency (time to first token) and throughput (tokens per second once generation starts) are different metrics.

Autoregressive models can have low time-to-first-token because they can emit the first token as soon as the prompt has been processed. Diffusion models need to run at least one full denoising pass before producing any output, which can mean slightly higher latency to first token in some configurations.

Where Mercury 2 wins decisively is sustained throughput — generating long outputs quickly. For applications that need to produce hundreds or thousands of tokens (code generation, long-form drafts, document processing), the throughput advantage is the number that matters.
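
The interaction between the two metrics can be sketched with simple arithmetic: end-to-end time is roughly time-to-first-token plus output length divided by throughput. The numbers below are illustrative assumptions, not measurements of Mercury 2 or Claude Haiku. They only mirror the shape of the claim (a slower start, five times the throughput).

```python
def total_time(ttft_s, tokens_per_s, n_tokens):
    """End-to-end seconds: time to first token plus streaming time."""
    return ttft_s + n_tokens / tokens_per_s


def break_even_tokens(slow_start, fast_start):
    """Output length at which two (ttft_s, tokens_per_s) profiles tie.
    `slow_start` has the higher first-token latency and higher throughput."""
    (ttft_a, tps_a), (ttft_b, tps_b) = slow_start, fast_start
    return (ttft_a - ttft_b) / (1 / tps_b - 1 / tps_a)


# Illustrative numbers only, not measurements of any real model.
diffusion = (0.40, 1000.0)      # 0.40 s to first token, 1000 tok/s
autoregressive = (0.20, 200.0)  # 0.20 s to first token, 200 tok/s

n = break_even_tokens(diffusion, autoregressive)
print(f"break-even at about {n:.0f} tokens")
for length in (20, 1000):
    d = total_time(*diffusion, length)
    a = total_time(*autoregressive, length)
    print(f"{length} tokens: diffusion {d:.2f}s, autoregressive {a:.2f}s")
```

Under these assumed numbers the two profiles tie at about 50 output tokens. Shorter responses favor the lower first-token latency, and longer ones favor the higher throughput.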

Quality Benchmarks

Speed alone doesn’t matter if the output quality isn’t there. Inception Labs has published benchmark results showing Mercury 2 competitive with models like Claude Haiku and similar-scale autoregressive models on standard coding and reasoning benchmarks.

The honest picture: Mercury 2 isn’t ahead of larger, more capable models like Claude 3.5 Sonnet or GPT-4o on reasoning-heavy tasks. But it’s positioned to match or exceed smaller, faster autoregressive models — which is the relevant comparison for its use case.

Where Mercury 2 Fits (and Where It Doesn’t)

Mercury 2 isn’t the right pick for every use case, and not every application needs its speed. Here’s a realistic view of where it fits and where it doesn’t.

Strong Use Cases

Code completion and generation — Mercury’s roots are in coding models, and Mercury 2 continues to perform well on code tasks. High-throughput code generation (producing full files, refactoring large codebases, generating test suites) benefits directly from fast token output.

Batch processing workflows — Any pipeline that processes many documents, generates many summaries, or runs many completions in parallel benefits from raw throughput. Mercury 2’s architecture is designed for this.

Cost-sensitive, high-volume applications — Faster tokens per second at the same compute cost means lower cost per token. For applications running millions of completions, this matters a lot.

Real-time generation in interfaces — Applications where users watch text stream in benefit from high throughput. Mercury 2 can fill a screen faster than slower models.
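
The cost argument in the list above comes down to one division. Here is a rough sketch, assuming a single accelerator billed by the hour and a steady generation rate. The hourly price and throughput figures are hypothetical, chosen only to make the arithmetic easy to follow, and the sketch ignores batching, idle time, and vendor pricing.

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    """Dollars to generate one million output tokens on one accelerator."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical figures chosen for illustration, not vendor pricing.
GPU_COST_PER_HOUR = 3.60
baseline = cost_per_million_tokens(GPU_COST_PER_HOUR, tokens_per_second=200)
faster = cost_per_million_tokens(GPU_COST_PER_HOUR, tokens_per_second=1000)
print(f"baseline: ${baseline:.2f} per 1M tokens")
print(f"5x throughput: ${faster:.2f} per 1M tokens")
```

At a fixed hardware cost, five times the throughput means one fifth the cost per token. Whether that carries over to the price you actually pay depends on how the provider prices the model.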

Where Autoregressive Models Still Have an Edge

Complex multi-step reasoning — Tasks requiring careful, chain-of-thought reasoning across many logical steps still tend to favor larger autoregressive models with strong RLHF training. The jury is still out on how well diffusion models scale into deep reasoning.

Tasks requiring precise instruction following — Very precise formatting, structured outputs with complex schemas, and highly constrained generation tasks have historically been stronger suits for well-tuned autoregressive models.

Latency-critical first-token applications — If your application shows users partial output and time-to-first-token is the critical metric, evaluate Mercury 2’s specific denoising latency carefully against your target.

How MindStudio Lets You Test Mercury 2 (and 200+ Other Models) Without Setup

One of the practical barriers to evaluating new models like Mercury 2 is the setup overhead. You’d normally need to find the API, get credentials, write integration code, build a test harness, and compare outputs manually.

MindStudio removes that friction. The platform gives you access to over 200 AI models — including Mercury 2 and models like Claude Haiku — directly in a no-code visual builder. You can swap models inside any AI agent or workflow with a single dropdown change, which means you can run real comparisons using your actual prompts and data, not synthetic benchmarks.

This is genuinely useful when a new model like Mercury 2 enters the picture. The benchmarks tell you one story, but running your specific use case — your prompts, your output length expectations, your quality bar — tells you a different one. MindStudio makes that comparison take minutes instead of days.

For teams building automated document processing pipelines, code generation tools, or high-volume content workflows, plugging in Mercury 2 and comparing its throughput against Claude Haiku or other models in the same workflow is exactly the kind of evaluation that helps you make a real decision.

You can try MindStudio free at mindstudio.ai.

Diffusion LLMs vs. Autoregressive LLMs: The Bigger Picture

Mercury 2 isn’t just an interesting model — it represents a broader question about whether autoregressive generation is the right architecture for all language tasks long-term.

Autoregressive Models Dominate — But Why?

The dominance of autoregressive models (GPT, Claude, Llama, Gemini) isn’t because the architecture is theoretically optimal. It’s largely because the training recipe is well-understood, the scaling laws are predictable, and enormous amounts of infrastructure have been built around it.

Diffusion models for text have existed in research for several years, but Inception Labs is among the first to push them to production scale with competitive quality. The research on masked diffusion language models from academic groups laid important groundwork, but Mercury represents a serious commercial implementation.

The Parallel Generation Advantage May Compound

As output length requirements grow, the sequential work in autoregressive generation grows linearly with them, because every output token needs its own decoding step. A model generating a 4,000-token response does four times as much sequential work as one generating 1,000 tokens.

Diffusion models have more favorable scaling properties for long outputs because many tokens are refined in parallel per denoising step. The practical advantage of the approach may actually grow as applications demand longer outputs.
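
A back-of-the-envelope count of sequential model passes illustrates the point for the 1,000-token and 4,000-token responses mentioned above. The `tokens_per_step` parameter is an assumption: an average number of tokens committed per denoising pass, picked for illustration and not a published Mercury 2 figure.

```python
import math


def autoregressive_passes(output_tokens):
    """One forward pass per generated token, each depending on the last."""
    return output_tokens


def diffusion_passes(output_tokens, tokens_per_step):
    """Sequential denoising passes if each pass commits, on average,
    `tokens_per_step` tokens (an assumed figure for illustration)."""
    return math.ceil(output_tokens / tokens_per_step)


for n in (1000, 4000):
    ar = autoregressive_passes(n)
    df = diffusion_passes(n, tokens_per_step=8)
    print(f"{n} tokens: {ar} sequential passes vs {df}")
```

In this simple model both counts still grow with output length. What changes is the constant factor, and a single denoising pass is not necessarily as cheap as a single autoregressive decoding step, so treat the ratio as a ceiling on the speedup and not a prediction.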

What This Means for the Model Landscape

Mercury 2 isn’t positioning itself as a general-purpose frontier model competing with GPT-4o. The smarter read is that it occupies a distinct segment: high-throughput, cost-efficient generation for applications where speed matters and the task complexity is well-defined.

That’s a large and valuable segment. Code generation, document processing, AI agent workflows that need to generate many intermediate outputs, and real-time applications all fit this profile.

The likely outcome isn’t that diffusion models replace autoregressive models — it’s that they become the right choice for a specific class of use cases where throughput is the primary constraint.

FAQ

What is Mercury 2?

Mercury 2 is the second-generation release in the Mercury family of large language models, built by Inception Labs using a diffusion-based architecture rather than the autoregressive approach used by most modern LLMs. Instead of generating text one token at a time, Mercury 2 refines entire output sequences in parallel through a process adapted from image diffusion. The result is significantly higher throughput than comparably sized autoregressive models.

How is a diffusion language model different from a regular LLM?

Standard LLMs (like Claude, GPT-4, and Llama) generate text autoregressively — each token is produced sequentially, depending on all previous tokens. Diffusion language models start with a masked or noisy sequence and refine all positions simultaneously across multiple denoising steps. This parallel generation allows modern GPU hardware to work much more efficiently, producing more tokens per second for the same compute.

Is Mercury 2 actually 5x faster than Claude Haiku?

The 5x speed claim refers to output throughput — tokens generated per second — rather than time-to-first-token or perceived response speed. Inception Labs benchmarks show Mercury 2 generating output at roughly five times the tokens-per-second rate of Claude Haiku in comparable conditions. For long-form generation tasks (documents, code files, batch processing), this throughput advantage translates directly to faster completion times and lower cost.

What is Mercury 2 best used for?

Mercury 2’s architecture makes it well-suited for high-throughput applications: code generation and completion, batch document processing, high-volume content workflows, and real-time generation in user-facing products. It’s competitive with Claude Haiku and similar models on coding and general tasks while offering significantly higher throughput.

Does Mercury 2 compromise on quality for speed?

There’s always a quality-speed tradeoff to consider, but Mercury 2’s benchmarks show it competitive with similar-scale autoregressive models on standard coding and reasoning tasks. It isn’t designed to compete with larger frontier models on complex reasoning tasks — its positioning is more directly against fast, efficient models like Claude Haiku and comparable open-weight models. For its target use cases, the quality is competitive.

Can I use Mercury 2 through MindStudio?

Yes. MindStudio provides access to 200+ AI models including Mercury 2 without requiring separate API accounts or integration code. You can build workflows that use Mercury 2 and compare its outputs directly against other models using the same prompts. This is useful for evaluating whether Mercury 2’s speed advantage matters for your specific use case.

Key Takeaways

Mercury 2 is a diffusion-based language model from Inception Labs that generates text by refining entire output sequences in parallel, rather than producing tokens one at a time.
The core speed advantage comes from parallel token refinement: diffusion generation maps well to GPU hardware in ways that autoregressive generation fundamentally doesn’t.
Benchmarks show Mercury 2 generating output at roughly five times the throughput of Claude Haiku, making it a strong candidate for code generation, batch processing, and high-volume text workflows.
Mercury 2 is competitive with similar-scale autoregressive models on quality, though it isn’t positioned to replace frontier models on complex reasoning tasks.
The most practical way to evaluate whether Mercury 2’s speed advantage matters for your use case is to test it against your actual prompts — MindStudio makes this easy without any setup overhead.

If you’re building AI workflows that require fast, high-volume text generation, Mercury 2 is worth a close look. You can start testing it alongside other models on MindStudio for free.