Granite Speech 4.1 2BN Transcribes 1 Hour of Audio in 2 Seconds on H100 — How NLE Makes It Possible
IBM's non-autoregressive model hits a real-time factor of 1820. Here's how the NLE technique achieves that without sacrificing accuracy.
One Hour of Audio. Two Seconds. Here’s the Architecture Behind It.
Granite Speech 4.1 2BN achieves a real-time factor of 1820 on an H100 — meaning it can transcribe a full hour of audio in approximately 2 seconds. That number is not a typo, and it’s not a cherry-picked benchmark on a toy dataset. It comes from IBM’s own model card, and it’s the result of a specific architectural choice called NLE: Non-autoregressive LLM-based Editing.
If you’re building anything that involves transcribing audio at scale — podcast archives, call center recordings, court transcripts, video libraries — this number should stop you in your tracks.
The question worth asking is: how does a model get there without completely falling apart on accuracy? That’s what this post is about.
The Benchmark That Requires Explanation
The Hugging Face Open ASR Leaderboard tracks real-time factor (RTF×) alongside word error rate. RTF× tells you how many seconds of audio a model can process per second of compute. A score of 1 means real-time. A score of 231 — which is where the Granite Speech 4.1 2B base model sits — means you can transcribe an hour of audio in about 16 seconds.
The 2BN model’s score of 1820 means you can transcribe that same hour in roughly 2 seconds.
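If you want to sanity-check those numbers yourself, the conversion is trivial. Here's the arithmetic in a few lines of Python, using the leaderboard figures quoted above:

```python
def transcription_seconds(audio_hours: float, rtfx: float) -> float:
    """Wall-clock seconds needed to transcribe `audio_hours` of audio at a given RTF×."""
    return audio_hours * 3600 / rtfx

print(transcription_seconds(1, 231))            # base model: ~15.6 s per hour of audio
print(transcription_seconds(1, 1820))           # 2BN model:  ~2.0 s per hour of audio
print(transcription_seconds(1000, 1820) / 60)   # 1,000-hour backlog: ~33 minutes
```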
For context: the base model already holds the #1 position on the Open ASR Leaderboard with a word error rate of 5.33%, beating Whisper, Parakeet, and Canary. The 2BN model trades some features for an 8× throughput gain over that already-fast baseline.
The question is what you’re trading, and whether the trade is worth it for your use case.
Why Autoregressive Models Have a Speed Ceiling
To understand why 1820 is notable, you need to understand why most ASR models can’t get there.
Whisper, Parakeet, Canary — virtually every transformer-based ASR model in wide use today is autoregressive. They generate one token at a time. Each token is conditioned on the previous one. That means decoding is fundamentally sequential: the GPU does a forward pass, produces a token, waits, does another forward pass, produces the next token.
This is a hardware utilization problem. Modern GPUs are built for massive parallelism. Autoregressive decoding forces them into a serial loop. You’re leaving most of the silicon idle most of the time.
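Here's the bottleneck in sketch form — a minimal greedy decoding loop, not any particular model's implementation. The model call signature and token ids are placeholders; the point is the serial dependency:

```python
import torch

def autoregressive_decode(model, audio_features, bos_id=1, eos_id=2, max_len=448):
    # Every new token depends on the ones already generated, so the GPU runs
    # one small forward pass per token instead of one large parallel pass.
    tokens = [bos_id]
    for _ in range(max_len):
        logits = model(audio_features, torch.tensor([tokens]))  # serial forward pass
        next_token = int(logits[0, -1].argmax())
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```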
The obvious fix — predict the entire transcript in parallel in a single shot — has been tried for years. It generally doesn’t work well. When you predict a whole sequence from scratch without conditioning on what you’ve already written, accuracy degrades badly. The model loses the ability to use context it has already established.
IBM’s answer to this is NLE, and it’s a more elegant solution than “just predict everything at once.” The same tension between sequential quality and parallel speed shows up across model families — it’s worth reading how effort levels affect output quality in coding models to see how different architectures handle the tradeoff.
How NLE Actually Works
Non-autoregressive LLM-based Editing is a two-step process.
Step one: a frozen CTC encoder runs over the audio and produces a draft transcript. CTC (Connectionist Temporal Classification) encoders are cheap and fast. They’re not perfect, but they get most of the words right most of the time. Think of this as a rough first pass — good enough to be a starting point, not good enough to ship.
Step two: a bidirectional attention model reads that draft and edits it. It can copy tokens that are already correct, insert missing words, delete errors, or replace wrong tokens with right ones. Because the attention is bidirectional — not left-to-right like an autoregressive decoder — the model can see the entire draft at once and make all its edits in parallel.
This is the key insight. You’re not asking the model to generate a transcript from nothing. You’re asking it to fix a draft. That’s a much easier task, and it’s one that maps naturally onto parallel computation. The CTC draft handles the heavy lifting of “what words are probably here,” and the editing pass handles the refinement.
The result: you get the parallelism benefits of non-autoregressive decoding without the accuracy collapse that comes from predicting everything cold.
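The model card doesn't ship the internals as a neat snippet, so take the following as an illustrative sketch of the two-step shape rather than the actual Granite Speech code — every function name here is made up, and the editor's interface is an assumption:

```python
import torch

def ctc_collapse(frame_ids, blank_id=0):
    # Standard CTC post-processing: merge repeated labels, drop blanks.
    collapsed = []
    for seq in frame_ids.tolist():
        out, prev = [], None
        for t in seq:
            if t != prev and t != blank_id:
                out.append(t)
            prev = t
        collapsed.append(out)
    return collapsed

def nle_transcribe(ctc_encoder, editor, audio_batch):
    # Step 1: a frozen CTC encoder emits a rough draft for every file in the batch.
    with torch.no_grad():
        frame_logits = ctc_encoder(audio_batch)              # (batch, frames, vocab)
        draft_tokens = ctc_collapse(frame_logits.argmax(-1))

    # Step 2: a bidirectional editor sees the whole (padded) draft at once and
    # rewrites it in a single parallel forward pass -- copy, insert, delete, replace.
    with torch.no_grad():
        edited_logits = editor(audio_batch, draft_tokens)    # one pass, no decode loop
        final_tokens = edited_logits.argmax(-1)

    return final_tokens
```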
The Trade-offs Are Real and Specific
The 2BN model is not a drop-in replacement for the base or Plus variants. IBM is explicit about what you’re giving up.
No keyword biasing. The base model lets you pass a list of names, acronyms, or technical terms in the prompt, and the model weights recognition toward them. If you’re transcribing domain-specific content — medical terminology, product names, legal jargon — keyword biasing can meaningfully reduce errors on exactly the words that matter most. The 2BN model has none of this.
No speaker diarization. The Plus model adds speaker-attributed ASR: it labels output as “Speaker 1” and “Speaker 2,” which you can then map to real names in post-processing. The 2BN model produces a flat transcript with no speaker information.
No word-level timestamps. The Plus model reportedly beats customized versions of Whisper X on timestamp accuracy — a significant claim given that Whisper X was specifically built for that task. The 2BN model gives you words, not timing.
No translation. The base model supports bidirectional speech translation across seven languages (including French, German, Spanish, Portuguese, and Japanese, with English as the hub). The 2BN model is transcription-only.
The 2BN model also requires flash attention, which adds a dependency. If you’re running on older hardware or in a Colab environment with a T4, you may hit compatibility issues — particularly if you’re on CUDA 13, where you may need to compile flash attention yourself rather than installing a prebuilt wheel.
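Before you spend time debugging model behavior, it's worth a quick environment check. Here's a minimal sanity check, assuming you install flash-attn from PyPI (the pip command in the comment is the standard one for that package):

```python
import importlib.util
import torch

# Flash attention needs a CUDA GPU and a flash-attn build that matches
# your torch + CUDA versions.
assert torch.cuda.is_available(), "No CUDA device visible -- flash attention won't load"
print("torch", torch.__version__, "| CUDA", torch.version.cuda)

if importlib.util.find_spec("flash_attn") is None:
    # Prebuilt wheels cover common torch/CUDA pairs; on newer CUDA versions
    # you may need to build from source:
    #   pip install flash-attn --no-build-isolation
    raise RuntimeError("flash-attn is not installed in this environment")
```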
What you get in exchange for all of this: 1820× real-time throughput on an H100, with word error rate that stays competitive.
Where This Actually Matters
The use cases for the 2BN model are narrower than the base or Plus variants, but within those use cases, the throughput advantage is decisive.
Consider a podcast network with 10 years of back catalog — thousands of hours of audio that need to be transcribed for search indexing, accessibility compliance, or training data. With an autoregressive model at RTF× 231, transcribing 1,000 hours of audio takes roughly 4.3 hours of compute time. With the 2BN model at RTF× 1820, that same job takes about 33 minutes. The difference isn’t marginal; it changes whether this is a batch job you run overnight or a job you run in the time it takes to get coffee.
The same logic applies to any bulk ingestion pipeline: legal discovery, media monitoring, academic corpus processing, fine-tuning data preparation. If you need the text and you don’t need speaker labels or timestamps, the 2BN model is the right tool.
For fine-tuning specifically: the GitHub repository for Granite Speech includes a fine-tuning notebook from the previous Granite version that should carry over. The suggested workflow — use existing episode transcripts as training data to fine-tune for a specific host’s voice or accent — is practical and well-suited to the 2BN model’s throughput characteristics. You can generate a large training corpus fast, then fine-tune on it.
When you’re building the downstream application that consumes these transcripts — say, a search interface or a summarization pipeline — the orchestration layer matters as much as the model. MindStudio is an enterprise AI platform with 200+ models and 1,000+ integrations that handles exactly this kind of multi-model chaining: you could wire the 2BN transcription output into a summarization model, a classification step, or a retrieval system without writing the orchestration code from scratch.
The Non-obvious Detail: Batching Is the Multiplier
The RTF× 1820 figure is a batched benchmark on an H100. This is worth unpacking.
Non-autoregressive models benefit from batching in a way that autoregressive models don’t, because the editing pass is a single parallel forward pass over the entire sequence. When you batch multiple audio files together, you’re filling the GPU’s parallel compute capacity more completely. The H100 has enough memory bandwidth and compute to handle large batches efficiently, which is why the benchmark number is as high as it is.
On a workstation GPU — an RTX Pro 6000 Blackwell, for instance, which is what the demo hardware used — you won’t see 1820×. The model still runs fast, but the batching dynamics are different and the memory bandwidth is lower. The H100 number is a ceiling, not a floor.
This also means that if you’re running the 2BN model in a low-latency single-file context (one audio file at a time, no batching), you won’t see the full throughput benefit. The model is optimized for bulk processing. If your use case is real-time transcription of a single live stream, the base model’s RTF× 231 is probably sufficient and comes with more features.
The practical implication: design your pipeline around batching if you want to approach the benchmark numbers. Chunk your audio, batch the chunks, and process them together rather than sequentially.
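In practice that can be as simple as slicing long files into fixed-length windows and grouping them before inference. The chunk length, batch size, and transcribe_batch function below are placeholders — the point is the shape of the loop, not an exact API:

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 30
BATCH_SIZE = 32

def chunk_waveform(waveform: np.ndarray) -> list[np.ndarray]:
    # Split one long recording into fixed-length windows.
    step = SAMPLE_RATE * CHUNK_SECONDS
    return [waveform[i:i + step] for i in range(0, len(waveform), step)]

def transcribe_archive(waveforms, transcribe_batch):
    # Flatten every file into chunks, then feed the GPU full batches
    # instead of one file at a time.
    chunks = [c for w in waveforms for c in chunk_waveform(w)]
    texts = []
    for i in range(0, len(chunks), BATCH_SIZE):
        texts.extend(transcribe_batch(chunks[i:i + BATCH_SIZE]))
    return texts
```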
What the Architecture Tells You About the Broader Direction
NLE is not a one-off trick. It’s an instance of a broader pattern: use a cheap, fast model to produce a draft, then use a more capable model to refine it. You see this in speculative decoding for LLMs, in cascaded ASR pipelines, and in draft-then-edit approaches to text generation.
The reason this pattern keeps appearing is that it maps well onto what GPUs are actually good at. Parallel refinement of an existing sequence is a much better fit for modern hardware than sequential generation from scratch.
IBM’s implementation is notable because they’ve applied it to ASR specifically and gotten the accuracy to hold up. The word error rate on the 2BN model doesn’t crater relative to the base model, which is the historical failure mode for non-autoregressive approaches. The CTC draft quality is high enough that the editing pass doesn’t have to do heroic work.
This is also why the model is 2 billion parameters rather than much larger. The editing task, given a good draft, is tractable at this scale. You don’t need a massive model to fix a mostly-correct transcript.
If you’re thinking about where to apply similar patterns in your own pipelines — draft-then-refine, cheap-encoder-plus-parallel-editor — the Granite Speech architecture is a concrete existence proof that the approach works at production quality. For teams building spec-driven applications where the pipeline logic itself needs to be versioned and maintained, Remy takes a related approach: you write an annotated markdown spec as the source of truth, and the full-stack application — TypeScript backend, database, auth, deployment — is compiled from it. The draft-then-refine intuition applies at the application layer too.
What to Do With This
If you’re evaluating ASR infrastructure this quarter, the Granite Speech 4.1 suite is worth a serious look. The 2BN model specifically is worth benchmarking if you have bulk transcription workloads.
The model is available on Hugging Face and is compatible with the standard transformers library via AutoProcessor. The code to get it running is straightforward. The main dependency to sort out is flash attention — make sure your PyTorch version, CUDA version, and flash attention version are aligned before you start debugging mysterious errors.
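Here's a loading sketch, assuming the 2BN checkpoint follows the same AutoProcessor / AutoModelForSpeechSeq2Seq pattern as earlier Granite Speech releases — treat the model class as an assumption and take the exact model ID from the Hugging Face model card:

```python
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

model_id = "ibm-granite/..."  # placeholder -- use the 2BN checkpoint name from the model card

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # the 2BN model requires flash attention
    device_map="cuda",
)
```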
The benchmark numbers are real, but you need batched inference on H100-class hardware to approach them. If you’re on a cloud provider with H100 instances, run your own benchmark on a representative sample of your audio before committing to the architecture.
The fine-tuning notebook on GitHub is worth bookmarking even if you don’t use it immediately. The workflow — existing transcripts as training data, fine-tuned model for a specific voice or domain — is one of the more practical fine-tuning setups available for open ASR models right now.
One opinion: IBM is not getting enough credit for this release. The Granite Speech suite solves three distinct problems (accuracy, structure, throughput) with three distinct models, each with a clear use case. That’s a more coherent product decision than shipping one model and calling it universal. The 2BN model in particular is the kind of thing that should show up in open-weight model comparisons more often than it does — it’s not competing on the same axes as most models in that conversation, which is exactly the point. For a sense of how frontier closed models are benchmarked against each other on similar throughput and accuracy tradeoffs, the GPT-5.4 vs Claude Opus 4.6 comparison is a useful reference frame for what “production-ready” looks like at the top of the leaderboard.
The non-autoregressive approach has been a research curiosity for years. Granite Speech 4.1 2BN is the clearest demonstration yet that it’s ready for production. Two seconds per hour of audio is not a research result. It’s an infrastructure decision.