
IBM Granite Speech 4.1: 3 Models, One Leaderboard Crown, and a 2-Second Hour of Audio

IBM's new ASR suite has three models for three use cases. The fastest transcribes an hour of audio in 2 seconds. Here's what each one does.

MindStudio Team

IBM Just Released Three ASR Models at Once. Here’s What Each One Actually Does.

IBM shipped the Granite Speech 4.1 suite last week, and if you’ve been sleeping on their model releases, this is the one that should wake you up. Three models, all around 2 billion parameters, each built for a different bottleneck: the 2B base (WER 5.33%, RTF ~231), the 2B Plus (diarization, word-level timestamps), and the 2BN (RTF 1820, one hour of audio in roughly 2 seconds). You don’t pick the best one — you pick the right one for your pipeline.

That framing is IBM’s own, and it’s the correct way to think about this release. Most ASR model launches give you one model and ask you to live with its trade-offs. Granite Speech 4.1 gives you three distinct tools with explicit trade-off documentation. That’s a different kind of product decision.

The suite sits on the Hugging Face Open ASR Leaderboard, where the base model currently holds the top position. It’s compatible with the standard Transformers library via AutoProcessor. And all three models are open weights, which matters if you’re building anything that can’t route audio through a third-party API.
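If you want to kick the tires, the loading pattern is familiar. Here’s a minimal sketch that assumes the 4.1 checkpoints follow the same AutoProcessor plus AutoModelForSpeechSeq2Seq pattern and chat-style prompt as earlier Granite Speech releases; the model ID and prompt wording below are placeholders to verify against the Hugging Face model card.

```python
# Minimal transcription sketch. Model ID, model class, and prompt format are
# assumptions carried over from earlier Granite Speech releases; verify against
# the 4.1 model card before relying on them.
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

model_id = "ibm-granite/granite-speech-4.1-2b"  # hypothetical ID
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)

# Granite Speech expects 16 kHz mono audio.
wav, sr = torchaudio.load("meeting.wav")
wav = torchaudio.functional.resample(wav.mean(dim=0, keepdim=True), sr, 16000)

# Earlier versions take a chat-style prompt with an <|audio|> placeholder.
chat = [{"role": "user", "content": "<|audio|>can you transcribe the speech into written format?"}]
prompt = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = processor(prompt, wav, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=512)
new_tokens = output[0, inputs["input_ids"].shape[1]:]  # drop the echoed prompt
print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))
```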


The Workhorse: Granite Speech 4.1 2B Base

The base model is the one with the leaderboard crown. Word error rate of 5.33% across a diverse set of benchmarks — not just LibriSpeech, which tends to flatter models that have been trained heavily on clean read speech. The Open ASR Leaderboard average is a harder, more honest number, and 5.33% puts it above Whisper, Parakeet, and Canary in the current rankings.

Speed is where it gets interesting for production use. The real-time factor on the leaderboard is approximately 231, meaning the model processes audio roughly 231 times faster than real time, so a full hour of audio transcribes in about 16 seconds. That’s already fast enough to make synchronous transcription feel instant for most use cases.

The model handles seven languages, including English, French, German, Spanish, Portuguese, and Japanese, and adds bidirectional speech translation to and from English. The translation support is underappreciated: you can feed it audio in French and get English text out, or feed it English audio and get Spanish text. For a 2B parameter model, that’s a wide surface area.

The feature that will matter most for domain-specific deployments is keyword biasing. You pass a list of names, acronyms, or technical terms directly in the prompt, and the model weights its recognition toward those tokens. If you’re transcribing medical dictation, legal proceedings, or any field with specialized vocabulary and unusual spellings, this is the feature that separates usable transcripts from ones that require heavy post-processing. The base model is autoregressive — standard transformer architecture, nothing exotic — which is why it can support these prompt-level controls.
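Because the biasing lives in the prompt, wiring it in is just string construction. The exact phrasing Granite expects for a keyword list isn’t something to guess at in production; treat the wording below as illustrative and check the model card for the documented format.

```python
# Keyword biasing sketch: pass domain terms in the prompt so recognition leans
# toward them. The phrasing here is illustrative, not the documented format.
keywords = ["amiodarone", "QTc interval", "Dr. Okafor", "tachyarrhythmia"]
bias_prompt = (
    "<|audio|>can you transcribe the speech into written format? "
    "The audio may contain the following terms: " + ", ".join(keywords)
)
chat = [{"role": "user", "content": bias_prompt}]
# From here, generation proceeds exactly as in the basic transcription sketch above.
```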

For teams building transcription pipelines that need accuracy across real-world audio, the base model is the obvious starting point. It’s the one you’d reach for if you’re not sure which variant fits yet.


The Structured Transcript Model: Granite Speech 4.1 2B Plus

The Plus model trades some language coverage and a few accuracy points for features that matter enormously when transcript structure is the product.

Speaker-attributed ASR — what the field calls diarization — is the headline. The model outputs speaker labels (Speaker 1, Speaker 2) alongside the transcript. It won’t give you names, but swapping generic labels for real names in post-processing is trivial once you have clean attribution. If you’re building a meeting recorder, a podcast tool, or anything where “who said what” is load-bearing information, the base model simply can’t do this job.
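That post-processing step really is small. Assuming the Plus model prefixes lines with generic “Speaker N” labels (worth confirming against the actual output format), mapping them to real names is a few lines of string handling:

```python
# Swap generic diarization labels for known participant names.
# Assumes output lines are prefixed with "Speaker 1:", "Speaker 2:", etc.
import re

def rename_speakers(transcript: str, names: dict[str, str]) -> str:
    return re.sub(r"\bSpeaker (\d+)\b", lambda m: names.get(m.group(1), m.group(0)), transcript)

raw = "Speaker 1: Let's review the Q3 pipeline.\nSpeaker 2: Bookings grew 14%."
print(rename_speakers(raw, {"1": "Priya", "2": "Marcus"}))
# Priya: Let's review the Q3 pipeline.
# Marcus: Bookings grew 14%.
```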

Word-level timestamps are the second major feature. Every word gets tagged with its end time. IBM claims the timestamp accuracy beats customized versions of WhisperX, tools that were specifically engineered for this task. That’s a strong claim, and it’s the kind of thing worth testing against your own audio before committing to a pipeline change. But if you’ve been using WhisperX specifically for word-level timing and finding it brittle, the Plus model is worth a serious evaluation.
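One concrete thing word-level end times buy you is cheap caption segmentation: split wherever the gap between consecutive words exceeds a pause threshold. The (word, end_time) pairs below are an assumed representation of the Plus model’s output, not its exact schema.

```python
# Group timestamped words into caption segments, splitting on long pauses.
# The (word, end_time) tuples are an assumed shape for the Plus model's output.
def to_segments(words: list[tuple[str, float]], max_gap: float = 0.8) -> list[dict]:
    segments, current, last_end = [], [], 0.0
    for word, end in words:
        if current and end - last_end > max_gap:
            segments.append({"text": " ".join(current), "end": last_end})
            current = []
        current.append(word)
        last_end = end
    if current:
        segments.append({"text": " ".join(current), "end": last_end})
    return segments

print(to_segments([("Welcome", 0.4), ("back", 0.7), ("everyone", 1.1), ("today", 2.6), ("we", 2.9)]))
# [{'text': 'Welcome back everyone', 'end': 1.1}, {'text': 'today we', 'end': 2.9}]
```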

The incremental decoding feature is less flashy but practically useful. You can pass a previously transcribed chunk as a prefix, and the model picks up from there. For long recordings — a four-hour podcast, a full-day conference — you need to chunk audio anyway. Incremental decoding lets you maintain consistent speaker numbering across chunks and avoid the seam artifacts that plague naive chunking approaches.
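In a pipeline, that looks like a loop that carries the tail of the running transcript forward as the prefix for the next chunk. How the prefix is actually supplied to the Plus model (in the prompt text or as a dedicated argument) is something to confirm against the model card; transcribe_chunk below is a hypothetical stand-in for that call.

```python
# Chunked long-form transcription with a carried prefix. `transcribe_chunk` is a
# hypothetical wrapper around the Plus model; the prefix mechanism is an assumption.
def transcribe_long_recording(chunks, transcribe_chunk, prefix_chars=2000):
    full_transcript = ""
    for chunk in chunks:
        # Pass only the tail of the running transcript: enough context to keep
        # speaker numbering and phrasing consistent across the seam, without
        # growing the prompt unboundedly.
        prefix = full_transcript[-prefix_chars:]
        full_transcript += transcribe_chunk(chunk, prefix=prefix)
    return full_transcript
```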

The trade-offs are real. Language support drops from seven to five — Japanese is gone, and so is the translation capability. Word error rate is slightly higher than the base model. These aren’t dealbreakers, but they’re the reason IBM ships three models instead of one.

The Plus model is the right choice when the transcript is a deliverable, not just an intermediate artifact. Court reporters, podcast producers, anyone building tools where the output goes directly to a human reader — this is your model.


The Throughput Engine: Granite Speech 4.1 2BN

The 2BN is a different kind of model entirely. The N stands for non-autoregressive, and that architectural choice is what produces the number that stops people mid-sentence: a real-time factor of 1820 on an H100, meaning one hour of audio transcribed in approximately 2 seconds.

To understand why that’s unusual, you need to understand what every other major ASR model is doing. Whisper, Parakeet, Canary — they’re all autoregressive. They generate one token at a time, each conditioned on the previous one. The GPU does a tiny forward pass, waits, does another. Parallelizing that is hard because the sequence is inherently sequential.

IBM’s solution is a technique called NLE: Non-autoregressive LLM-based Editing. Instead of generating a transcript from scratch in parallel (which tends to produce poor results because you lose the ability to condition on prior context), the model runs a two-step process. First, a frozen CTC encoder runs over the audio and produces a draft transcript quickly and cheaply. Then a bidirectional attention pass edits that draft — copy, insert, delete, replace operations — using full context in both directions. The draft gets most of the way there; the editing pass fixes what the CTC encoder missed.
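To make the “editing” framing concrete, here’s a toy rendering of the second step: a draft token sequence plus per-position edit decisions. This is an illustration of the idea only, not Granite’s actual decoding code or data format, and it skips insertions for brevity.

```python
# Toy illustration of draft-then-edit decoding: apply keep/replace/delete
# decisions to a CTC draft. Not the model's real implementation.
def apply_edits(draft: list[str], ops: dict[int, tuple]) -> list[str]:
    out = []
    for i, word in enumerate(draft):
        action, token = ops.get(i, ("keep", None))
        if action == "keep":
            out.append(word)
        elif action == "replace":
            out.append(token)
        # "delete" drops the draft word entirely
    return out

draft = ["the", "patient", "was", "given", "amio", "drone"]
ops = {4: ("replace", "amiodarone"), 5: ("delete", None)}
print(" ".join(apply_edits(draft, ops)))  # the patient was given amiodarone
```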

The result is near-autoregressive accuracy at non-autoregressive speed. The word error rate doesn’t crater the way you’d expect from a model running at 1820x real-time.

The trade-offs are significant, though. No keyword biasing. No speaker attribution. No timestamps. No translation. The 2BN is a raw throughput machine. If you need to process hundreds of hours of audio and the output is going into a search index or a training dataset rather than a human-readable document, this is the model. If you need any of the structured features, it’s not.

There’s also a practical deployment note: the non-autoregressive model requires Flash Attention. On standard Colab instances with older GPUs like T4s, getting Flash Attention installed can be painful. If you’re running CUDA 13, you may need to compile Flash Attention yourself, as the pre-built wheels often don’t match. The model card notes H100 for the 1820x benchmark — on consumer hardware, you’ll see lower numbers, though still fast.
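One way to keep a notebook portable is to check that flash_attn is importable before loading the model, and fail with a readable message instead of a cryptic import error mid-load. The attn_implementation argument is a standard transformers option; the model ID and model class are placeholders, as in the earlier sketches.

```python
# Check for Flash Attention before loading the 2BN; fail loudly if it's missing.
# Model ID and class are placeholders, as in the earlier sketches.
import importlib.util
from transformers import AutoModelForSpeechSeq2Seq

if importlib.util.find_spec("flash_attn") is None:
    raise RuntimeError(
        "flash_attn is not installed; the non-autoregressive 2BN expects Flash "
        "Attention. On mismatched CUDA versions you may need to build it from source."
    )

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "ibm-granite/granite-speech-4.1-2bn",      # hypothetical ID
    attn_implementation="flash_attention_2",   # standard transformers argument
)
```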


How the Three Models Fit Together

The Granite Speech 4.1 suite is designed around a question IBM is essentially asking you to answer before you pick a model: what is your actual bottleneck?

If your bottleneck is accuracy across diverse real-world audio, and you need language coverage plus keyword control, the base model is your answer. If your bottleneck is transcript structure — who said what, when, with word-level precision — the Plus model is your answer. If your bottleneck is raw throughput and you’re processing audio at scale where seconds-per-hour matter, the 2BN is your answer.
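If you want that decision framework as code, it’s barely more than a lookup. The model names below are placeholders; the point is that the selection key is the job, not a benchmark score.

```python
# The three-model decision framework as a trivial router. Model names are placeholders.
GRANITE_SPEECH = {
    "accuracy":   "granite-speech-4.1-2b",       # keyword biasing, broad language coverage, translation
    "structure":  "granite-speech-4.1-2b-plus",  # speaker labels, word-level timestamps
    "throughput": "granite-speech-4.1-2bn",      # non-autoregressive batch transcription
}

def pick_model(needs_speakers: bool, needs_timestamps: bool, batch_hours: float) -> str:
    if needs_speakers or needs_timestamps:
        return GRANITE_SPEECH["structure"]
    if batch_hours > 10:  # arbitrary threshold; tune to your own pipeline economics
        return GRANITE_SPEECH["throughput"]
    return GRANITE_SPEECH["accuracy"]
```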


What’s notable is that these aren’t just marketing segments. The architectural differences between the three models are real. The 2BN isn’t a quantized or pruned version of the base — it’s a fundamentally different architecture with different capabilities and different constraints. IBM built three separate tools, not one tool with three pricing tiers.

For teams building agentic pipelines where transcription is one node in a larger workflow, this matters. A pipeline that needs to route audio through different models based on the task — high-accuracy transcription for legal review, fast batch processing for archive ingestion, structured output for meeting summaries — can now do that with a single model family. Platforms like MindStudio handle this kind of orchestration across 200+ models and 1,000+ integrations, which is relevant when your transcription output needs to flow into downstream tools like Notion, Slack, or a CRM without custom glue code.


Fine-Tuning and What Comes Next

IBM has a fine-tuning notebook on GitHub from the previous Granite Speech version that should work with 4.1. The use cases here are specific and worth naming: court transcripts, where speaker patterns and legal vocabulary are consistent enough to make fine-tuning worthwhile; podcast transcription, where you can use existing episode transcripts as training data to adapt the model to a specific host’s voice, accent, and vocabulary.

The podcast case is particularly practical. If you have 50 episodes of a show already transcribed (even imperfectly), you have enough signal to fine-tune a model that will outperform a general-purpose ASR system on that specific voice. The base model’s keyword biasing handles some of this at inference time, but fine-tuning handles the deeper patterns — rhythm, pronunciation, domain vocabulary — that a prompt can’t capture.
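The preparation step for that podcast case is mostly plumbing: pair each episode’s audio with its existing transcript and write a manifest. The JSONL schema below is a generic sketch; the actual format the Granite fine-tuning notebook expects should be taken from the notebook itself.

```python
# Assemble (audio, transcript) pairs from an existing podcast archive into a
# JSONL manifest for fine-tuning. The schema is a generic sketch, not the
# notebook's required format.
import json
from pathlib import Path

records = []
for audio_path in sorted(Path("episodes").glob("*.wav")):
    transcript_path = audio_path.with_suffix(".txt")
    if transcript_path.exists():
        records.append({"audio": str(audio_path), "text": transcript_path.read_text().strip()})

Path("finetune_manifest.jsonl").write_text("\n".join(json.dumps(r) for r in records) + "\n")
print(f"{len(records)} episodes ready for fine-tuning")
```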

This is also where the broader Granite family becomes relevant. IBM’s Docling models handle OCR and structured PDF extraction from the same Granite ecosystem. If your pipeline involves both audio and documents (think legal discovery, where you’re processing both recorded depositions and written filings), you’re working within a single ecosystem rather than stitching together models from different providers.

For teams thinking about how to go from transcription output to a full application — say, a searchable deposition archive with speaker attribution and document cross-referencing — the tooling question becomes real quickly. Remy takes a different approach to that build problem: you write the application as an annotated spec in markdown, and it compiles a complete TypeScript backend, SQLite database, auth layer, and deployment from that spec. The spec is the source of truth; the code is derived output. That’s a different abstraction layer than writing the orchestration by hand.


The Honest Assessment

IBM’s Granite models have been underrated, and the speech release is a good example of why. The company doesn’t have the marketing surface area of OpenAI or Anthropic, and “IBM releases open ASR models” doesn’t generate the same social media velocity as a frontier model announcement. But the work is real.

Holding the top position on the Open ASR Leaderboard with a 5.33% word error rate is a concrete, verifiable result. A real-time factor of 1820 on an H100 is a concrete, verifiable result. The word-level timestamp accuracy that reportedly beats customized WhisperX is a claim worth testing, but it’s a specific claim with a specific comparison, not a vague assertion of superiority.


The risk for IBM is the same one that affected Microsoft’s Phi models: a strong initial release followed by reduced investment and a model family that stagnates. The speech suite is good enough that it would be a genuine loss if IBM doesn’t continue iterating on it. The three-model architecture suggests they’ve thought carefully about the problem space. Whether that thinking continues into 4.2 and beyond is the open question.

For now, if you’re building anything that involves transcription (and increasingly, that’s most AI applications), the Granite Speech 4.1 suite deserves a place in your evaluation. The base model is worth benchmarking against whatever you’re currently running. The Plus model is worth testing if you’ve been fighting with WhisperX for word-level timestamps. And the 2BN is worth knowing exists for the day you have a batch job that would otherwise take hours.

Three models, one decision framework. That’s a more useful product than most ASR releases manage to be.


If you’re evaluating ASR models for production pipelines, the GPT-5.4 vs Claude Opus 4.6 comparison covers a similar “pick the right model for the task” framework for language models. For teams building AI agents that consume transcription output, the AI agents for product managers post covers how structured transcript data fits into broader PM workflows. And if you’re thinking about edge deployment for any of these models, the Gemma 4 edge deployment breakdown covers the constraints and trade-offs in detail.
