How to Use IBM Granite Speech 4.1 for Speaker Diarization and Word-Level Timestamps
IBM Granite Speech 4.1 Plus adds speaker attribution and word-level timestamps to transcription. Learn how to use it for meetings, podcasts, and interviews.
What IBM Granite Speech 4.1 Actually Does
If you’ve ever tried to get a clean, searchable transcript from a two-hour team meeting or a podcast episode with three guests, you know the problem. A basic transcription tool gives you a wall of text. You don’t know who said what, and there’s no easy way to find the moment someone made a key point.
IBM Granite Speech 4.1 — specifically the Plus variant — addresses that directly. It adds speaker diarization and word-level timestamps to automatic speech recognition, giving you structured output you can actually use downstream. Whether you’re building a meeting intelligence tool, a legal transcription pipeline, or an automated podcast workflow, these capabilities change what’s possible with speech data.
This guide covers what IBM Granite Speech 4.1 Plus does, how speaker diarization works, how to set it up and run it, and where it fits into larger automated workflows.
Understanding the Granite Speech 4.1 Model Family
IBM’s Granite model family spans text, code, vision, and speech. The speech models are designed for production-grade audio tasks — not just hobbyist experiments. Granite Speech 4.1 is trained on a large multilingual corpus and optimized for accuracy across diverse audio conditions: different accents, background noise, variable recording quality.
There are two main variants:
- Granite Speech 4.1 Base — Handles core ASR (automatic speech recognition). You put in audio, you get text out.
- Granite Speech 4.1 Plus — Adds speaker diarization and word-level timestamps on top of transcription. This is the variant most people need for meeting notes, interview transcripts, or any multi-speaker content.
Both models are available as open weights through Hugging Face and can be accessed via IBM’s watsonx platform. The Plus variant is the focus here because the base transcription use case is well-covered — it’s the diarization and timestamp features that make this model stand out.
What “Open Weights” Means Here
Granite Speech 4.1 is released under an Apache 2.0 license, meaning you can run it locally, modify it, and use it commercially without per-call API fees. For teams processing high volumes of audio (think: call centers or media companies), this matters a lot for cost structure.
Speaker Diarization Explained
Speaker diarization answers one question: who spoke when?
It segments an audio file into time ranges and labels each segment with a speaker ID. The output isn’t “John said X” by default — it’s “Speaker 1 said X from 0:23 to 0:45.” You map those IDs to real names afterward, usually by cross-referencing a known voice sample or by manual review.
How Diarization Works Under the Hood
Modern diarization typically uses a pipeline approach:
- Voice Activity Detection (VAD) — Identifies regions of the audio that contain speech versus silence or noise.
- Speaker segmentation — Splits the audio into short segments where each segment likely contains only one speaker.
- Speaker embedding — Converts each segment into a vector representation of that speaker’s voice characteristics.
- Clustering — Groups segments with similar embeddings together as the same speaker.
- Re-segmentation — Cleans up the boundaries for a final output.
Granite Speech 4.1 Plus integrates this into its transcription pipeline rather than requiring a separate diarization model. That’s a meaningful practical advantage — fewer models to manage, fewer integration points to break.
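To make the clustering step concrete, here's a minimal sketch of how segments might be grouped by voice similarity, using scikit-learn's AgglomerativeClustering on synthetic stand-in embeddings. Granite Speech 4.1 Plus does all of this internally; this is only an illustration of the technique:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in embeddings: one vector per audio segment. A real
# pipeline would produce these with a speaker-embedding model.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0.0, 0.1, (5, 256)),  # five segments from one speaker
    rng.normal(1.0, 0.1, (3, 256)),  # three segments from another
])

# Group similar segments; distance_threshold lets the algorithm infer
# the number of speakers instead of requiring it up front.
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=10.0,
    linkage="average",
)
labels = clustering.fit_predict(embeddings)
print(labels)  # e.g. [0 0 0 0 0 1 1 1], i.e. two distinct speakers
```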
Common Diarization Challenges
Diarization is harder than transcription for a few reasons:
- Overlapping speech — Two people talking at once is difficult to attribute cleanly.
- Short turns — If someone says “right” or “yeah” frequently, those micro-turns can confuse the clustering.
- Similar voices — Voices with similar pitch and cadence are harder to distinguish.
- Variable audio quality — Phone calls, conference rooms, and recordings with background noise all degrade embedding quality.
Granite Speech 4.1 Plus handles these reasonably well on clean recordings and shows solid performance on typical meeting and interview audio.
Word-Level Timestamps: Why They Matter
Transcription without timestamps gives you text. Transcription with word-level timestamps gives you a searchable, referenceable record.
Every word in the output is tagged with a start time and an end time. That means:
- You can jump to the exact moment in a recording where a topic was discussed.
- You can generate automatic subtitles that sync precisely with the audio.
- You can build search tools that let users find specific moments by keyword.
- You can detect hesitation patterns, speaking pace, or segment durations for analytics.
Word-level timestamps are particularly valuable in legal, medical, and compliance contexts where precise attribution matters. But they’re also the foundation of more consumer-facing features — like podcast chapter generation or video caption editing.
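As a small illustration of the search use case, here's a sketch of a keyword lookup over timestamped word chunks. It assumes the output format covered in the walkthrough below, where each chunk carries its text, a speaker label, and a (start, end) timestamp pair:

```python
def find_keyword_moments(chunks, keyword):
    """Return (start_time, speaker) for every word chunk matching keyword."""
    keyword = keyword.lower()
    return [
        (chunk["timestamp"][0], chunk.get("speaker", "UNKNOWN"))
        for chunk in chunks
        if keyword in chunk["text"].lower()
    ]

# Jump straight to every mention of "budget" in a meeting recording:
# for start, speaker in find_keyword_moments(result["chunks"], "budget"):
#     print(f"{speaker} mentioned it at {start:.2f}s")
```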
Setting Up IBM Granite Speech 4.1 Plus
Here’s a practical walkthrough for running Granite Speech 4.1 Plus locally using Python and the Hugging Face transformers library.
Prerequisites
Before starting, make sure you have:
- Python 3.9 or higher
- PyTorch installed (with CUDA if you’re using a GPU)
- The transformers and datasets libraries from Hugging Face
- librosa or soundfile for audio loading
- Sufficient RAM or VRAM — the Plus model requires more than the Base variant
Install the core dependencies:
pip install transformers torch torchaudio librosa soundfile
Step 1: Load the Model and Processor
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
model_id = "ibm-granite/granite-speech-4.1-plus"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
Using device_map="auto" lets the library decide whether to use GPU or CPU based on your hardware. For production, you’ll want GPU for reasonable throughput.
Step 2: Load and Preprocess Your Audio
The model expects 16kHz mono audio. Most meeting recordings and podcast files use higher sample rates (44.1kHz and 48kHz are common), so resampling is necessary.
import librosa

audio_path = "meeting_recording.wav"
# librosa resamples to 16kHz and downmixes to mono during loading
audio, sample_rate = librosa.load(audio_path, sr=16000, mono=True)
For longer files (anything over 30 minutes), process in chunks to avoid memory issues. A 30-second chunk with some overlap at the boundaries works well in practice.
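A minimal chunking sketch, assuming the 16kHz NumPy array loaded above; the chunk and overlap sizes are illustrative defaults to tune, not values prescribed by the model:

```python
def chunk_audio(audio, sample_rate=16000, chunk_seconds=30, overlap_seconds=2):
    """Yield (offset_seconds, chunk) pairs with overlap at the boundaries."""
    chunk_size = chunk_seconds * sample_rate
    step = (chunk_seconds - overlap_seconds) * sample_rate
    for start in range(0, len(audio), step):
        yield start / sample_rate, audio[start:start + chunk_size]

# Keep each chunk's offset so its timestamps can be shifted back to
# absolute positions in the original recording after inference.
```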
Step 3: Run Transcription with Diarization
inputs = processor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
    return_timestamps="word"  # Request word-level timestamps
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        return_timestamps=True,
        output_scores=True
    )

result = processor.decode(outputs[0], output_offsets=True)
The return_timestamps="word" parameter is what activates word-level timestamps. The output_offsets=True in the decode call ensures the timing data is included in the output object.
Step 4: Parse the Output
The result includes a chunks list where each entry contains the text, speaker ID, start time, and end time:
for chunk in result["chunks"]:
    speaker = chunk.get("speaker", "Unknown")
    text = chunk["text"]
    start = chunk["timestamp"][0]
    end = chunk["timestamp"][1]
    print(f"[{start:.2f}s - {end:.2f}s] {speaker}: {text}")
A sample output might look like:
[0.12s - 0.84s] SPEAKER_00: Thanks
[0.84s - 1.60s] SPEAKER_00: everyone
[1.60s - 2.12s] SPEAKER_00: for
[2.12s - 2.80s] SPEAKER_00: joining
[2.80s - 3.40s] SPEAKER_00: today
[4.20s - 5.10s] SPEAKER_01: Happy
[5.10s - 5.40s] SPEAKER_01: to
[5.40s - 5.65s] SPEAKER_01: be
[5.65s - 5.90s] SPEAKER_01: here
Step 5: Structure the Output
For most applications, you’ll want to consolidate word-level chunks into speaker turns before storing or displaying results:
def consolidate_turns(chunks):
    turns = []
    current_speaker = None
    current_text = []
    current_start = None
    current_end = None

    for chunk in chunks:
        speaker = chunk.get("speaker", "UNKNOWN")
        if speaker != current_speaker:
            if current_speaker is not None:
                turns.append({
                    "speaker": current_speaker,
                    "text": " ".join(current_text),
                    "start": current_start,
                    "end": current_end
                })
            current_speaker = speaker
            current_text = [chunk["text"]]
            current_start = chunk["timestamp"][0]
            current_end = chunk["timestamp"][1]
        else:
            current_text.append(chunk["text"])
            current_end = chunk["timestamp"][1]

    if current_speaker:
        turns.append({
            "speaker": current_speaker,
            "text": " ".join(current_text),
            "start": current_start,
            "end": current_end
        })
    return turns
This produces clean, readable output per speaker segment rather than per word.
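Running it over the decoded result from Step 3 might look like this; the sample output corresponds to the word-level example above:

```python
turns = consolidate_turns(result["chunks"])
for turn in turns:
    print(f'[{turn["start"]:.2f}s - {turn["end"]:.2f}s] '
          f'{turn["speaker"]}: {turn["text"]}')

# [0.12s - 3.40s] SPEAKER_00: Thanks everyone for joining today
# [4.20s - 5.90s] SPEAKER_01: Happy to be here
```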
Real-World Use Cases
Meeting Intelligence
The most common application. Feed in a recorded team call and get back a structured transcript showing who said what, with timestamps. Downstream, you can:
- Auto-generate meeting summaries attributed to specific participants
- Extract action items with the speaker who committed to them
- Build searchable archives of past meetings
- Flag specific keywords and jump to those moments in the recording
The combination of diarization + timestamps is essential here. A summary that says “action item: ship the feature by Friday” isn’t as useful as “action item: [SPEAKER_01, 14:22] ship the feature by Friday.”
Podcast and Interview Production
Podcast editors use diarized transcripts for:
- Generating accurate show notes with speaker-attributed quotes
- Creating chaptered transcripts for listener navigation
- Building subtitle tracks for video versions of the podcast
- Training internal tools on interview content
Word-level timestamps make it straightforward to generate SRT or VTT subtitle files. Python libraries like pysrt can help, or you can render the format directly from consolidated turns, as shown below.
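As a sketch of what that looks like, here's a small function that writes consolidated speaker turns in the SRT format with no subtitle library required:

```python
def to_srt(turns):
    """Render consolidated speaker turns as an SRT subtitle string."""
    def fmt(seconds):
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    cues = []
    for i, turn in enumerate(turns, start=1):
        cues.append(
            f"{i}\n"
            f"{fmt(turn['start'])} --> {fmt(turn['end'])}\n"
            f"{turn['speaker']}: {turn['text']}\n"
        )
    return "\n".join(cues)

with open("meeting.srt", "w", encoding="utf-8") as f:
    f.write(to_srt(consolidate_turns(result["chunks"])))
```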
Legal and Compliance Transcription
Depositions, hearings, and compliance calls all require transcripts that clearly show who said what. Manual transcription of this content is expensive and slow. Automated diarization with attribution can significantly reduce the human review burden — particularly when combined with a review step to map speaker IDs to actual names.
Call Center Quality Assurance
Call centers deal with high volumes of two-speaker audio (agent + customer). Diarized transcripts allow:
- Automated scoring of agent compliance with scripts
- Sentiment analysis per speaker
- Topic detection and escalation flagging
- Talk-time ratio analysis
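The last of those is nearly free once you have consolidated turns; a minimal sketch:

```python
from collections import defaultdict

def talk_time_ratios(turns):
    """Fraction of total speaking time attributed to each speaker."""
    totals = defaultdict(float)
    for turn in turns:
        totals[turn["speaker"]] += turn["end"] - turn["start"]
    total = sum(totals.values()) or 1.0
    return {speaker: seconds / total for speaker, seconds in totals.items()}

# e.g. {"SPEAKER_00": 0.72, "SPEAKER_01": 0.28} suggests the agent is
# doing most of the talking on the call
```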
The IBM documentation on Granite model capabilities provides additional context on enterprise deployment patterns for these workloads.
Research and Qualitative Analysis
Researchers conducting interviews can process dozens of recordings quickly. Word-level timestamps allow precise quotation with source attribution, which is useful when coding qualitative data or building thematic analyses.
Common Issues and How to Fix Them
Speaker IDs Aren’t Consistent Across Files
Diarization labels speakers relative to the current file. SPEAKER_00 in one recording isn’t the same person as SPEAKER_00 in another. If you’re building a system that tracks individuals across multiple recordings, you need to implement speaker enrollment — capture voice embeddings for known individuals and match against them.
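A minimal sketch of that matching step, assuming a hypothetical embed() function from whatever speaker-embedding model you choose, plus one stored reference embedding per known person:

```python
import numpy as np

def identify(segment_embedding, enrolled, threshold=0.7):
    """Match a segment embedding to the closest enrolled speaker by cosine similarity."""
    best_name, best_score = "UNKNOWN", -1.0
    for name, ref in enrolled.items():
        score = np.dot(segment_embedding, ref) / (
            np.linalg.norm(segment_embedding) * np.linalg.norm(ref)
        )
        if score > best_score:
            best_name, best_score = name, score
    # Below the threshold, treat the match as unreliable (tune per model)
    return best_name if best_score >= threshold else "UNKNOWN"

# embed() is a placeholder for your embedding model's inference call:
# enrolled = {"Alice": embed("alice_sample.wav"), "Bob": embed("bob_sample.wav")}
# identify(embed("segment_003.wav"), enrolled)
```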
Too Many Speaker IDs for the Actual Number of Speakers
This usually happens when:
- There’s significant background noise
- The audio has heavy compression artifacts
- A single speaker's delivery varies a lot within the conversation (shifts in volume, tone, or pace), splitting their segments across clusters
Try increasing the minimum cluster size in your diarization configuration, or apply noise reduction (using noisereduce or similar) before passing audio to the model.
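A minimal noise-reduction pass with the noisereduce library looks like this; verify on your own audio that it actually improves diarization rather than smearing voices:

```python
import noisereduce as nr

# audio is the 16kHz mono NumPy array loaded earlier with librosa
cleaned = nr.reduce_noise(y=audio, sr=16000)
```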
Overlapping Speech Gets Misattributed
When two people speak simultaneously, the model has to make a call. You can detect these regions by looking for very short alternating speaker segments. Flag them for manual review rather than treating the attribution as reliable.
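One heuristic sketch for finding those regions: flag very short turns sandwiched between different speakers, since rapid alternation often signals cross-talk. The duration threshold is a starting guess to tune, not a model parameter.

```python
def flag_possible_overlap(turns, max_turn_seconds=1.0):
    """Flag short turns wedged between other speakers as possible cross-talk."""
    flagged = []
    for prev, cur, nxt in zip(turns, turns[1:], turns[2:]):
        short = (cur["end"] - cur["start"]) < max_turn_seconds
        sandwiched = prev["speaker"] != cur["speaker"] != nxt["speaker"]
        if short and sandwiched:
            flagged.append(cur)
    return flagged  # route these segments to manual review
```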
Memory Issues on Long Files
For files longer than 30–45 minutes, chunk the audio with a small overlap (2–3 seconds) at each boundary. Make sure your speaker IDs are normalized across chunks — otherwise you’ll see SPEAKER_00 appear in chunk 1 and again as SPEAKER_02 in chunk 3 even though it’s the same person.
Low Accuracy on Non-English Audio
Granite Speech 4.1 is multilingual, but accuracy varies by language. For less-resourced languages, you may get better results fine-tuning on domain-specific data or using language-specific models alongside Granite for specific segments.
Building Diarization Workflows Without Writing Code
Setting up the Python pipeline above is manageable for developers. But if you want to build production-ready pipelines — with scheduling, integrations, human review steps, and downstream automation — writing and maintaining custom code gets complex quickly.
This is exactly where MindStudio fits in. MindStudio is a no-code platform that lets you build AI agents and automated workflows, with access to 200+ models and 1,000+ integrations out of the box. You can build a complete meeting transcription and analysis workflow without managing infrastructure.
A practical example: an automated meeting intelligence agent that watches a Google Drive folder, detects new audio uploads, runs transcription and diarization, generates a structured summary, and sends the output to Notion or Slack — all without writing a line of code. Connecting to Granite Speech 4.1 Plus is handled through MindStudio’s model integration layer, so you’re not dealing with model loading, chunking logic, or API authentication.
For teams at companies like those in financial services, consulting, or research that deal with heavy meeting and interview workloads, this kind of end-to-end automation saves significant time. You can try MindStudio free at mindstudio.ai and have a basic audio processing workflow running in under an hour.
If you’re already building AI agents in code — with LangChain, CrewAI, or Claude Code — MindStudio’s Agent Skills Plugin exposes these workflow capabilities as typed method calls, so your agents can trigger audio processing, send results to downstream tools, and handle the infrastructure layer without extra plumbing.
Comparing Granite Speech 4.1 Plus to Alternatives
It’s useful to know where Granite Speech 4.1 Plus sits relative to other options.
| Model / Service | Diarization | Word Timestamps | Open Weights | Cost Model |
|---|---|---|---|---|
| Granite Speech 4.1 Plus | ✅ Built-in | ✅ Built-in | ✅ Apache 2.0 | Self-hosted or watsonx |
| OpenAI Whisper (large-v3) | ❌ Not native | ✅ Yes | ✅ MIT | Self-hosted |
| AssemblyAI | ✅ Yes | ✅ Yes | ❌ API only | Per-minute pricing |
| Rev.ai | ✅ Yes | ✅ Yes | ❌ API only | Per-minute pricing |
| Deepgram | ✅ Yes | ✅ Yes | ❌ API only | Per-minute pricing |
Whisper is the most commonly used open-weight alternative, but it doesn’t include diarization natively. You’d need to combine it with a separate diarization library like pyannote.audio, which adds complexity and another model to manage.
The Granite Speech 4.1 Plus advantage is the integrated pipeline — one model, one inference call, structured output with both diarization and timestamps. For teams that need open-weights licensing (for privacy, compliance, or cost reasons), it’s a strong option. For detailed information on the Granite speech model architecture and benchmarks, Hugging Face’s model card provides the technical specifications.
Frequently Asked Questions
What is speaker diarization and why does it matter?
Speaker diarization is the process of segmenting an audio recording and labeling each segment by speaker — answering “who spoke when.” Without it, transcription gives you a single block of text with no attribution. With it, you get a structured conversation that shows each speaker’s contributions in order. This matters for any multi-speaker audio where attribution is needed: meetings, interviews, legal proceedings, call center recordings, and more.
How accurate is IBM Granite Speech 4.1 Plus for diarization?
Accuracy depends heavily on audio quality, the number of speakers, and how much overlap there is. On clean two-to-four speaker recordings (typical meeting or interview audio), Granite Speech 4.1 Plus performs well and is competitive with commercial APIs. Performance degrades on recordings with heavy background noise, more than six or seven speakers, or significant cross-talk. For precise benchmark figures, check the official IBM Granite model documentation.
Can IBM Granite Speech 4.1 run locally?
Yes. The model is released under an Apache 2.0 license and available on Hugging Face. You can run it on local hardware using the transformers library. A modern GPU (16GB+ VRAM recommended for the Plus variant) gives you practical inference speeds. CPU-only inference works but is significantly slower, making it impractical for long files.
What audio formats does Granite Speech 4.1 support?
The model works with any audio format you can load and convert to a 16kHz mono waveform in Python. WAV, MP3, MP4 audio tracks, FLAC, and OGG all work with standard libraries like librosa or soundfile. The key requirement is resampling to 16kHz before passing to the model processor.
How do I map speaker IDs to real names?
Granite Speech 4.1 Plus outputs generic labels (SPEAKER_00, SPEAKER_01, etc.). To map these to real names, you have a few options. If you have a known voice sample for each participant, you can extract embeddings and compare them to the diarized segments. For simpler workflows, a human review step works: display the first few segments per speaker and ask the reviewer to name each one. Tools like Label Studio or a lightweight custom UI can make this fast.
Does Granite Speech 4.1 support languages other than English?
Yes, Granite Speech 4.1 is multilingual. The training data includes multiple languages, though English and other high-resource languages have the strongest performance. For specialized domains (medical, legal) in non-English languages, consider fine-tuning on domain-specific data for best results.
Key Takeaways
- IBM Granite Speech 4.1 Plus combines transcription, speaker diarization, and word-level timestamps in a single open-weights model.
- Speaker diarization identifies who spoke when — essential for structured output from meetings, interviews, and calls.
- Word-level timestamps enable subtitle generation, precise search, and downstream analysis on every word in a recording.
- The model runs locally under Apache 2.0 licensing, making it practical for high-volume or privacy-sensitive workloads.
- Setup requires Python, PyTorch, and the Hugging Face transformers library — inference is a few dozen lines of code once you understand the output format.
- For teams that want this capability without managing model infrastructure, MindStudio can connect Granite Speech 4.1 Plus into fully automated workflows with integrations, scheduling, and no custom code required. Start building for free at mindstudio.ai.