AI Audio: Voice, Speech & Music
AI for audio — real-time voice agents (Pika Me-style), text-to-speech, voice cloning (ElevenLabs), music generation (Suno, Udio), sound effects, audio editing, transcription. Anything where the output or input is audio.
OpenAI GPT Realtime 2 vs Google Gemini TTS: Which AI Voice API Wins?
Compare OpenAI GPT Realtime 2 and Google Gemini TTS on expressiveness, speed, language support, and agentic capabilities to choose the right voice API.
How to Add Speaker Diarization to Your AI Transcription Workflow
Speaker diarization identifies who said what in audio. Learn how IBM Granite Speech 4.1 Plus adds speaker labels, word timestamps, and incremental decoding.
GPT Realtime 2 vs GPT Realtime Translate: Which Voice Model Do You Need?
OpenAI's new voice models serve different use cases. Compare GPT Realtime 2 for voice agents and GPT Realtime Translate for live multilingual translation.
What Is Speaker Diarization? How IBM Granite Speech 4.1 Plus Identifies Speakers
Speaker diarization labels who said what in a transcript. Learn how IBM Granite Speech 4.1 Plus handles speaker attribution and word-level timestamps.
How to Build a Live Translation Voice Agent with OpenAI's GPT Realtime API
GPT Realtime Translate supports 70+ input languages with real-time speech translation. Learn how to build a live translation agent using the API.
GPT Realtime 2 vs GPT Realtime Translate vs Whisper: Which Voice Model Do You Need?
OpenAI released three new realtime voice models. Compare GPT Realtime 2, Translate, and Whisper to find the right one for your voice agent.
GPT Realtime 2 Can Stay Silent on Command and Keep Listening — Here's Why That Changes Voice Agents
GPT Realtime 2 can be told to go silent, listen to a side conversation, and re-engage on command — solving the biggest friction point in live voice agents.
GPT Realtime Translate vs Traditional Real-Time Translation APIs — Is OpenAI's Pace-Matched Approach Worth It?
GPT Realtime Translate waits for verb-position keywords before translating, producing more natural dialogue. Here's how it stacks up against existing solutions.
GPT Realtime Voice Models: GPT Realtime 2, Translate, and Whisper Explained
OpenAI released three new realtime voice models with GPT-5 reasoning, live translation across 70+ languages, and streaming speech-to-text. Here's what each does.
How to Build a Voice Agent with OpenAI's Realtime API: Step-by-Step Setup Guide
OpenAI's Realtime API now supports reasoning, tool calls, and interruption handling. Here's how to set up your first voice agent from scratch.
OpenAI Launches 3 New Realtime Voice API Models: What Builders Need to Know Right Now
OpenAI dropped three new realtime voice API models at once: a reasoning voice agent, a live translator, and a streaming transcription model. Here's what's new.
How to Build a Production Voice Agent with GPT Realtime 2 API: Step-by-Step Setup Guide
GPT Realtime 2 supports reasoning and parallel tool calls during voice. Here's how to set it up via API and avoid the silence problem with preambles.
How to Build a Voice Agent with Real-Time Translation Using OpenAI's API
GPT Realtime Translate supports 70+ input languages with live speech translation. Learn how to build a multilingual voice agent using OpenAI's new API.
GPT Realtime 2's 'Stay Quiet' Command Is a New Voice AI Primitive — Here's What It Unlocks
You can now tell GPT Realtime 2 to listen silently while you have a side conversation. This single feature changes how voice agents handle real meetings.
GPT Realtime Translate vs Traditional Interpretation: Is 70-Language Live AI Translation Ready for Production?
GPT Realtime Translate handles 70+ languages and maintains speaker pace. Here's how it compares to traditional interpretation pipelines for real use cases.
GPT Realtime Voice Models Explained: GPT Realtime 2, Translate, and Whisper
OpenAI released three new realtime voice models via API. Here's what GPT Realtime 2, Realtime Translate, and Realtime Whisper do and when to use each.
IBM Granite Speech 4.1 Transcribes an Hour of Audio in 2 Seconds: 5 Things That Make It Different
IBM's Granite Speech 4.1 hits 1820x real-time speed and leads the Hugging Face ASR leaderboard at 5.33% WER. Here's what makes the architecture different.
IBM Granite Speech 4.1 vs WhisperX: Should You Switch Your Transcription Pipeline?
Granite Speech 4.1 Plus beats customized WhisperX on word-level timestamps and leads the open ASR leaderboard. Here's when to switch and when to stay.
OpenAI's 3 New Real-Time Voice Models: What Each One Does and How to Access Them via API
OpenAI dropped three real-time voice models at once. Here's what GPT Realtime 2, Translate, and Whisper each do and how to get API access today.
Granite Speech 4.1 2BN Transcribes 1 Hour of Audio in 2 Seconds on H100 — How NLE Makes It Possible
IBM's non-autoregressive model runs at about 1820x real-time speed (an inverse real-time factor of 1820). Here's how the NLE technique achieves that without sacrificing accuracy.
Granite Speech 4.1 vs. WhisperX: Which ASR Model Has Better Word-Level Timestamps?
IBM claims Granite Speech 4.1 Plus beats customized WhisperX on word-level timestamps. Here's what the data actually shows.
IBM Granite Speech 4.1: 3 Models, One Leaderboard Crown, and a 2-Second Hour of Audio
IBM's new ASR suite has three models for three use cases. The fastest transcribes an hour of audio in 2 seconds. Here's what each one does.
IBM Granite Speech 4.1: Three ASR Models and When to Use Each
IBM Granite Speech 4.1 offers three ASR variants for accuracy, speaker diarization, and throughput. Compare them to find the right fit for your workflow.
What Is Non-Auto-Regressive ASR? IBM Granite Speech 4.1 Explained
IBM Granite Speech 4.1's non-autoregressive model transcribes an hour of audio in 2 seconds. Learn how the NLE architecture achieves this speed.