Topic

AI Audio: Voice, Speech & Music

AI for audio: real-time voice agents (Pika Me-style), text-to-speech, voice cloning (ElevenLabs), music generation (Suno, Udio), sound effects, audio editing, and transcription. Anything where the input or output is audio.

May 8, 2026

Granite Speech 4.1 2BN Transcribes 1 Hour of Audio in 2 Seconds on H100 — How NLE Makes It Possible

IBM's non-autoregressive model runs about 1820 times faster than real time (RTFx 1820). Here's how the NLE technique achieves that without sacrificing accuracy.

LLMs & Models, Optimization, Data & Analytics

May 8, 2026

Granite Speech 4.1 vs. WhisperX: Which ASR Model Has Better Word-Level Timestamps?

IBM claims Granite Speech 4.1 Plus beats customized WhisperX on word-level timestamps. Here's what the data actually shows.

LLMs & Models, Comparisons, Data & Analytics

May 8, 2026

IBM Granite Speech 4.1: 3 Models, One Leaderboard Crown, and a 2-Second Hour of Audio

IBM's new ASR suite has three models for three use cases. The fastest transcribes an hour of audio in 2 seconds. Here's what each one does.

LLMs & Models, Workflows, Data & Analytics

May 8, 2026

IBM Granite Speech 4.1: Three ASR Models and When to Use Each

IBM Granite Speech 4.1 offers three ASR variants for accuracy, speaker diarization, and throughput. Compare them to find the right fit for your workflow.

LLMs & Models, Comparisons, Use Cases

May 8, 2026

What Is Non-Autoregressive ASR? IBM Granite Speech 4.1 Explained

IBM Granite Speech 4.1's non-autoregressive model transcribes an hour of audio in 2 seconds. Learn how the NLE architecture achieves this speed.

LLMs & Models, AI Concepts, Workflows

May 8, 2026

How to Add Speaker Diarization and Word-Level Timestamps to Your AI Workflows

Use IBM Granite Speech 4.1 Plus to add speaker attribution and word-level timestamps to transcription workflows. Better than WhisperX for many use cases.

Workflows, Integrations, Use Cases

May 6, 2026

ElevenLabs Voice Agent via API: 4 Components Claude Code Configures Without You Touching the Dashboard

Persona, voice, knowledge base, tools: all four ElevenLabs agent components configured entirely through Claude Code. Here's the full API-first workflow.

Claude, Automation, Integrations

May 6, 2026

How to Build a Voice Agent with ElevenLabs and Cal.com Booking Using Claude Code: 45-Minute Walkthrough

No API docs, no dashboard configuration. Claude Code reads the ElevenLabs docs autonomously and builds a working voice booking agent in under an hour.

Claude, Automation, Integrations

May 6, 2026

xAI Grok Voice API Is Live: 4 New Voice and Video Synthesis Capabilities Released This Week

xAI's voice cloning API is live without an enterprise plan. Plus Lucy 2.1 virtual try-on at $0.02/second. Here's what's new and what it costs.

LLMs & Models, Content Creation, Video Generation

May 6, 2026

xAI Grok Voice Clone vs. Google Voice Model — Which Is More Convincing in 2026?

xAI's clone fooled thousands of listeners at near 50/50. Google's model is 'very instructable.' Here's how the two voice synthesis approaches compare.

LLMs & Models, Comparisons, Content Creation

May 5, 2026

Build a Voice Agent That Books Appointments in Under 1 Hour Using Claude Code and ElevenLabs

No API docs required. Claude Code reads the ElevenLabs docs, configures the agent, adds Cal.com booking tools, and embeds the widget for you.

Claude, Automation, Integrations

May 5, 2026

How to Build a Voice Agent with Claude Code and ElevenLabs in 15 Minutes

Build a fully functional voice agent using Claude Code and ElevenLabs that books calendar appointments and answers questions from your website.

Workflows, Automation, Claude

May 5, 2026

How to Embed an AI Voice Agent Widget on Your Website with ElevenLabs

Add a voice agent to your website in minutes using ElevenLabs' widget embed code and Claude Code. Includes security best practices and cost controls.

Workflows, Integrations, Claude

May 5, 2026

How to Build a Voice Agent That Books Appointments via Cal.com

Connect an ElevenLabs voice agent to Cal.com using Claude Code to automatically check availability and book discovery calls from your website.

Workflows, Automation, Integrations

April 21, 2026

Gemini 3.1 Flash TTS in AI Studio: Hands-On First Look

A hands-on review of Gemini 3.1 Flash TTS in Google AI Studio: voice library, multi-speaker dialogue, and how to try the model free without API setup.

Gemini, LLMs & Models, Use Cases

April 19, 2026

Gemini 3.1 Flash TTS Controllability: Inline Tags Walkthrough

A deep look at Gemini 3.1 Flash TTS's inline tag system: emotion, pacing, emphasis, voice style, and pause markers — with examples for each tag type.

Gemini, LLMs & Models, AI Concepts

April 18, 2026

Gemini 3.1 Flash TTS Review: How It Compares to ElevenLabs

A direct review of Gemini 3.1 Flash TTS against ElevenLabs, OpenAI TTS, and Mistral. See which TTS model wins on cloning, control, and per-call pricing.

Gemini, LLMs & Models, AI Concepts

April 14, 2026

Find New Podcasts on Spotify Using Plain-Language AI Prompts

Use Spotify's AI playlist tool to surface podcasts you'd never browse to. Practical prompt examples and tips for getting better episode recommendations.

AI Concepts, Content Creation, Productivity

April 11, 2026

Inside Spotify's AI Podcast Playlists: AI DJ to Curation

Spotify's AI podcast playlists run on the same stack as AI DJ. Here's a look at the underlying tech and how it interprets prompts as intent, not keywords.

AI Concepts, Content Creation, Use Cases

April 11, 2026

What Is Pika Me? How to Have a Real-Time Video Chat With an AI Agent

Pika Me lets AI agents join Zoom calls with a face and voice. Learn how it works, what it's good for, and how it compares to other avatar tools.

Video Generation, AI Concepts, Use Cases

April 7, 2026

What Is Gemma 4's Audio Encoder? How the E2B and E4B Models Handle Speech Recognition

Gemma 4's edge models have a 50% smaller audio encoder than Gemma 3n, with a 40 ms frame duration for more responsive transcription. Here's how it works.

Gemini, LLMs & Models, AI Concepts

April 7, 2026

What Is Pika Me? How to Have a Real-Time Video Chat With Your AI Agent

Pika Me lets you video call your AI agent with access to your files and calendar. Here's what it can do today and what's still missing.

Multi-Agent, AI Concepts, Use Cases

April 6, 2026

What Is Microsoft MAI Transcribe 1? The Speech Model That Outperforms Whisper and Gemini Flash

MAI Transcribe 1 achieves best-in-class accuracy across 25 languages and beats Whisper, Gemini Flash, and GPT Transcribe on word error rate benchmarks.

LLMs & Models, AI Concepts, Integrations

April 4, 2026

MAI Transcribe 1 vs OpenAI Whisper vs Gemini Flash: Which Speech Model Wins?

Compare Microsoft MAI Transcribe 1, OpenAI Whisper, and Gemini 3.1 Flash on accuracy, noise handling, and multilingual support.

LLMs & Models, Comparisons, GPT & OpenAI