AI Audio: Voice, Speech & Music
AI for audio — real-time voice agents (Pika Me-style), text-to-speech, voice cloning (ElevenLabs), music generation (Suno, Udio), sound effects, audio editing, transcription. Anything where the output or input is audio.
How to Add Speaker Diarization and Word-Level Timestamps to Your AI Workflows
Use IBM Granite Speech 4.1 Plus to add speaker attribution and word-level timestamps to transcription workflows. Better than WhisperX for many use cases.
ElevenLabs Voice Agent via API: 4 Components Claude Code Configures Without You Touching the Dashboard
Persona, voice, knowledge base, tools: all four ElevenLabs agent components configured entirely through Claude Code. Here's the full API-first workflow.
How to Build a Voice Agent with ElevenLabs and Cal.com Booking Using Claude Code: 45-Minute Walkthrough
No reading API docs yourself, no dashboard configuration. Claude Code reads the ElevenLabs docs autonomously and builds a working voice booking agent in under an hour.
xAI Grok Voice API Is Live: 4 New Voice and Video Synthesis Capabilities Released This Week
xAI's voice cloning API is live without an enterprise plan. Plus Lucy 2.1 virtual try-on at $0.02/second. Here's what's new and what it costs.
xAI Grok Voice Clone vs. Google Voice Model — Which Is More Convincing in 2026?
Thousands of listeners could tell xAI's clone from the real voice only about half the time. Google's model is 'very instructable.' Here's how the two voice synthesis approaches compare.
Build a Voice Agent That Books Appointments in Under 1 Hour Using Claude Code and ElevenLabs
No API docs required. Claude Code reads the ElevenLabs docs, configures the agent, adds Cal.com booking tools, and embeds the widget for you.
How to Build a Voice Agent with Claude Code and ElevenLabs in 15 Minutes
Build a fully functional voice agent using Claude Code and ElevenLabs that books calendar appointments and answers questions from your website.
How to Embed an AI Voice Agent Widget on Your Website with ElevenLabs
Add a voice agent to your website in minutes using ElevenLabs' widget embed code and Claude Code. Includes security best practices and cost controls.
How to Build a Voice Agent That Books Appointments via Cal.com
Connect an ElevenLabs voice agent to Cal.com using Claude Code to automatically check availability and book discovery calls from your website.
Gemini 3.1 Flash TTS in AI Studio: Hands-On First Look
A hands-on review of Gemini 3.1 Flash TTS in Google AI Studio: voice library, multi-speaker dialogue, and how to try the model free without API setup.
Gemini 3.1 Flash TTS Controllability: Inline Tags Walkthrough
A deep look at Gemini 3.1 Flash TTS's inline tag system: emotion, pacing, emphasis, voice style, and pause markers — with examples for each tag type.
Gemini 3.1 Flash TTS Review: How It Compares to ElevenLabs
A head-to-head review of Gemini 3.1 Flash TTS against ElevenLabs, OpenAI TTS, and Mistral. See which TTS model wins on cloning, control, and per-call pricing.
Find New Podcasts on Spotify Using Plain-Language AI Prompts
Use Spotify's AI playlist tool to surface podcasts you'd never browse to. Practical prompt examples and tips for getting better episode recommendations.
Inside Spotify's AI Podcast Playlists: AI DJ to Curation
Spotify's AI podcast playlists run on the same stack as AI DJ. Here's a look at the underlying tech and how it interprets prompts as intent, not keywords.
What Is Pika Me? How to Have a Real-Time Video Chat With an AI Agent
Pika Me lets AI agents join Zoom calls with a face and voice. Learn how it works, what it's good for, and how it compares to other avatar tools.
What Is Gemma 4's Audio Encoder? How the E2B and E4B Models Handle Speech Recognition
Gemma 4's edge models have a 50% smaller audio encoder than Gemma 3n, with 40ms frame duration for more responsive transcription. Here's how it works.
What Is Pika Me? How to Have a Real-Time Video Chat With Your AI Agent
Pika Me lets you video call your AI agent with access to your files and calendar. Here's what it can do today and what's still missing.
What Is Microsoft MAI Transcribe 1? The Speech Model That Outperforms Whisper and Gemini Flash
MAI Transcribe 1 achieves best-in-class accuracy across 25 languages and beats Whisper, Gemini Flash, and GPT Transcribe on word error rate benchmarks.
MAI Transcribe 1 vs OpenAI Whisper vs Gemini Flash: Which Speech Model Wins?
Compare Microsoft MAI Transcribe 1, OpenAI Whisper, and Gemini 3.1 Flash on accuracy, noise handling, and multilingual support.
What Is Microsoft MAI Transcribe 1? The Speech Model That Beats Whisper and Gemini
MAI Transcribe 1 is Microsoft's new speech recognition model that outperforms Whisper, Gemini Flash, and Scribe V2 across 25 languages.
Suno 5.5 vs Google Lyria 3 vs Sonauto V3: Which AI Music Generator Wins?
Suno 5.5, Google Lyria 3, and Sonauto V3 all compete for the best AI music generator title. Here's a head-to-head comparison across quality, flow, and features.
What Is Suno 5.5? Voice Cloning, Studio Features, and How It Compares to V5
Suno 5.5 adds voice cloning, a studio mode for stem editing, and custom model fine-tuning. Here's what changed from V5 and whether the upgrade is worth it.
How to Build a Voice Agent with Gemini 3.1 Flash Live and Claude Code
Learn how to deploy Gemini 3.1 Flash Live on a website or behind a phone number, using Claude Code to handle the API docs, WebSockets, and function calling setup.
Gemini 3.1 Flash Live vs ElevenLabs: Which Is Better for Voice Agent Deployment?
Compare Gemini 3.1 Flash Live and ElevenLabs for building production voice agents. Key differences in deployment complexity, cost, and latency.