
How to Build a Voice Agent with Gemini 3.1 Flash Live and Claude Code

Learn how to embed Gemini 3.1 Flash Live into a website or phone number using Claude Code to handle API docs, WebSockets, and function calling setup.

MindStudio Team

Why Building Voice Agents Just Got Much Easier

Building a real-time voice agent has historically required three separate services: a speech-to-text model, a language model, and a text-to-speech engine. Each adds latency, cost, and another failure point to manage.

Gemini Flash Live changes that equation. It handles all three in a single WebSocket session — audio in, audio out, with natural conversation flow and function calling built in. Pair it with Claude Code, Anthropic’s terminal-based agentic coding assistant, and you can go from zero to a working voice agent — embedded in a website or connected to a phone number — in a single focused day.

This guide walks through the full build: configuring the Gemini Live API, writing the WebSocket server, setting up function calling, building the browser frontend, and wiring up a Twilio phone number. Each step uses Claude Code to handle the heavy lifting.

What Gemini Flash Live Actually Does

Gemini Flash Live is Google’s real-time multimodal streaming model. Unlike standard LLM API calls where you send a complete request and wait for a complete response, Flash Live maintains a persistent WebSocket connection that streams audio in both directions simultaneously.

The model receives audio from a user’s microphone, processes it in real time, and streams spoken audio back — with low enough latency to feel like a natural conversation.

Key capabilities:

  • Bidirectional streaming — Audio flows in and out over a single WebSocket session
  • Interruption handling — Users can speak while the model is responding; Flash Live detects this and stops mid-sentence
  • Function calling — The model triggers structured tool calls mid-conversation, enabling it to look up data, book appointments, or call external APIs
  • Context persistence — The session maintains conversation history automatically

In practice, this means you’re not managing three separate APIs and synchronizing their outputs. One connection handles the entire voice interaction pipeline. The Gemini Live API documentation covers the full protocol, but this guide translates the relevant parts into a working implementation.

Why Claude Code Speeds This Up

Claude Code is Anthropic’s agentic coding assistant that runs in your terminal. It reads files, writes code, executes commands, and processes documentation — in a continuous work loop without constant hand-holding.

For a Gemini Flash Live voice agent, it’s valuable for three specific reasons.

Reading and applying API docs. The Gemini Live API has a multi-step setup: WebSocket handshake, session configuration, audio format specification, tool registration. Claude Code reads the API reference and translates it directly into working code.

Eliminating boilerplate. WebSocket event handlers, PCM16 audio encoding, stream chunking, session lifecycle management — tedious to write manually, and easy to get wrong. Claude Code produces this correctly on the first or second pass.

Debugging efficiently. When you hit a cryptic WebSocket close code or an audio format error, Claude Code reads your error logs, compares them against the spec, and identifies the fix. No forum searches required.

This combination cuts a multi-day infrastructure project down to a few focused hours.

Prerequisites

Have these ready before writing any code:

  • Google AI Studio account — Generate an API key at ai.google.dev. Gemini Flash Live requires an active paid tier or credits.
  • Node.js 18+ — This guide uses Node.js. The approach translates to Python with minimal changes.
  • Claude Code — Install via npm: npm install -g @anthropic-ai/claude-code. Requires an Anthropic API key.
  • Basic terminal familiarity — You’ll run commands, read logs, and review generated code.
  • HTTPS or localhost — Browsers require HTTPS (or localhost) for microphone access.

Optional but useful:

  • Twilio account — For connecting the agent to a phone number (covered in Step 5).
  • ngrok — For exposing your local server during development without a full deployment.

Step 1: Scaffold the Project with Claude Code

Create a new directory and start Claude Code inside it:

mkdir gemini-voice-agent
cd gemini-voice-agent
claude

Give Claude Code a specific, detailed prompt. Vague prompts produce vague code:

“Create a Node.js project that connects to the Gemini Live API using WebSockets. The backend should accept WebSocket connections from a browser client, forward audio to Gemini Live in PCM16 format at 16kHz, and stream Gemini’s audio responses back to the client. Use the @google/genai package. Create a basic HTML frontend that captures microphone input using the Web Audio API and plays back audio responses.”

Claude Code will initialize a package.json, install dependencies (ws, @google/genai, express), create a server.js, and generate an index.html with a basic UI.

Review the generated files before running anything. Check that the model name, endpoint URL, and audio format parameters match the current Gemini Flash Live spec. If something looks off, tell Claude Code exactly what needs correcting and it will revise the relevant section.

Step 2: Configure the Gemini Live Session

The backend connection needs three things configured correctly: the model, the audio format, and the system instructions.

Here’s the structure of a basic session initialization:

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const session = await ai.live.connect({
  model: 'gemini-2.5-flash-preview-native-audio-dialog',
  config: {
    responseModalities: ['AUDIO'],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: 'Aoede' }
      }
    },
    systemInstruction: {
      parts: [{ 
        text: 'You are a helpful customer support agent. Be concise and professional.'
      }]
    }
  }
});

Ask Claude Code to add your system instruction and tune the turn detection settings:

“Add a system instruction that defines this as a customer support agent for a software company. Set input audio to PCM16 at 16kHz mono, enable automatic turn detection with a 500ms silence threshold, and log the session ID when the connection opens.”

The turn detection configuration (voice activity detection) is worth getting right early. Set the silence threshold too low and the model interrupts itself; set it too high and the conversation feels sluggish. 500ms is a reasonable starting point for most use cases.
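The turn detection settings live alongside the rest of the session config. A sketch of that portion, using the field names from the Gemini Live API's `realtimeInputConfig` at the time of writing (verify them against the current API reference, as preview APIs shift):

```javascript
// Turn detection portion of the Live session config. Field names follow
// the Gemini Live API's realtimeInputConfig; confirm against the current
// docs before relying on them.
const config = {
  responseModalities: ['AUDio'.toUpperCase()], // 'AUDIO'
  realtimeInputConfig: {
    automaticActivityDetection: {
      // How long the user must stay silent before their turn ends.
      // Too low: the model talks over natural pauses.
      // Too high: replies feel sluggish.
      silenceDurationMs: 500,
      // Audio retained from just before speech onset, so the first
      // syllable of an utterance isn't clipped.
      prefixPaddingMs: 100
    }
  }
};
```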

Step 3: Set Up Function Calling

Function calling is what separates a voice agent from a voice demo. Without tools, the model can only talk. With tools, it can act.

Define function declarations during session initialization:

const tools = [
  {
    functionDeclarations: [
      {
        name: 'lookup_order_status',
        description: 'Look up the current status of a customer order. Use this when a customer asks about their order.',
        parameters: {
          type: 'OBJECT',
          properties: {
            order_id: {
              type: 'STRING',
              description: 'The order ID, typically 6–8 digits'
            }
          },
          required: ['order_id']
        }
      },
      {
        name: 'create_support_ticket',
        description: 'Create a new support ticket when a customer reports an unresolved issue',
        parameters: {
          type: 'OBJECT',
          properties: {
            subject: { type: 'STRING' },
            description: { type: 'STRING' },
            priority: { type: 'STRING', enum: ['low', 'medium', 'high'] }
          },
          required: ['subject', 'description']
        }
      }
    ]
  }
];

The description field is critical. Gemini uses it to decide when to call a function. “Look up order status” is weaker than “Look up the current status of a customer order. Use this when a customer asks about their order.” Be specific.

Tell Claude Code to implement the full function call response loop:

“Add the function call handler to the session. When Gemini emits a toolCall event, check the function name, call the appropriate handler, and send back a toolResponse with the result.”

The loop works like this:

  1. Gemini sends a toolCall event with the function name and arguments
  2. Your server executes the actual function (API call, database query, etc.)
  3. Your server sends a toolResponse back with the result
  4. Gemini continues the conversation using the returned data

Get this loop right and your agent can handle real tasks mid-conversation without interrupting the audio flow.
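The server-side dispatch (steps 2 and 3) can be sketched as a plain function, which keeps it easy to test outside a live session. The message shape here — `toolCall.functionCalls` entries carrying `id`, `name`, and `args` — follows the Gemini Live API; the `handlers` map is your own application code:

```javascript
// Dispatch a toolCall message from Gemini to local handler functions and
// build the toolResponse payload. Unknown function names return an error
// object so the model can recover gracefully mid-conversation.
async function handleToolCall(toolCall, handlers) {
  const functionResponses = [];
  for (const call of toolCall.functionCalls) {
    const handler = handlers[call.name];
    const result = handler
      ? await handler(call.args)
      : { error: `Unknown function: ${call.name}` };
    // Echo the call id back so Gemini can match response to request.
    functionResponses.push({ id: call.id, name: call.name, response: result });
  }
  return { functionResponses };
}
```

A real server would pass the returned object to the session's tool-response method (`sendToolResponse` in current `@google/genai` builds — confirm the exact name against your installed SDK version).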

Step 4: Build the Browser Frontend

The frontend has three jobs: capture microphone audio, stream it to your server, and play back Gemini’s responses.

Claude Code will generate the core structure. The capture side looks like this:

// Runs in an async context — getUserMedia returns a promise
const ws = new WebSocket('wss://yourserver.com/voice');

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
// ScriptProcessorNode is deprecated but still widely supported;
// AudioWorklet is the modern replacement if you need processing off
// the main thread.
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (e) => {
  const inputData = e.inputBuffer.getChannelData(0); // 32-bit floats
  const pcm16 = convertFloat32ToPCM16(inputData);    // 16-bit integers
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(pcm16);
  }
};

source.connect(processor);
processor.connect(audioContext.destination);

The convertFloat32ToPCM16 conversion is essential — browsers capture audio as 32-bit floats, but Gemini expects 16-bit integers. Claude Code will generate this correctly; it’s a standard utility function.
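The snippet above calls convertFloat32ToPCM16 without defining it. A standard implementation clamps each sample, then scales it into the signed 16-bit range:

```javascript
// Convert Web Audio float samples (range -1.0 to 1.0) to 16-bit signed
// integers, the input format Gemini Live expects.
function convertFloat32ToPCM16(float32Array) {
  const pcm16 = new Int16Array(float32Array.length);
  for (let i = 0; i < float32Array.length; i++) {
    // Clamp first: capture glitches can produce samples outside -1..1.
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    // The negative range reaches -32768; the positive tops out at 32767.
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm16.buffer; // ArrayBuffer, ready for ws.send()
}
```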

For playback, avoid using a basic <audio> element with streaming chunks. The result is choppy. Ask Claude Code to implement an AudioContext-based playback queue instead — it handles buffer scheduling and produces smooth, continuous audio output.
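One way to structure that queue, as a sketch: assume Gemini's output arrives as decoded float samples (24kHz is common for the native-audio voices, but check the session metadata for the actual rate). Injecting the AudioContext also makes the scheduling logic testable outside a browser:

```javascript
// Schedules PCM chunks back-to-back on an AudioContext so streamed audio
// plays continuously instead of chopping at chunk boundaries.
class PlaybackQueue {
  constructor(audioContext, sampleRate = 24000) {
    this.ctx = audioContext;
    this.sampleRate = sampleRate;
    this.nextStartTime = 0; // when the next chunk should begin
  }

  // float32Samples: decoded audio for one streamed chunk.
  // Returns the scheduled start time (seconds on the context clock).
  enqueue(float32Samples) {
    const buffer = this.ctx.createBuffer(1, float32Samples.length, this.sampleRate);
    buffer.copyToChannel(float32Samples, 0);
    const source = this.ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(this.ctx.destination);
    // Start at the later of "now" and the end of the previous chunk,
    // so chunks neither overlap nor leave gaps.
    const startAt = Math.max(this.ctx.currentTime, this.nextStartTime);
    source.start(startAt);
    this.nextStartTime = startAt + buffer.duration;
    return startAt;
  }
}
```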

To make the frontend embeddable anywhere, wrap the logic in a JavaScript class:

<script src="/voice-agent.js"></script>
<script>
  const agent = new VoiceAgent({ 
    serverUrl: 'wss://yourserver.com/voice',
    triggerElement: '#start-button'
  });
  agent.init();
</script>

Add the script tag to any page, include a button with the matching ID, and the agent is live. Browsers require a user gesture before granting microphone access, so always trigger initialization from a click event — not automatically on page load.

Step 5: Connect to a Phone Number

For phone access, Twilio Programmable Voice streams raw phone call audio to a WebSocket endpoint — the same server you’ve already built.

The architecture:

  1. A call comes in to your Twilio number
  2. Twilio fetches a TwiML response instructing it to open a media stream to your WebSocket endpoint
  3. Your server receives mulaw audio at 8kHz and converts it to PCM16 at 16kHz before forwarding to Gemini
  4. Gemini’s PCM16 response is converted back to mulaw at 8kHz and sent to Twilio
  5. Twilio plays it over the live call
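The decode half of step 3 is standard G.711 mu-law expansion plus a sample-rate doubling. A minimal sketch — the upsampler here naively duplicates samples, which is fine for a prototype; a production server would use proper interpolation or a resampling library:

```javascript
// Expand one G.711 mu-law byte into a 16-bit linear PCM sample.
function muLawDecode(byte) {
  const u = ~byte & 0xff;           // mu-law bytes are stored bit-inverted
  const exponent = (u & 0x70) >> 4; // 3-bit segment number
  const mantissa = u & 0x0f;        // 4-bit step within the segment
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return (u & 0x80) ? -magnitude : magnitude;
}

// Decode an 8kHz mu-law buffer and upsample to 16kHz by duplicating each
// sample (naive; swap in real interpolation for production audio quality).
function muLawToPCM16_16k(mulawBytes) {
  const pcm = new Int16Array(mulawBytes.length * 2);
  for (let i = 0; i < mulawBytes.length; i++) {
    const sample = muLawDecode(mulawBytes[i]);
    pcm[2 * i] = sample;
    pcm[2 * i + 1] = sample;
  }
  return pcm;
}
```

The return trip (PCM16 16kHz back to mulaw 8kHz for Twilio) is the mirror image: downsample, then compress with the inverse mu-law formula.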

Add a TwiML webhook route:

app.post('/twiml', (req, res) => {
  res.type('text/xml');
  res.send(`
    <Response>
      <Connect>
        <Stream url="wss://yourserver.com/twilio-voice"/>
      </Connect>
    </Response>
  `);
});

Then give Claude Code precise instructions for the audio conversion:

“Add a Twilio media stream handler for the /twilio-voice WebSocket path. Convert incoming mulaw 8kHz audio to PCM16 16kHz before sending to Gemini. Convert Gemini’s PCM16 16kHz response back to mulaw 8kHz before sending to Twilio.”

Specify the sample rates and codec names explicitly — mulaw is different from MP3 or WAV, and getting the direction of conversion wrong is a common source of unintelligible audio. Once deployed, point your Twilio number’s webhook URL at the /twiml route and calls route through the agent automatically.

Step 6: Handle Common Errors

Testing will surface a predictable set of issues. Here are the ones that come up most often.

WebSocket closes immediately. Usually authentication. Verify the Gemini API key is set correctly, the Live API is enabled in your Google Cloud project, and the model name in your code matches the current API version exactly.

Audio sounds garbled or choppy. Almost always a format mismatch. Confirm input is PCM16 at exactly 16kHz, mono channel. Check you’re not double-encoding — sending base64 when the API expects binary, or vice versa.

Function calls never trigger. Tighten the tool descriptions. Gemini relies on the description field to decide when to invoke a function. Also confirm you’re listening for the correct event type name, which varies between SDK versions.

High latency on phone calls. Run the mulaw conversion in a worker thread using Node’s worker_threads module. Audio conversion on the main event loop blocks other operations and adds perceptible delay. Ask Claude Code to refactor this specifically.

Model cuts off mid-sentence. Voice activity detection is too sensitive. Increase the silence threshold in the session config to give the model more time to finish responses before the turn ends.

Where MindStudio Fits

Building this stack from scratch gives you full control — custom audio routing, precise function call logic, and a frontend that matches your exact requirements. If that level of control is what you need, this guide gets you there.

But if your goal is a working voice agent connected to real business tools — CRMs, ticketing systems, calendars, databases — without spending time on WebSocket servers and audio format conversion, MindStudio offers a different path. MindStudio’s visual workflow builder supports Gemini models natively and connects them to 1,000+ pre-built integrations without touching any audio infrastructure.

For teams building AI agents that need to act across multiple business systems, not just talk, MindStudio handles the orchestration layer so you can focus on what the agent actually does.

For developers who want the best of both approaches, the MindStudio Agent Skills Plugin (@mindstudio-ai/agent npm SDK) lets Claude Code or any other agentic system call MindStudio’s capabilities as typed method calls. You build the custom voice frontend from this guide, handle the Gemini Live audio layer yourself, and offload complex tool execution — like agent.runWorkflow() for multi-step CRM processes — to MindStudio. The plugin handles rate limiting, retries, and auth automatically.

You can try MindStudio free at mindstudio.ai.

Frequently Asked Questions

Does Gemini Flash Live support real-time two-way audio?

Yes. Gemini Flash Live maintains a persistent WebSocket connection that streams audio in both directions simultaneously. The model receives your audio input and returns spoken audio output in real time, with latency low enough for natural conversation. You don’t need a separate speech-to-text or text-to-speech service — the model handles both ends.

What audio format does the Gemini Live API require?

Input audio must be PCM16 (16-bit linear PCM) at 16kHz, mono channel. Browser audio captured via the Web Audio API is 32-bit float by default, so a conversion step is required before sending. Output audio from Gemini is also PCM16, though sample rate may vary by voice configuration. Always check the session response metadata to confirm output format before building your playback pipeline.

Can I use function calling during a Gemini Live voice conversation?

Yes. Function calling is fully supported in Gemini Live sessions. You register tool schemas — name, description, and parameter definitions — when initializing the session. During a conversation, the model emits toolCall events when it decides a function should run. Your server executes the function, returns results via toolResponse, and the model continues the conversation using that data. This is what allows voice agents to look up orders, create tickets, or check availability without breaking the audio flow.

How do I embed a Gemini voice agent in a website?

Wrap your client-side code in a JavaScript class that initializes on demand, then include it via a script tag. Attach the initialization to a button click — browsers require a user gesture before granting microphone access. HTTPS is required in production for getUserMedia to function. The script tag plus button approach works on any web page without requiring a framework.

How do I connect a Gemini voice agent to a phone number?

Use Twilio Programmable Voice. Twilio streams mulaw audio from a live call to a WebSocket endpoint on your server. The main challenge is audio format conversion: phone calls use mulaw at 8kHz, while Gemini expects PCM16 at 16kHz. Your server handles resampling and codec conversion in both directions. A TwiML <Connect><Stream> directive tells Twilio where to route the audio stream.

How does Claude Code specifically help with building voice agents?

Claude Code’s main advantage here is reading API documentation and translating it into correct implementation code without hand-holding. The Gemini Live API has enough moving parts — session lifecycle, audio format requirements, function call event handling, WebSocket error codes — that writing everything from scratch takes significant time. Claude Code handles the boilerplate accurately, debugs format and connection errors based on actual log output, and iterates quickly when requirements change. It’s most useful for initial scaffolding and for diagnosing issues during testing.

Key Takeaways

  • Gemini Flash Live handles speech recognition, language model reasoning, and speech synthesis in a single WebSocket session — no need to orchestrate three separate APIs.
  • Claude Code accelerates the build by reading API documentation and generating working WebSocket, audio pipeline, and function calling code with minimal manual effort.
  • Function calling is the difference between a conversational demo and a real voice agent — clear tool descriptions and a correct response loop are the two things that matter most.
  • Phone integration via Twilio requires mulaw-to-PCM16 audio conversion in both directions; this is the most common friction point when extending beyond the browser.
  • For teams who want a voice-capable agent connected to business tools without managing audio infrastructure, MindStudio provides Gemini model support and 1,000+ integrations out of the box.

The infrastructure side of this build is a one-day project for an experienced developer. The real value is in what your agent does once it’s running. Start with a focused use case, get the function calls working, and iterate from there. If you’d rather skip the infrastructure entirely and focus on agent behavior, MindStudio is worth exploring as a starting point.
