How to Build a Production Voice Agent with GPT Realtime 2 API: Step-by-Step Setup Guide
GPT Realtime 2 supports reasoning and parallel tool calls during voice. Here's how to set it up via API and avoid the silence problem with preambles.
Your Voice Agent Will Go Silent Mid-Thought — Here’s How to Fix It Before It Ships
The silence problem kills voice agents in production. You ask the model to pull CRM data, it fires off a tool call, and then — nothing. Three seconds of dead air while the user wonders if the call dropped. Within 30 minutes of setting up GPT Realtime 2 via the API, you can have a voice agent that narrates its own reasoning, handles interruptions gracefully, and stays in the conversation even when it’s working. The fix has a name: the preamble technique. It’s the single most important pattern to get right before you ship anything with this API.
This post is a setup guide anchored on that pattern. It assumes you’re building something real — a CRM assistant, a scheduling agent, a customer-facing voice interface — and that you’ve already decided GPT Realtime 2 is the right model for it.
What You’re Actually Building
GPT Realtime 2 is OpenAI’s voice agent model with GPT-5 class reasoning and parallel tool calling. It handles interruptions, maintains conversation state, and can be told to listen silently without responding — a genuinely new interaction primitive for voice AI that we’ll cover in the troubleshooting section.
The demo OpenAI published at platform.openai.com/audio/realtime shows the practical shape of this: a user asks the model to check their calendar, the model responds with “You have a meeting with Sable Crust Robotics in 12 minutes,” then gets asked to update the CRM. The model calls the tool, retrieves context — “Sablerest launched warehouse automation this morning. Expansion is active. Security review is the blocker.” — and reads it back. Clean, continuous, no silence.
That smoothness doesn’t happen by default. It requires deliberate prompt engineering around the preamble pattern.
The three new real-time models (GPT Realtime 2, GPT Realtime Translate, and GPT Realtime Whisper) are API-only at launch — not yet available in ChatGPT or the Codex app. You access them through the OpenAI Realtime API, which uses WebSockets rather than the standard REST interface.
What You Need Before Starting
Accounts and access:
- An OpenAI account with API access and billing enabled. The demo at platform.openai.com/audio/realtime uses your API credits, so you'll see costs immediately.
- Node.js 18+ or Python 3.11+ on your machine.
- A microphone and speaker setup you can test locally.
Knowledge prerequisites:
- Familiarity with WebSocket connections. The Realtime API is not a standard HTTP request/response cycle.
- Basic understanding of OpenAI’s function calling / tool use pattern. GPT Realtime 2 supports parallel tool calling, which means multiple tools can fire simultaneously — your tool handlers need to be ready for that.
- Some experience with audio streaming. You’re dealing with PCM audio at 24kHz, 16-bit, mono by default.
Packages:
- For Node.js: the openai SDK version 4.57.0 or later (this is when Realtime API support landed properly), plus ws for WebSocket handling if you're not using the SDK's built-in client.
- For Python: openai>=1.40.0, websockets, and pyaudio for microphone capture.
If you want to skip the audio plumbing and test the model logic first, the playground at platform.openai.com/audio/realtime is the fastest path. It’s connected to your API key and will bill accordingly.
Building the Agent: Step by Step
Step 1: Establish the WebSocket connection
The Realtime API endpoint is wss://api.openai.com/v1/realtime. You authenticate with your API key in the header, not as a query parameter.
const WebSocket = require('ws');
const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-realtime-2', {
headers: {
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
'OpenAI-Beta': 'realtime=v1'
}
});
The OpenAI-Beta: realtime=v1 header is required. Omit it and you’ll get a 400 immediately.
Once connected, you’ll receive a session.created event. That’s your signal that the session is live and you can start configuring it.
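In code, that means waiting for the event before you configure anything. A minimal sketch (configureSession is a hypothetical placeholder for the session.update call in Step 2):

// Listen for server events; the first one to act on is session.created.
ws.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'session.created') {
    console.log('Session live:', event.session.id);
    configureSession(); // placeholder for the session.update in Step 2
  }
});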
Now you have: An open WebSocket connection to a GPT Realtime 2 session.
Step 2: Configure the session with your system prompt and tools
Send a session.update event immediately after session.created. This is where you define the model’s voice, turn detection settings, and — critically — your tools.
ws.send(JSON.stringify({
type: 'session.update',
session: {
modalities: ['text', 'audio'],
instructions: YOUR_SYSTEM_PROMPT,
voice: 'alloy',
input_audio_format: 'pcm16',
output_audio_format: 'pcm16',
input_audio_transcription: {
model: 'whisper-1'
},
turn_detection: {
type: 'server_vad',
threshold: 0.5,
prefix_padding_ms: 300,
silence_duration_ms: 500
},
tools: YOUR_TOOL_DEFINITIONS,
tool_choice: 'auto'
}
}));
The turn_detection block controls how the model decides when you’ve finished speaking. server_vad (voice activity detection) is the right choice for most production scenarios. The silence_duration_ms value of 500ms is a reasonable starting point — go lower for snappier responses, higher if your users tend to pause mid-sentence.
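The other placeholder in that snippet, YOUR_TOOL_DEFINITIONS, is an array of function definitions in OpenAI's function-calling format. Here's a sketch for the calendar/CRM example from the demo; the tool names and parameters are hypothetical, so check the Realtime API reference for the exact field layout:

// Hypothetical tool definitions for the calendar/CRM example.
const YOUR_TOOL_DEFINITIONS = [
  {
    type: 'function',
    name: 'check_calendar',
    description: "Return the user's upcoming meetings for a given date.",
    parameters: {
      type: 'object',
      properties: {
        date: { type: 'string', description: 'ISO 8601 date' }
      },
      required: ['date']
    }
  },
  {
    type: 'function',
    name: 'update_crm',
    description: 'Append a note to a CRM account record.',
    parameters: {
      type: 'object',
      properties: {
        account: { type: 'string' },
        note: { type: 'string' }
      },
      required: ['account', 'note']
    }
  }
];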
Now you have: A configured session that knows about your tools and is listening for audio input.
Step 3: Write the preamble into your system prompt
This is the step most tutorials skip, and it’s the one that matters most.
When GPT Realtime 2 calls a tool, there’s a gap between the tool call firing and the result coming back. During that gap, the model is silent by default. For a voice agent, silence is death — users assume the call dropped, they start talking, they interrupt the tool call.
The preamble technique is the fix. You instruct the model to narrate what it’s doing before and during tool calls. Here’s the relevant section of a system prompt that handles this:
When you are about to call a tool, always speak a brief acknowledgment first.
For example: "Let me pull that up for you" or "Checking your calendar now" or
"I'll update the CRM with that."
While waiting for tool results, you may say something like "Just a moment..."
or "Getting that information..." to keep the conversation active.
When tool results arrive, read the relevant parts back naturally rather than
reciting raw data.
The demo from OpenAI’s own presentation made this explicit: “Actions can take a few seconds, so it’s very important for the model to acknowledge those. With GPT Realtime 2, you can communicate directly during the reasoning and the tool calling so the user stays informed.”
This isn’t just UX polish. Without preamble, users interrupt tool calls because they think the model has stalled. That creates a cascade of partial results and confused state. The preamble keeps the conversation coherent.
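In code, nothing special is happening: the preamble rules are just part of the instructions string you pass in session.update. A sketch of how YOUR_SYSTEM_PROMPT might incorporate them (the role line is a hypothetical example):

// The preamble rules live in the same instructions string as the rest of
// the system prompt; there is no separate API surface for them.
const YOUR_SYSTEM_PROMPT = `
You are a voice assistant for a CRM and calendar workflow.

When you are about to call a tool, always speak a brief acknowledgment first,
for example "Let me pull that up for you" or "Checking your calendar now."
While waiting for tool results, keep the conversation active with something
like "Just a moment..." Keep acknowledgments to one sentence.
When tool results arrive, read the relevant parts back naturally rather than
reciting raw data.
`;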
Now you have: A system prompt that prevents the silence problem before it starts.
Step 4: Handle parallel tool calls correctly
GPT Realtime 2 can fire multiple tool calls simultaneously. If your system prompt asks it to “check my calendar and update the CRM,” it may call both tools at once rather than sequentially.
Your event handler needs to track tool calls by their call_id and respond to each one individually:
const pendingToolCalls = {};
ws.on('message', (data) => {
const event = JSON.parse(data);
if (event.type === 'response.function_call_arguments.done') {
const { call_id, name, arguments: args } = event;
pendingToolCalls[call_id] = { name, args: JSON.parse(args) };
// Execute the tool
executeToolCall(name, JSON.parse(args)).then(result => {
ws.send(JSON.stringify({
type: 'conversation.item.create',
item: {
type: 'function_call_output',
call_id: call_id,
output: JSON.stringify(result)
}
}));
// Trigger the model to continue
ws.send(JSON.stringify({ type: 'response.create' }));
});
}
});
The response.create event after submitting tool output is what tells the model to continue speaking. Omit it and the model waits indefinitely.
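executeToolCall in the snippet above is yours to supply: a dispatcher that routes each call to the right handler by name. A minimal sketch, where checkCalendar and updateCrm are hypothetical stand-ins for your real integrations:

// Hypothetical dispatcher: routes a tool call to the matching handler.
async function executeToolCall(name, args) {
  switch (name) {
    case 'check_calendar':
      return checkCalendar(args.date);
    case 'update_crm':
      return updateCrm(args.account, args.note);
    default:
      // Return an error payload the model can read back to the user.
      return { error: `Unknown tool: ${name}` };
  }
}

Because tool calls can fire in parallel, keep each handler independent; don't assume shared state or a fixed ordering between them.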
Now you have: Parallel tool call handling that won’t deadlock when multiple tools fire at once.
Step 5: Stream audio in and out
For local testing, you can use node-record-lpcm16 to capture microphone input and the speaker package to play audio output. For production, you'll be routing audio through your telephony stack (Twilio, Vonage, etc.) or a WebRTC layer.
The key constraint: audio must be PCM16 at 24kHz, mono. Most telephony systems use 8kHz µ-law. You’ll need to resample. ffmpeg handles this reliably:
ffmpeg -i input.wav -ar 24000 -ac 1 -f s16le output.pcm
For real-time resampling in Node.js, node-libsamplerate works, though it adds a few milliseconds of latency. Budget for that in your turn detection thresholds.
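With the format sorted, streaming to the model means base64-encoding PCM chunks and sending them as input_audio_buffer.append events; audio comes back as response.audio.delta events. A minimal sketch, where playPcm is a hypothetical placeholder for your speaker or telephony output:

// Send a chunk of 24kHz, 16-bit mono PCM to the model.
function sendAudioChunk(pcmBuffer) {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: pcmBuffer.toString('base64')
  }));
}

// Play audio as it streams back; decode each base64 delta before playback.
ws.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'response.audio.delta') {
    playPcm(Buffer.from(event.delta, 'base64'));
  }
});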
Now you have: A complete audio pipeline from microphone to model and back.
Step 6: Test the preamble in the playground first
Before wiring up your full audio stack, test your system prompt at platform.openai.com/audio/realtime. This is the fastest feedback loop. You can hear exactly how the model narrates tool calls, whether the preamble sounds natural, and whether the voice detection thresholds feel right.
The playground uses your API credits, so keep sessions short during iteration. A 2-minute test session with a few tool calls will cost a few cents at current pricing.
Now you have: A validated system prompt you can carry into your production implementation.
The Failure Modes You’ll Actually Hit
The model stops mid-sentence when you interrupt. This is expected behavior — GPT Realtime 2 handles interruptions by design. The issue is usually that your silence_duration_ms is too low, causing the model to treat your breathing or background noise as an interruption. Raise it to 700-800ms and see if that stabilizes things.
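You can apply that change without reconnecting; send another session.update over the live socket. A sketch, assuming the other turn_detection values from Step 2 stay the same:

// Raise silence_duration_ms on a live session when background noise is
// being treated as an interruption.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    turn_detection: {
      type: 'server_vad',
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 800
    }
  }
}));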
Tool calls fire but the model never reads the results. You forgot the response.create event after submitting tool output. This is the most common mistake. The model doesn’t automatically continue — you have to tell it to.
Parallel tool calls return out of order. Your tool handlers are async and one resolved faster than the other. Track by call_id as shown in Step 4. Don’t assume tool results arrive in the order the calls were made.
The model is too chatty during tool calls. Your preamble instruction is too open-ended. Constrain it: “Keep acknowledgments to one sentence. Don’t speculate about what the tool will return.” The model will otherwise narrate extensively, which feels unnatural.
Audio quality degrades after a few minutes. This is usually a buffer management issue in your audio pipeline, not a model problem. Check that you’re flushing your PCM buffer on each response.audio.delta event rather than accumulating.
The “stay quiet” command doesn’t work as expected. GPT Realtime 2 supports being told to listen silently — “stay quiet until I say back to demo” — and re-engaging on command. This works, but the model needs explicit instruction in the system prompt that this is an acceptable state. Add something like: “If the user asks you to stay quiet or listen silently, do so without speaking until they explicitly invite you back into the conversation.” Without this, the model may interpret silence requests as conversational and respond to them.
If you’re building agents that need to stay running continuously while you work on other things, the patterns in how to keep your Claude Code agent running 24/7 apply here too — session management and reconnection logic matter for long-running voice sessions. Similarly, if you’re evaluating which underlying model to pair with your voice layer, GPT-5.4 vs Claude Opus 4.6 covers the tradeoffs in reasoning quality and latency that directly affect voice agent responsiveness.
Where to Take This Further
Add GPT Realtime Whisper for transcription logging. GPT Realtime Whisper is the third model in this release — streaming speech-to-text that transcribes as the speaker talks. Running it alongside GPT Realtime 2 gives you a transcript of every session, which is essential for debugging and compliance. They’re separate API calls, but you can pipe the same audio stream to both.
Implement GPT Realtime Translate for multilingual support. GPT Realtime Translate handles live translation from 70+ input languages to 13 output languages while maintaining speaker pace. The interesting implementation detail: it waits for the verb (the keyword that completes the sentence’s meaning) before starting translation, which produces near-simultaneous interpretation that feels like natural dialogue rather than lagged dubbing. If your voice agent serves a multilingual user base, this is worth evaluating as a layer on top of your GPT Realtime 2 session.
Build a memory layer. Voice agents that don’t remember context across sessions feel broken. The patterns for building a self-evolving memory system with Obsidian and hooks translate directly to voice agent memory — capture session transcripts, extract key facts, inject them into the next session’s system prompt.
Connect to your actual business tools. The CRM demo from OpenAI’s presentation — where the model retrieved “Sablerest launched warehouse automation this morning. Expansion is active. Security review is the blocker.” — is only useful if your tool definitions actually connect to your CRM. MindStudio offers a practical path here: it’s an enterprise AI platform with 200+ models, 1,000+ pre-built integrations, and a visual builder for orchestrating agents and workflows, which means you can wire up CRM, calendar, and ticketing tools without writing the integration layer yourself.
Consider what happens when the voice agent needs a full application around it. A voice interface is rarely the whole product — you usually need a dashboard, session history, user management, and a backend to store tool results. Remy takes a different approach to that problem: you write a spec in annotated markdown and it compiles the full-stack application from it — TypeScript backend, database, auth, and deployment included. The spec is the source of truth; the code is derived output. Worth knowing about when the voice agent prototype needs to become a product.
Tune your turn detection for your specific use case. The defaults work for general conversation. If your users are domain experts who use long technical terms or pause frequently while thinking, you’ll want to adjust silence_duration_ms upward and possibly lower the VAD threshold to be more sensitive to quieter speech. There’s no substitute for testing with real users in real conditions. For teams building more complex automation on top of voice — where the agent needs to take browser-based actions in response to what it hears — browser automation with Playwright is worth reading alongside this guide.
Sam Altman’s framing for why this matters: “People are really starting to use voice to interact with AI, especially when they have a lot of context to dump.” The preamble technique is what makes that context dump feel like a conversation rather than a form submission. Get that right and the rest of the implementation is plumbing.
The playground at platform.openai.com/audio/realtime is the fastest way to validate your system prompt before you write a line of WebSocket code. Start there.