
How to Build a Voice Agent with OpenAI's Realtime API: Step-by-Step Setup Guide

OpenAI's Realtime API now supports reasoning, tool calls, and interruption handling. Here's how to set up your first voice agent from scratch.

MindStudio Team

Your First Voice Agent Is About 30 Minutes Away

You’ve probably watched the demo. The one at platform.openai.com/audio/realtime where the agent reads a calendar, updates a CRM, and — this is the part that actually matters — stays completely silent while two humans have a side conversation, then re-engages the moment it hears “back to demo.” If you’ve ever tried to demo a voice AI in a real meeting and spent half the time muting your microphone so the model doesn’t interrupt, you understand immediately why that capability changes things.

This post is a setup guide. By the end of it, you’ll have a working voice agent running against GPT Realtime 2, connected to the OpenAI Realtime API, with a clear picture of what to build next. The whole setup — account, keys, first session — takes under 30 minutes if you’ve done API work before.


Why a Voice Agent Is Worth Building Right Now

Sam Altman’s framing when GPT Realtime 2 launched was specific: “People are really starting to use voice to interact with AI, especially when they have a lot of context to dump.” That’s the actual use case. Not novelty. Not accessibility theater. Context throughput.

Typing is slow. When you need to brief an agent on a customer situation, a project status, or a meeting that just happened, voice is 3–5x faster. The bottleneck in most agentic workflows isn’t the model — it’s getting context into the model fast enough to be useful.

GPT Realtime 2 is specifically built for this. It runs GPT-5-class reasoning, handles parallel tool calling (so it can query your calendar and your CRM simultaneously while keeping the conversation going), and manages interruptions gracefully. The demo scenario — read calendar, update CRM, go silent on command — isn’t a toy. It’s a template for a real personal assistant agent.

The translation angle is also worth flagging even if it’s not your immediate use case. GPT Realtime Translate supports 70+ input languages and 13 output languages, and it waits for the verb position in a sentence before beginning translation. That’s a subtle but important design choice: it produces natural-sounding dialogue instead of the choppy word-by-word output you get from naive streaming approaches. If you’re building for multilingual teams or international customer support, that matters a lot.


What You Need Before You Start

An OpenAI account with API access. Not a ChatGPT subscription — an API account at platform.openai.com. These are separate products. GPT Realtime 2 is not yet available in the ChatGPT consumer app or the Codex app as of this writing. It lives in the API only.

API credits. The Realtime API bills per session. The limited demo at platform.openai.com/audio/realtime uses your API credits, so you’ll see charges. Budget a few dollars for testing. Audio tokens are priced differently from text tokens — check the current pricing page before you run long sessions.

A working microphone and a browser that supports WebRTC. Chrome or Edge work reliably. Safari has historically been finicky with WebRTC audio constraints.

Basic familiarity with REST APIs or WebSockets. The Realtime API uses a persistent WebSocket connection, not the standard request/response pattern. If you’ve used the standard Chat Completions API, the mental model is different enough to trip you up. If you haven’t done WebSocket work before, budget an extra hour.

Optional but useful: Node.js 18+ if you want to run the official OpenAI Realtime API reference implementation locally. The playground at platform.openai.com/audio/realtime is the fastest path to a working demo, but local control is where you’ll want to be once you’re building something real.


Setting Up Your First Voice Agent Session

Step 1: Get your API key and verify access

Log into platform.openai.com. Navigate to API Keys and create a new key. Copy it somewhere safe — you won’t see it again.

Before writing any code, verify you have access to the Realtime models. Go to platform.openai.com/audio/realtime directly. If you see the demo interface, your account has access. If you see a waitlist or access error, you’ll need to request access through the standard API access process.
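A quick way to confirm the key itself works is a plain REST call before touching anything Realtime-specific. A minimal sketch using Node 18+'s built-in fetch (run as an ES module); note this verifies authentication only, not Realtime model access:

// Quick sanity check: list available models with the new key
const res = await fetch("https://api.openai.com/v1/models", {
  headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
});
console.log(res.ok ? "API key is valid" : `Auth failed: HTTP ${res.status}`);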

Now you have a confirmed API key and verified Realtime API access.

Step 2: Run the playground demo and understand the session model

Before writing a line of code, spend 10 minutes in the playground. This is not optional — the session model for the Realtime API is different from standard completions, and seeing it work first makes the code make sense.


In the playground, you’ll see the calendar/CRM demo scenario. Try the “be quiet” command specifically. Tell the agent to stay silent, say a few sentences to nobody in particular, then tell it to resume. Watch what happens. The agent listens continuously but suppresses output until you give it the re-engagement signal. This is interruption handling working correctly — the model is tracking the conversation even when it’s not speaking.

This behavior is what makes voice agents usable in real contexts like meetings, live demos, or customer calls where you need to have a side conversation without the agent jumping in. For a broader look at how AI agents are being applied to personal productivity scenarios like this, 6 AI Agents for Personal Productivity covers several patterns that translate directly to voice-first workflows.

Now you have a concrete mental model of what the API is doing before you touch code.

Step 3: Set up a local WebSocket client

The Realtime API communicates over a persistent WebSocket connection. Here’s the minimal setup in Node.js:

import WebSocket from "ws";

const url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2";

const ws = new WebSocket(url, {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "OpenAI-Beta": "realtime=v1",
  },
});

ws.on("open", () => {
  console.log("Connected to Realtime API");
  
  // Send session configuration
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      modalities: ["text", "audio"],
      instructions: "You are a helpful assistant. When told to be quiet or stay silent, stop speaking and listen until told to resume.",
      voice: "alloy",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      turn_detection: {
        type: "server_vad",
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 500
      }
    }
  }));
});

ws.on("message", (data) => {
  const event = JSON.parse(data);
  console.log("Event:", event.type);
});

The OpenAI-Beta: realtime=v1 header is required. Without it, the connection will fail silently or return an error that doesn’t make the missing header obvious.
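Before wiring up microphone audio, you can verify the loop end to end with a text-only round trip. These two events create a user message and ask the model to respond; you should see response.* events stream into your message handler:

// Inside the "open" handler, after the session.update
ws.send(JSON.stringify({
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [{ type: "input_text", text: "Say hello in one short sentence." }]
  }
}));

// Ask the model to respond to the conversation so far
ws.send(JSON.stringify({ type: "response.create" }));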

Now you have a WebSocket connection that can send and receive events from the Realtime API.

Step 4: Configure your system prompt for the calendar/CRM scenario

The demo scenario — reading a calendar and updating a CRM — requires two things: a system prompt that describes the agent’s role, and tool definitions that tell the model what actions it can take.

Here’s a system prompt modeled on the demo:

You are a personal assistant with access to the user's calendar and CRM. 
When asked about upcoming meetings, check the calendar tool and report back concisely.
When asked to update the CRM, use the update_crm tool and confirm what was changed.

Important: If the user tells you to be quiet, stay silent, or says they need a moment, 
stop speaking immediately. Continue listening. Do not speak again until the user 
explicitly tells you to resume or asks you a question.

Before taking any action (like checking the calendar or updating the CRM), 
briefly tell the user what you're about to do. Actions can take a few seconds, 
and the user should know you're working.

That last paragraph is the preamble pattern mentioned in the demo — the model explains itself before tool calls so the user isn’t staring at silence wondering if something broke. With parallel tool calling, the model can be querying multiple systems simultaneously, and a brief “let me pull that up” keeps the conversation feeling alive.

Now you have a system prompt that handles the core interaction patterns.

Step 5: Define your tools


For the calendar/CRM scenario, you need at minimum two tool definitions. Their parameters use the same JSON Schema format as standard function calling, though the Realtime API flattens the structure: name and description sit at the top level rather than under a nested function key.

const tools = [
  {
    type: "function",
    name: "get_calendar",
    description: "Get the user's upcoming calendar events",
    parameters: {
      type: "object",
      properties: {
        time_range: {
          type: "string",
          description: "Time range to check, e.g. 'next 2 hours', 'today', 'this week'"
        }
      },
      required: ["time_range"]
    }
  },
  {
    type: "function", 
    name: "update_crm",
    description: "Update a CRM record with meeting notes or next steps",
    parameters: {
      type: "object",
      properties: {
        contact_name: { type: "string" },
        meeting_date: { type: "string" },
        notes: { type: "string" },
        next_steps: { type: "string" }
      },
      required: ["contact_name", "notes"]
    }
  }
];

In the demo, the agent reads that there’s a meeting with Sable Crust Robotics in 12 minutes and the contact is Alex Kim, their CTO. Then when asked to update the CRM, it pulls context (warehouse automation launch, expansion active, security review as the blocker) and writes the record. That context-pulling before writing is the parallel tool calling in action — it’s not just writing what you told it, it’s gathering relevant context first.
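Definitions alone do nothing until they're registered on the session. A minimal sketch that folds the tools array into the session.update from Step 3 (SYSTEM_PROMPT is assumed to hold the prompt text from Step 4):

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    instructions: SYSTEM_PROMPT, // the prompt from Step 4
    tools,                       // the array defined above
    tool_choice: "auto"          // let the model decide when to call them
  }
}));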

Now you have tool definitions that mirror the demo scenario.

Step 6: Handle tool call events and audio output

The Realtime API sends events for everything: session creation, speech detection, transcription, tool calls, audio deltas. The ones you need to handle for a basic agent:

ws.on("message", (data) => {
  const event = JSON.parse(data);
  
  switch(event.type) {
    case "response.function_call_arguments.done":
      // Model wants to call a tool
      const result = handleToolCall(event.name, JSON.parse(event.arguments));
      ws.send(JSON.stringify({
        type: "conversation.item.create",
        item: {
          type: "function_call_output",
          call_id: event.call_id,
          output: JSON.stringify(result)
        }
      }));
      ws.send(JSON.stringify({ type: "response.create" }));
      break;
      
    case "response.audio.delta":
      // Audio chunk to play to the user
      playAudioChunk(Buffer.from(event.delta, "base64"));
      break;
      
    case "input_audio_buffer.speech_started":
      // User started speaking — stop current audio playback
      stopAudioPlayback();
      break;
  }
});

The input_audio_buffer.speech_started event is how interruption handling works. When the user starts talking, you stop playing the model’s current audio output. The model handles the conversational state — you just need to stop the audio on your end.
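The playAudioChunk and stopAudioPlayback calls above are placeholders. One way to implement them in Node is the npm speaker package, which plays raw PCM; a sketch, assuming the API's default 24kHz 16-bit mono output:

import Speaker from "speaker";

let speaker = null;

function playAudioChunk(pcmChunk) {
  if (!speaker) {
    // Match the Realtime API's default output: 16-bit mono PCM at 24kHz
    speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });
  }
  speaker.write(pcmChunk);
}

function stopAudioPlayback() {
  if (speaker) {
    speaker.close(false); // don't flush: drop buffered audio so the cutoff is immediate
    speaker = null;
  }
}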

Now you have a complete event loop that handles tool calls, audio output, and interruptions.


The Failure Modes You’ll Actually Hit

Audio format mismatch. The Realtime API expects PCM16 audio at 24kHz by default. If you’re capturing audio from a browser or microphone at a different sample rate, you’ll get garbled output or silence. Resample before sending.
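If you're capturing Float32 audio in a browser at 44.1kHz or 48kHz, you'll need to convert before sending. A naive linear-interpolation sketch, fine for testing (use a proper resampler in production):

// Resample Float32 samples (e.g. 48kHz) down to 24kHz PCM16
function toPcm16At24k(float32, inputRate = 48000) {
  const ratio = inputRate / 24000;
  const out = new Int16Array(Math.floor(float32.length / ratio));
  for (let i = 0; i < out.length; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, float32.length - 1);
    // Interpolate between neighboring samples, then clamp to [-1, 1]
    const s = float32[i0] + (float32[i1] - float32[i0]) * (pos - i0);
    const clamped = Math.max(-1, Math.min(1, s));
    out[i] = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
  }
  return out;
}

// Then send it base64-encoded via input_audio_buffer.append
const pcm = toPcm16At24k(float32Samples); // float32Samples: your captured audio
ws.send(JSON.stringify({
  type: "input_audio_buffer.append",
  audio: Buffer.from(pcm.buffer).toString("base64")
}));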

Missing preamble causing perceived hangs. If the model starts a tool call without saying anything first, users will experience 2–4 seconds of silence while the tool runs. This feels broken. The system prompt preamble pattern fixes this — the model says “let me check that” before calling the tool, so the silence has context.

Turn detection sensitivity. The server_vad turn detection has a threshold and silence duration setting. Too sensitive and it interrupts itself mid-sentence. Not sensitive enough and it waits too long after you stop speaking. The defaults (threshold: 0.5, silence: 500ms) are reasonable starting points but you’ll tune these for your specific use case and microphone setup.
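Turn detection can be retuned on a live session; session.update can be sent at any time, not just on connect. For example, to make the agent more tolerant of pauses:

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: {
      type: "server_vad",
      threshold: 0.6,           // require slightly louder speech to register a turn
      prefix_padding_ms: 300,
      silence_duration_ms: 800  // wait longer before assuming the user is done
    }
  }
}));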

WebSocket connection drops. Long-running sessions will occasionally drop. Build reconnection logic from the start — don’t wait until you’re in a demo and the connection silently dies.
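A minimal reconnection sketch with exponential backoff, assuming a createRealtimeSocket() helper that builds the WebSocket exactly as in Step 3:

let retries = 0;

function connectWithRetry() {
  const ws = createRealtimeSocket(); // hypothetical helper wrapping the Step 3 setup

  ws.on("open", () => {
    retries = 0;
    // A new connection is a new session: re-send session.update and any context here
  });

  ws.on("close", () => {
    const delay = Math.min(1000 * 2 ** retries, 30000); // backoff, capped at 30s
    retries += 1;
    setTimeout(connectWithRetry, delay);
  });

  ws.on("error", (err) => console.error("WebSocket error:", err.message));
}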

Cost surprises. Audio tokens are priced per second of audio, not per token in the text sense. A 10-minute voice session costs meaningfully more than a 10-minute text session. Monitor your usage dashboard during development.

If you’re building agents that need to stay running continuously, the infrastructure patterns in keeping a Claude Code agent running 24/7 apply here too — process management, reconnection handling, and monitoring matter for voice agents just as much as for coding agents.


Where to Take This Further

Add real tool integrations. The demo uses a calendar and CRM. Both are straightforward to wire up — Google Calendar has a well-documented REST API, and most CRMs (Salesforce, HubSpot) have webhooks and REST endpoints. The tool definition schema is the same regardless of what’s behind it.
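As a sketch of what sits behind get_calendar, here's a call to the Google Calendar v3 events endpoint. This assumes you've already completed the OAuth flow and have an access token in GOOGLE_ACCESS_TOKEN; token acquisition is its own setup:

async function getCalendar({ time_range }) {
  // Illustrative: real code would parse time_range ("today", "next 2 hours")
  // into a concrete window instead of just starting from now
  const params = new URLSearchParams({
    timeMin: new Date().toISOString(),
    maxResults: "10",
    singleEvents: "true",
    orderBy: "startTime",
  });

  const res = await fetch(
    `https://www.googleapis.com/calendar/v3/calendars/primary/events?${params}`,
    { headers: { Authorization: `Bearer ${process.env.GOOGLE_ACCESS_TOKEN}` } }
  );
  const data = await res.json();
  return data.items.map((e) => ({
    summary: e.summary,
    start: e.start.dateTime ?? e.start.date,
  }));
}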

Build the silent-listening pattern properly. The demo shows the model going silent on command and resuming on a keyword (“back to demo”). Implement this as a state machine in your tool handler: a set_listening_mode tool that takes a boolean, with the system prompt instructing the model to check this state before speaking. This is more reliable than hoping the model infers silence from conversational context alone.
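A sketch of that state machine, with illustrative names (set_listening_mode and agentMuted are not from the demo): the tool flips a client-side flag, and the event loop gates audio output on it, so silence is guaranteed even if the model decides to speak.

let agentMuted = false; // client-side state the tool handler maintains

const setListeningModeTool = {
  type: "function",
  name: "set_listening_mode",
  description: "Mute or unmute spoken output. Call with muted=true when told to be quiet, muted=false when told to resume.",
  parameters: {
    type: "object",
    properties: { muted: { type: "boolean" } },
    required: ["muted"]
  }
};

function handleToolCall(name, args) {
  if (name === "set_listening_mode") {
    agentMuted = args.muted;
    return { ok: true, muted: agentMuted };
  }
  // ...dispatch to get_calendar, update_crm, etc.
}

// In the event loop, gate audio output on the flag:
// case "response.audio.delta":
//   if (!agentMuted) playAudioChunk(Buffer.from(event.delta, "base64"));
//   break;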

Add GPT Realtime Whisper for transcription. If you need a written record of voice sessions — for compliance, for CRM notes, for meeting summaries — GPT Realtime Whisper streams transcription as the speaker talks. You can run it in parallel with GPT Realtime 2 to get both the conversational agent and a transcript simultaneously.

Consider the orchestration layer. Once you have more than two or three tools, managing the agent’s decision-making about which tools to call and in what order becomes its own problem. MindStudio handles this orchestration across 200+ models and 1,000+ integrations with a visual builder — useful if you want to chain your voice agent into a broader workflow without writing all the routing logic yourself.

Think about the spec before the code. If you’re building a production voice agent — not just a demo — the system prompt is effectively a specification document. It defines behavior, edge cases, and failure modes. Remy takes this idea further: you write your application as an annotated markdown spec, and it compiles a complete TypeScript app — backend, database, auth, and deployment — from that spec. For voice agents that need a backend to store conversation history or CRM updates, that’s a meaningfully different way to think about the build process.

Test the interruption handling under realistic conditions. The playground demo makes interruption look seamless. In practice, network latency, audio buffer sizes, and microphone sensitivity all affect how gracefully interruptions work. Test with real users in real environments before you ship.


The calendar/CRM scenario in the demo is a good template precisely because it’s concrete. It has a specific trigger (upcoming meeting), a specific action (CRM update), and a specific social constraint (the agent needs to know when to shut up). Most real voice agent use cases have the same three components. Figure out yours, and the rest is implementation.

One thing the demo makes clear that benchmarks don’t: the quality of a voice agent isn’t primarily about model capability. It’s about the interaction design — when the agent speaks, when it listens, how it signals that it’s working, and how it handles the moments when the human needs to take back control. GPT Realtime 2 gives you the primitives. What you build with them is still up to you.

For building agents that handle more complex multi-step reasoning and need persistent memory across sessions, the patterns in building a self-evolving memory system with Claude Code hooks translate well — the core idea of capturing session context and making it available to future sessions applies to voice agents too. And if you’re evaluating which underlying model to use for the reasoning layer of your agent, GPT-5.4 vs Claude Opus 4.6 breaks down the tradeoffs in detail.

The API is available now. The demo is at platform.openai.com/audio/realtime. Your API credits are the only thing standing between you and a working session.
