
How to Build a Live Translation Voice Agent with OpenAI's GPT Realtime API

OpenAI’s GPT Realtime API supports 70+ input languages with real-time speech translation. Learn how to build a live translation agent using the API.

MindStudio Team

Real-Time Voice Translation Is Now Accessible — Here’s What Changed

Language barriers slow down business. Whether you’re running a multilingual support team, building a global customer service tool, or coordinating across international offices, the gap between “someone speaks” and “someone understands” has always required either a human interpreter or a clunky, asynchronous translation layer.

OpenAI’s GPT Realtime API changes that. It enables genuinely low-latency, speech-to-speech communication — the kind where someone speaks in Mandarin and your agent responds in English almost instantly. No transcription step first. No round-trip to a separate TTS service. Just voice in, voice out, with GPT doing the heavy lifting in the middle.

This guide walks through what the Realtime API actually is, how to use it to build a live translation voice agent, and how to handle the configuration details that make the difference between a demo and a deployable product.


What the GPT Realtime API Actually Does

The Realtime API is built around a persistent WebSocket connection. Unlike standard REST calls — where you send a request and wait for a response — a WebSocket stays open, letting audio stream in and out continuously.

This is what enables real-time interaction. You’re not uploading a file and waiting. You’re sending audio chunks as the user speaks, and receiving audio chunks back as the model responds — all within a single open connection.

The gpt-4o-realtime-preview Model


The core model powering the Realtime API is gpt-4o-realtime-preview. It natively processes audio input rather than first transcribing it to text. This matters for translation specifically because it preserves speaker intent, tone, and nuance that can get lost in a text-only pipeline.

The model supports over 70 input languages and can respond in a separate target language — which is exactly the functionality we’ll use for a translation agent.

What’s Different from a Standard TTS + STT Pipeline

A traditional voice translation setup looks like this:

  • Capture audio
  • Send to a speech-to-text (STT) service
  • Pass text to a translation model
  • Pass translated text to a text-to-speech (TTS) service
  • Play audio back

Each step adds latency. The total round-trip often takes 3–6 seconds — long enough to feel like a delay.

The Realtime API collapses this into a single step. The model handles input audio, reasoning, and output audio natively. The result is response times closer to 500ms–1.5s, which feels much more like a real conversation.


Prerequisites Before You Start

Before building anything, make sure you have the following:

Access and credentials:

  • An OpenAI API key with Realtime API access (currently available on paid tiers)
  • Node.js 18+ installed locally, or a server environment that supports WebSockets

Basic familiarity with:

  • JavaScript or Python (the examples here use JavaScript/Node.js)
  • WebSocket connections (you don’t need to be an expert, but knowing how a persistent, event-driven connection differs from the usual request/response cycle helps)
  • Environment variable management

Optional but helpful:

  • A working microphone and speaker setup for local testing
  • A tool like Insomnia or Postman that supports WebSocket testing

If you’re building for production, you’ll also want a plan for audio streaming from client devices — browser-based via the Web Audio API or from a phone via a telephony bridge like Twilio.


Step 1 — Connect to the Realtime API

The Realtime API endpoint is:

wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview

You authenticate via the Authorization header, passed during the WebSocket handshake.

Here’s a minimal Node.js connection using the ws package:

import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => {
  console.log("Connected to Realtime API");
});

ws.on("message", (data) => {
  const event = JSON.parse(data);
  console.log("Received:", event.type);
});

Once connected, you’ll receive a session.created event confirming the session is live.


Step 2 — Configure the Session for Translation

This is where you define how your agent behaves. Send a session.update event immediately after connection to configure the translation logic.

ws.on("open", () => {
  const sessionConfig = {
    type: "session.update",
    session: {
      modalities: ["audio", "text"],
      instructions: `You are a live interpreter. The user will speak in any language. 
        Listen to what they say, understand it fully, and respond with a fluent, 
        natural translation in English only. Do not explain what you're doing. 
        Do not add commentary. Simply speak the translation.`,
      voice: "alloy",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      input_audio_transcription: {
        model: "whisper-1",
      },
      turn_detection: {
        type: "server_vad",
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 500,
      },
    },
  };

  ws.send(JSON.stringify(sessionConfig));
});

A few key parameters to understand:

instructions — This is your system prompt. For a translation agent, the instructions define the source-to-target language mapping. You can make this dynamic (see Step 5).

voice — The voice for audio output. Options include alloy, echo, fable, onyx, nova, and shimmer.

turn_detection — This tells the API how to detect when the user has stopped speaking. server_vad (Voice Activity Detection) handles this automatically. The silence_duration_ms value determines how long a pause triggers the model to respond — 500ms works well for translation use cases.

input_audio_transcription — Optional, but enabling this gives you a text transcript of what was said, which is useful for logging and debugging.


Step 3 — Stream Audio Input

The Realtime API expects audio in PCM16 format at 24kHz, mono. You send audio chunks as base64-encoded strings using the input_audio_buffer.append event.

function sendAudioChunk(audioBuffer) {
  const base64Audio = audioBuffer.toString("base64");
  ws.send(
    JSON.stringify({
      type: "input_audio_buffer.append",
      audio: base64Audio,
    })
  );
}

In a browser context, you’d capture audio from the microphone using the Web Audio API, convert it to PCM16 at 24kHz, and pipe it through this function.
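Here’s a minimal browser-side sketch of that capture step. It uses ScriptProcessorNode (deprecated in favor of AudioWorklet, but much shorter to demonstrate) and assumes socket is a WebSocket to your own backend, which base64-encodes each chunk and wraps it in input_audio_buffer.append:

// Browser-side sketch: capture mic audio at 24kHz, convert Float32 samples
// to PCM16, and forward raw chunks to your backend relay.
const audioCtx = new AudioContext({ sampleRate: 24000 }); // match the API's 24kHz

async function startCapture(socket) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = audioCtx.createMediaStreamSource(stream);
  const processor = audioCtx.createScriptProcessor(4096, 1, 1); // mono in, mono out

  processor.onaudioprocess = (e) => {
    const float32 = e.inputBuffer.getChannelData(0);
    const pcm16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      // Clamp and scale [-1, 1] floats to signed 16-bit integers
      const s = Math.max(-1, Math.min(1, float32[i]));
      pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    socket.send(pcm16.buffer); // the server base64-encodes and appends
  };

  source.connect(processor);
  processor.connect(audioCtx.destination);
}

For production, prefer an AudioWorklet, which moves sample processing off the main thread.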

In a telephony integration (e.g., a Twilio call), you’d stream the incoming call audio — usually in μ-law (mulaw) format — and transcode it to PCM16 before sending.
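If you need to handle that transcoding yourself, μ-law decoding is a small, well-known routine (G.711). The sketch below decodes each byte to a 16-bit sample, then naively upsamples 8kHz telephony audio to 24kHz by repeating each sample three times; fine for a prototype, but use a proper resampler in production:

// G.711 μ-law byte → signed 16-bit PCM sample
function mulawToPcm16(mulawByte) {
  const BIAS = 0x84; // standard G.711 bias (132)
  const u = ~mulawByte & 0xff;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const sample = (((mantissa << 3) + BIAS) << exponent) - BIAS;
  return u & 0x80 ? -sample : sample;
}

// Decode a buffer of μ-law bytes and upsample 8kHz → 24kHz (crude 3x repeat)
function transcodeMulawBuffer(mulawBuffer) {
  const out = new Int16Array(mulawBuffer.length * 3);
  for (let i = 0; i < mulawBuffer.length; i++) {
    const s = mulawToPcm16(mulawBuffer[i]);
    out[i * 3] = out[i * 3 + 1] = out[i * 3 + 2] = s;
  }
  return Buffer.from(out.buffer); // ready for sendAudioChunk above
}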

Handling the Audio Buffer

If you’re using server_vad, you don’t need to manually commit the buffer — the server will detect speech boundaries and trigger a response automatically. If you disable VAD and want manual control, you can send input_audio_buffer.commit to indicate the user has finished speaking.
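With VAD disabled (turn_detection set to null in the session config), a manual turn is two events: commit the buffered audio, then explicitly request a response:

// Manual turn-taking: tell the API the user is done, then ask for output
ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
ws.send(JSON.stringify({ type: "response.create" }));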


Step 4 — Receive and Play the Translation

The API streams audio back in chunks via the response.audio.delta event. Each delta contains a base64-encoded audio segment.

ws.on("message", (data) => {
  const event = JSON.parse(data);

  if (event.type === "response.audio.delta") {
    const audioChunk = Buffer.from(event.delta, "base64");
    // Pass to your audio playback pipeline
    playAudio(audioChunk);
  }

  if (event.type === "response.audio.done") {
    console.log("Translation complete");
  }

  if (event.type === "conversation.item.input_audio_transcription.completed") {
    console.log("User said:", event.transcript);
  }
});

For browser playback, you’d push chunks into an AudioContext buffer and schedule them for playback. For telephony, you’d write the audio back to the call stream.
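As a browser-side sketch of playAudio, one approach is to convert each PCM16 chunk back to floats and schedule it at a running playhead so chunks play back-to-back. This assumes the same 24kHz audioCtx used for capture, and that chunks arrive as ArrayBuffers from your relay:

// Convert a PCM16 chunk to an AudioBuffer and schedule it after the
// previous chunk, with a small lead time to absorb network jitter.
let playheadTime = 0;

function playAudio(arrayBuffer) {
  const pcm16 = new Int16Array(arrayBuffer);
  const float32 = new Float32Array(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) {
    float32[i] = pcm16[i] / 0x8000; // back to [-1, 1] floats
  }

  const buffer = audioCtx.createBuffer(1, float32.length, 24000); // mono, 24kHz
  buffer.copyToChannel(float32, 0);

  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(audioCtx.destination);

  playheadTime = Math.max(playheadTime, audioCtx.currentTime + 0.05);
  source.start(playheadTime);
  playheadTime += buffer.duration;
}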

Events Worth Monitoring

  • session.created: Connection confirmed
  • session.updated: Your config was accepted
  • input_audio_buffer.speech_started: VAD detected the user is speaking
  • input_audio_buffer.speech_stopped: User stopped speaking
  • response.created: Model has started generating a response
  • response.audio.delta: Audio chunk ready to play
  • response.audio.done: Full audio response complete
  • response.done: Entire response (text + audio) is done
  • error: Something went wrong; check error.message

Step 5 — Make Translation Direction Dynamic

A static system prompt locks you into one language pair. For a real product, you want to swap translation direction based on context — a user preference, a language detected at session start, or a setting in your UI.

The cleanest approach is to update the session instructions before each conversation turn. You can send a new session.update event at any point during the session:

function setTranslationDirection(sourceLang, targetLang) {
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        instructions: `You are a live interpreter. The user will speak in ${sourceLang}. 
          Translate everything they say into fluent, natural ${targetLang}. 
          Speak only the translation — no explanations or additions.`,
      },
    })
  );
}

Call this function based on user selection in your UI before they start speaking. You can also trigger it mid-session if the language direction changes.

Supporting Bidirectional Translation

For a two-way conversation — say, a doctor speaking English and a patient speaking Spanish — you need to detect which speaker is talking and switch the translation direction accordingly.

One approach: use separate audio inputs for each participant, with their own session configurations pointing to each other’s output. A simpler approach for single-device use: add a UI toggle the facilitator flips when switching speakers.
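The toggle version is a few lines on top of the setTranslationDirection helper from Step 5:

// Facilitator-controlled toggle: flip source and target, then update the session
let direction = { source: "English", target: "Spanish" };

function flipDirection() {
  direction = { source: direction.target, target: direction.source };
  setTranslationDirection(direction.source, direction.target);
}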


Step 6 — Handle Interruptions and Edge Cases

Real conversations aren’t clean. People talk over each other, pause mid-sentence, or change what they’re saying. The Realtime API has a few tools for handling this gracefully.

Canceling an In-Progress Response

If the user starts speaking while the model is still outputting audio, you can cancel the current response:

ws.send(
  JSON.stringify({
    type: "response.cancel",
  })
);

If you’re using server VAD, interruptions are detected automatically: the model stops output when it hears the user speaking again.

Truncating Audio

If you played some audio but want to signal that the user “interrupted” at a specific point, use conversation.item.truncate to tell the API how much audio was actually heard. This keeps the conversation history accurate.
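A sketch of that truncate call, assuming your playback code tracks the ID of the assistant’s current audio item (item IDs arrive on response.output_item.added events) and how many milliseconds have actually played:

// Tell the API where playback actually stopped so the conversation
// history matches what the user heard. lastAssistantItemId and playedMs
// are assumed to be tracked by your playback code.
ws.send(
  JSON.stringify({
    type: "conversation.item.truncate",
    item_id: lastAssistantItemId,
    content_index: 0,
    audio_end_ms: playedMs,
  })
);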

Error Recovery

The API will send an error event for things like invalid audio format, rate limits, or content policy violations. Always include error handling:

if (event.type === "error") {
  console.error("Realtime API error:", event.error.code, event.error.message);
  // Reconnect logic, user notification, etc.
}

For production deployments, add reconnection logic — WebSocket connections can drop, and your application should handle that without losing the conversation context.
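A minimal backoff-and-retry sketch, assuming a connect() helper that wraps the Step 1 setup and re-sends your session.update on open. Note that a fresh connection starts a fresh session; conversation history is not restored automatically, so re-inject any context you need:

// Exponential backoff reconnect, capped at 30 seconds between attempts
let retries = 0;

function connectWithRetry() {
  const ws = connect(); // hypothetical: Step 1 connection + Step 2 session.update

  ws.on("open", () => {
    retries = 0; // connection is healthy again
  });

  ws.on("close", () => {
    const delay = Math.min(30_000, 1000 * 2 ** retries);
    retries += 1;
    console.warn(`Connection dropped, retrying in ${delay}ms`);
    setTimeout(connectWithRetry, delay);
  });
}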


Step 7 — Test Across Languages

Once your agent is running, test it systematically before deploying. The model handles most languages well, but there are a few things worth verifying:

  • Code-switching: What happens when a speaker mixes two languages mid-sentence? This is common in bilingual communities and the model handles it, but you should test for your specific use case.
  • Accented speech: Test with speakers who have different accents in the source language. Whisper transcription is generally robust, but edge cases exist.
  • Technical terminology: For specialized domains (medical, legal, technical), the translation may default to general vocabulary. You can address this by expanding the system prompt to specify domain context (see the example after this list).
  • Response latency by language: Some language pairs have slightly more latency than others. Measure this during testing and set user expectations accordingly.
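For example, a domain-scoped version of the session instructions might look like this (wording is illustrative):

// Narrow the interpreter to a medical context to keep terminology precise
ws.send(
  JSON.stringify({
    type: "session.update",
    session: {
      instructions: `You are a live medical interpreter. The user will speak in Spanish.
        Translate everything they say into fluent, natural English, preserving
        clinical terminology precisely (drug names, dosages, symptoms).
        Speak only the translation.`,
    },
  })
);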

OpenAI’s documentation on the Realtime API includes a full list of supported languages and known limitations.


Where MindStudio Fits Into This

Building the WebSocket layer yourself gives you full control — but it also means you’re managing infrastructure, handling reconnections, logging conversation transcripts, and building out any downstream actions manually.

If you want to take a translated conversation and do something with it — log it to Airtable, send a summary to Slack, trigger a CRM update, or route it to a support ticket — that’s where MindStudio becomes useful.

MindStudio is a no-code platform for building AI agents and workflows. It has direct access to 200+ AI models (including OpenAI’s latest) and 1,000+ integrations with business tools, without you needing to manage API keys or write glue code.

For a translation use case, you could build a MindStudio workflow that:

  • Takes an incoming audio file or transcript
  • Routes it through a translation step
  • Formats the output for your target system (a ticket, a summary, a CRM note)
  • Sends it to the right destination automatically

It won’t replace the real-time WebSocket layer for live speech-to-speech — but for the before and after steps, it saves a significant amount of backend work. You can try MindStudio free at mindstudio.ai and connect it to your existing pipeline.

If you’re looking to build broader AI-powered workflows beyond translation, the MindStudio workflow builder also supports multi-step agents that reason across tasks — useful if translation is one step in a larger process.


Common Mistakes to Avoid

Sending audio in the wrong format — PCM16 at 24kHz mono is required. Sending MP3, WAV with headers, or stereo audio will cause errors or garbled output. Convert before sending.

Overly complex system prompts — For translation agents, simple instructions outperform elaborate ones. The model doesn’t need a backstory. It needs clear, direct instructions about what to translate and how to respond.

Not handling input_audio_buffer.speech_stopped events — If you’re building a UI, you should update the interface when the user stops speaking (e.g., show a “translating…” indicator). Ignoring this event leaves users uncertain whether their input was captured.

Neglecting latency on slow connections — The Realtime API requires a stable connection. On slow or unreliable networks, audio chunks may arrive out of order or with gaps. Add a small buffer on the playback side to smooth this out.

Hardcoding the target language — If you’re building for real users, you’ll almost certainly need dynamic language selection. Build it from the start rather than retrofitting it later.


Frequently Asked Questions

What languages does the GPT Realtime API support for translation?

The gpt-4o-realtime-preview model supports over 70 input languages. For output (spoken translation), it performs best with major world languages including English, Spanish, French, German, Portuguese, Italian, Japanese, Korean, Chinese (Mandarin), and Arabic, among others. Support quality varies — high-resource languages with more training data generally produce more accurate translations.

How much does the GPT Realtime API cost?

Pricing for the Realtime API is based on audio tokens. As of 2024, input audio is priced at $100 per 1M tokens and output audio at $200 per 1M tokens (approximately). Text tokens are cheaper. Costs scale with usage, so estimate carefully for high-volume applications. Check OpenAI’s pricing page for current rates.

Can I use the Realtime API directly from a browser?

Yes, but you should not expose your API key in client-side code. The recommended pattern is to have your backend establish the WebSocket connection and proxy audio between the client and OpenAI. Some implementations use an ephemeral token approach — generate a short-lived token server-side and pass it to the client for direct connection.
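Sketched below, assuming an Express backend and OpenAI’s session-minting REST endpoint (verify the exact request and response shape against current docs):

// Server-side: mint a short-lived token so the browser never sees your real
// API key. Uses Node 18+'s global fetch and an assumed Express app.
app.post("/session", async (req, res) => {
  const r = await fetch("https://api.openai.com/v1/realtime/sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-realtime-preview",
      voice: "alloy",
    }),
  });
  const session = await r.json();
  // client_secret.value is the ephemeral token the client connects with
  res.json({ token: session.client_secret.value });
});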

How do I handle bidirectional translation between two speakers?


For bidirectional translation, you have a few options. The simplest: use a toggle in the UI that switches the session instructions when speaker roles change. More robust: run two separate sessions — one for each language direction — and route audio to the appropriate session based on which speaker is active. For telephony integrations, you can use separate audio streams per speaker leg.

What’s the difference between the Realtime API and using Whisper + GPT-4 + TTS separately?

A chained pipeline (Whisper → GPT-4 → TTS) is more flexible and supports asynchronous use cases, but it adds latency at each step. Total round-trip is typically 3–6+ seconds. The Realtime API compresses this to ~500ms–1.5s by handling everything natively. The tradeoff is that the Realtime API currently offers less fine-grained control over each step and costs more per token.

Can I log transcripts from Realtime API sessions?

Yes. When you enable input_audio_transcription in your session config (using whisper-1), you’ll receive conversation.item.input_audio_transcription.completed events with text transcripts of what was spoken. You can also log response text from response.text.delta events. Store these to a database or logging service of your choice for audit trails or analytics.
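A small sketch of that logging, inside the same message handler from Step 4 (saveToLog is a placeholder for your storage layer; response.audio_transcript.done carries the transcript of the model’s spoken output):

// Log both sides of the conversation as transcripts arrive
if (event.type === "conversation.item.input_audio_transcription.completed") {
  saveToLog({ role: "user", text: event.transcript });
}
if (event.type === "response.audio_transcript.done") {
  saveToLog({ role: "assistant", text: event.transcript });
}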


Key Takeaways

  • The GPT Realtime API uses a persistent WebSocket connection to enable low-latency, speech-to-speech translation with latency as low as 500ms.
  • The core model, gpt-4o-realtime-preview, natively processes audio — no separate STT, translation, and TTS steps required.
  • Session configuration (system prompt, voice, VAD settings, audio format) drives agent behavior. Keep translation instructions direct and simple.
  • Dynamic language selection, bidirectional translation, and interruption handling are all supported — they just require deliberate configuration.
  • For downstream workflow steps — logging, routing, CRM updates — platforms like MindStudio can handle the integration layer without custom backend code.

The gap between “this works in a demo” and “this works reliably in production” comes down to error handling, audio format consistency, and latency management. Get those right and you have a genuinely useful product.
