How to Build a Voice Agent with Real-Time Translation Using OpenAI GPT Realtime 2
OpenAI GPT Realtime 2 supports live translation across 70 languages. Learn how to build a real-time translation voice agent using the API and agentic tools.
What Makes Real-Time Voice Translation Different From Everything Before It
Language barriers cost businesses real money. Missed deals, frustrated customers, slow support — the problem is well-documented. And while machine translation has existed for decades, the pipeline has always been clunky: speak, wait, transcribe, translate, wait again, hear a response. The latency alone makes conversations feel broken.
GPT Realtime 2 changes the equation. OpenAI’s Realtime API now supports speech-to-speech processing across more than 70 languages, with low enough latency to feel like a natural conversation. No chunked transcription. No separate translation step. Audio goes in, translated audio comes out — in near real time.
This guide walks through how to build a real-time translation voice agent using GPT Realtime 2, from WebSocket setup to turn detection to handling multilingual routing. Whether you’re building a customer support tool, a conference interpreter, or a multilingual sales assistant, the same core architecture applies.
Understanding the GPT Realtime 2 API
OpenAI’s Realtime API is built around a persistent WebSocket connection. Unlike the standard Chat Completions API, where you send a request and wait for a full response, the Realtime API streams audio bidirectionally in real time.
The model being used here is gpt-4o-realtime-preview — OpenAI’s most capable real-time model at the time of writing. The “Realtime 2” designation refers to the updated 2024/2025 version of this API, which introduced expanded language support, improved turn detection, and more stable multilingual output.
How the audio pipeline works
The basic flow looks like this:
- Your app captures audio from a microphone
- That audio is encoded (typically PCM16 at 24kHz) and sent over WebSocket as base64 chunks
- The model processes the incoming audio stream in real time
- The model responds with audio output (also streamed back as base64)
- Your app decodes and plays that audio
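In code, that loop comes down to a few JSON events. A condensed sketch (Node.js with the ws package; chunk and playbackQueue stand in for your capture and playback plumbing):

// Send one microphone chunk: raw PCM16 at 24kHz, base64-encoded
ws.send(JSON.stringify({
  type: 'input_audio_buffer.append',
  audio: chunk.toString('base64')
}));

// Receive translated audio as it streams back
ws.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'response.audio.delta') {
    playbackQueue.push(Buffer.from(event.delta, 'base64')); // PCM16 out
  }
});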
For a translation agent specifically, you configure the model’s instructions to receive input in one language and respond in another. The model handles both the comprehension and the generation — there’s no separate translation layer.
Session configuration
Before audio flows, you establish a session with a configuration object. This is where you set the model’s behavior:
{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "instructions": "You are a real-time interpreter. The user will speak in Spanish. Translate everything they say into English and respond only in English. Do not add commentary. Translate only.",
    "voice": "alloy",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "whisper-1"
    },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    }
  }
}
The instructions field is where your translation logic lives. You specify the source and target languages, the persona, and any behavioral constraints.
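Once the WebSocket is open, this object is sent as a single session.update event. A minimal sketch (Node.js, ws package assumed; sessionConfig is the JSON object above):

// Apply the translation configuration as soon as the socket opens
ws.on('open', () => {
  ws.send(JSON.stringify(sessionConfig)); // the session.update object above
});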
Prerequisites Before You Build
Before writing a line of code, make sure you have the following in place.
Access and credentials:
- An OpenAI account with Realtime API access (currently requires a paid tier)
- An API key stored as an environment variable (OPENAI_API_KEY)
- Node.js 18+ or Python 3.10+ (this guide uses Node.js for the WebSocket layer)
Understanding of the audio formats:
- The Realtime API uses PCM16 audio at 24kHz sample rate
- Input and output are sent as base64-encoded chunks
- Most browser MediaRecorder implementations will need conversion
Familiarity with WebSockets:
- The connection URL is wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview
- Auth is handled via a header: Authorization: Bearer YOUR_API_KEY
If you’re building for the browser, OpenAI provides an ephemeral token mechanism so you don’t expose your API key on the client side. We’ll cover that in the architecture section.
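For a quick server-side smoke test, where the key never leaves your machine, the connection looks like this (a sketch using the ws package; the OpenAI-Beta header matches the beta protocol version):

import WebSocket from 'ws';

// Server-side connection: the API key travels in a header, never to a client
const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
  {
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'OpenAI-Beta': 'realtime=v1'
    }
  }
);

ws.on('open', () => console.log('Realtime session open'));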
Step-by-Step: Building the Translation Voice Agent
Step 1 — Set up the project structure
Create a new Node.js project and install dependencies:
mkdir voice-translator
cd voice-translator
npm init -y
npm pkg set type=module   # server.js below uses ES module imports
npm install ws dotenv express
Your project structure should look like this:
voice-translator/
├── server.js          # WebSocket proxy + session management
├── public/
│   ├── index.html     # Client UI
│   └── audio.js       # Browser audio capture and playback
├── .env
└── package.json
Step 2 — Build the server-side WebSocket proxy
You don’t want your OpenAI API key exposed in the browser. The standard pattern is to run a lightweight server that creates a session token (or proxies the WebSocket connection) on behalf of the client.
// server.js
import express from 'express';
import dotenv from 'dotenv';
import { createServer } from 'http';

dotenv.config();

const app = express();
app.use(express.static('public'));
app.use(express.json());

// Ephemeral token endpoint: mints a short-lived client secret so the
// browser can connect to OpenAI directly without seeing the real API key
app.post('/session', async (req, res) => {
  const { sourceLang, targetLang } = req.body;
  const response = await fetch('https://api.openai.com/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'gpt-4o-realtime-preview',
      voice: 'alloy',
      instructions: buildTranslationPrompt(sourceLang, targetLang)
    })
  });
  const data = await response.json();
  res.json({ client_secret: data.client_secret });
});

function buildTranslationPrompt(sourceLang, targetLang) {
  return `You are a professional real-time interpreter.
The user will speak in ${sourceLang}.
Translate everything they say into ${targetLang} accurately and immediately.
Preserve the speaker's tone and intent.
Do not add explanations, caveats, or commentary.
Translate only. Respond only in ${targetLang}.`;
}

const server = createServer(app);
server.listen(3000, () => console.log('Server running on port 3000'));
Step 3 — Handle browser audio capture
The browser’s MediaRecorder API doesn’t output PCM16 by default. You’ll need to work with the Web Audio API to capture raw PCM data.
// public/audio.js
class AudioCapture {
  constructor() {
    this.audioContext = null;
    this.mediaStream = null;
    this.processor = null;
    this.onAudioData = null;
  }

  async start() {
    // Match the Realtime API's expected sample rate up front
    this.audioContext = new AudioContext({ sampleRate: 24000 });
    this.mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const source = this.audioContext.createMediaStreamSource(this.mediaStream);

    // ScriptProcessorNode is deprecated but still widely supported;
    // an AudioWorklet is the modern replacement for production use
    this.processor = this.audioContext.createScriptProcessor(4096, 1, 1);
    this.processor.onaudioprocess = (e) => {
      const inputData = e.inputBuffer.getChannelData(0);
      const pcm16 = this.floatToPCM16(inputData);
      const base64 = this.arrayBufferToBase64(pcm16.buffer);
      if (this.onAudioData) {
        this.onAudioData(base64);
      }
    };

    source.connect(this.processor);
    this.processor.connect(this.audioContext.destination);
  }

  // Convert Web Audio float samples (-1..1) to 16-bit signed PCM
  floatToPCM16(floatArray) {
    const pcm = new Int16Array(floatArray.length);
    for (let i = 0; i < floatArray.length; i++) {
      const s = Math.max(-1, Math.min(1, floatArray[i]));
      pcm[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }
    return pcm;
  }

  arrayBufferToBase64(buffer) {
    const bytes = new Uint8Array(buffer);
    let binary = '';
    for (let i = 0; i < bytes.length; i++) {
      binary += String.fromCharCode(bytes[i]);
    }
    return btoa(binary);
  }

  stop() {
    this.processor?.disconnect();
    this.mediaStream?.getTracks().forEach(t => t.stop());
    this.audioContext?.close();
  }
}
Step 4 — Connect to the Realtime API from the browser
With a session token in hand, the browser can connect directly to OpenAI’s WebSocket endpoint:
// public/audio.js (continued)
class TranslationAgent {
  constructor() {
    this.ws = null;
    this.audioCapture = new AudioCapture();
    this.audioQueue = [];
    this.isPlaying = false;
    this.playbackContext = null; // shared output context, created lazily
  }

  async connect(sourceLang, targetLang) {
    // Get an ephemeral token from your server
    const { client_secret } = await fetch('/session', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ sourceLang, targetLang })
    }).then(r => r.json());

    // Browser WebSockets can't set headers, so auth rides in the
    // subprotocol list using the ephemeral client secret
    this.ws = new WebSocket(
      'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
      ['realtime', `openai-insecure-api-key.${client_secret.value}`, 'openai-beta.realtime-v1']
    );

    this.ws.onopen = () => {
      console.log('Connected to GPT Realtime');
      this.startAudio();
    };
    this.ws.onmessage = (event) => {
      const message = JSON.parse(event.data);
      this.handleMessage(message);
    };
    this.ws.onerror = (err) => console.error('WebSocket error:', err);
    this.ws.onclose = () => console.log('Connection closed');
  }

  async startAudio() {
    this.audioCapture.onAudioData = (base64) => {
      if (this.ws?.readyState === WebSocket.OPEN) {
        this.ws.send(JSON.stringify({
          type: 'input_audio_buffer.append',
          audio: base64
        }));
      }
    };
    await this.audioCapture.start();
  }

  handleMessage(message) {
    switch (message.type) {
      case 'response.audio.delta':
        this.queueAudio(message.delta);
        break;
      case 'response.audio.done':
        console.log('Response complete');
        break;
      case 'conversation.item.input_audio_transcription.completed':
        console.log('Transcribed:', message.transcript);
        break;
      case 'error':
        console.error('API error:', message.error);
        break;
    }
  }

  queueAudio(base64Delta) {
    // Decode the base64 delta into raw PCM16 bytes and queue for playback
    const binary = atob(base64Delta);
    const buffer = new ArrayBuffer(binary.length);
    const view = new Uint8Array(buffer);
    for (let i = 0; i < binary.length; i++) {
      view[i] = binary.charCodeAt(i);
    }
    this.audioQueue.push(buffer);
    if (!this.isPlaying) this.playNext();
  }

  async playNext() {
    if (this.audioQueue.length === 0) {
      this.isPlaying = false;
      return;
    }
    this.isPlaying = true;

    // Reuse one output AudioContext rather than creating one per chunk
    this.playbackContext ??= new AudioContext({ sampleRate: 24000 });
    const buffer = this.audioQueue.shift();

    // Convert PCM16 back to float32 for the Web Audio API
    const pcm16 = new Int16Array(buffer);
    const float32 = new Float32Array(pcm16.length);
    for (let i = 0; i < pcm16.length; i++) {
      float32[i] = pcm16[i] / 32768;
    }

    const audioBuffer = this.playbackContext.createBuffer(1, float32.length, 24000);
    audioBuffer.copyToChannel(float32, 0);
    const source = this.playbackContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(this.playbackContext.destination);
    source.onended = () => this.playNext();
    source.start();
  }
}
Step 5 — Build the language selector UI
A minimal HTML interface lets users choose the source and target languages before starting a session:
<!-- public/index.html -->
<!DOCTYPE html>
<html>
<head>
  <title>Real-Time Translator</title>
</head>
<body>
  <h2>Voice Translator</h2>

  <label>Speaking in:</label>
  <select id="sourceLang">
    <option value="Spanish">Spanish</option>
    <option value="French">French</option>
    <option value="German">German</option>
    <option value="Japanese">Japanese</option>
    <option value="Mandarin Chinese">Mandarin Chinese</option>
    <option value="Arabic">Arabic</option>
    <option value="Portuguese">Portuguese</option>
  </select>

  <label>Translate to:</label>
  <select id="targetLang">
    <option value="English">English</option>
    <option value="Spanish">Spanish</option>
    <option value="French">French</option>
    <option value="German">German</option>
  </select>

  <button id="startBtn">Start Translation</button>
  <button id="stopBtn" disabled>Stop</button>
  <div id="status"></div>
  <div id="transcript"></div>

  <script src="audio.js"></script>
  <script>
    const agent = new TranslationAgent();

    document.getElementById('startBtn').onclick = async () => {
      const source = document.getElementById('sourceLang').value;
      const target = document.getElementById('targetLang').value;
      await agent.connect(source, target);
      document.getElementById('startBtn').disabled = true;
      document.getElementById('stopBtn').disabled = false;
      document.getElementById('status').textContent = `Translating ${source} → ${target}`;
    };

    document.getElementById('stopBtn').onclick = () => {
      agent.audioCapture.stop();
      agent.ws?.close();
      document.getElementById('startBtn').disabled = false;
      document.getElementById('stopBtn').disabled = true;
    };
  </script>
</body>
</html>
Handling Turn Detection and Interruptions
One of the strongest features of the GPT Realtime 2 API is server-side Voice Activity Detection (VAD). The model automatically detects when a user starts and stops speaking — no manual button-press required.
How server VAD works
With turn_detection.type set to server_vad, the API:
- Monitors incoming audio for speech activity
- Automatically commits the audio buffer when silence is detected
- Triggers a response without you needing to send an input_audio_buffer.commit event
- Handles natural pauses without cutting off mid-sentence
The key parameters to tune:
| Parameter | Default | Effect |
|---|---|---|
| threshold | 0.5 | Sensitivity to speech (0–1). Lower = more sensitive |
| prefix_padding_ms | 300 | Audio kept before speech starts |
| silence_duration_ms | 500 | How long silence must last before committing |
For translation use cases, a slightly higher silence_duration_ms (600–800ms) tends to work better — speakers often pause mid-sentence across languages, and you don’t want the model cutting off an incomplete thought.
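The adjustment is a single session.update sent after the session is live. The 700ms value below is an illustrative starting point, not an official recommendation:

// Give speakers more room to pause before the model takes its turn
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    turn_detection: {
      type: 'server_vad',
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 700
    }
  }
}));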
Handling interruptions
The Realtime API supports mid-response interruptions. If a user speaks while the model is generating audio, you can detect this and cancel the current response:
// Listen for when the user starts speaking again (inside handleMessage)
if (message.type === 'input_audio_buffer.speech_started') {
  // Cancel the in-progress response on the server
  this.ws.send(JSON.stringify({ type: 'response.cancel' }));
  // Clear locally queued audio; a chunk already playing will finish
  // unless you also stop the active AudioBufferSourceNode
  this.audioQueue = [];
}
This creates a much more natural conversation flow — the translated voice doesn’t keep talking when the original speaker interjects.
Supporting Multiple Language Pairs Dynamically
A production translation agent usually needs to handle more than one fixed language pair. There are two patterns for this.
Pattern 1 — New session per language switch
The simplest approach: when a user switches languages, close the WebSocket and open a fresh session with new instructions. Sessions are cheap to create, and this approach keeps the model’s context clean.
This works well for call center or support scenarios where each caller speaks one language throughout the interaction.
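A sketch of the pattern, built on the TranslationAgent class from Step 4 (switchLanguages is an illustrative helper, not an SDK method):

// Tear down the current session and start a fresh one with new instructions
async function switchLanguages(agent, sourceLang, targetLang) {
  agent.audioCapture.stop();
  agent.ws?.close();
  await agent.connect(sourceLang, targetLang); // mints a new ephemeral token
}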
Pattern 2 — Auto-detect with dynamic instructions
For scenarios where the input language is unknown (a live conference, for example), you can prompt the model to detect and translate:
instructions: `You are a professional interpreter.
Listen to what the user says and automatically detect the language.
Translate everything into English.
If the user speaks English, repeat back clearly.
Never respond in anything other than English.`
This works surprisingly well for major world languages. The model handles detection implicitly — you don’t need a separate language identification step.
Language coverage
GPT-4o’s multilingual capabilities span the 70+ languages the Realtime API supports, with the best performance on:
- European languages: Spanish, French, German, Italian, Portuguese, Dutch, Polish
- East Asian languages: Japanese, Korean, Mandarin Chinese
- Middle Eastern: Arabic, Hebrew, Turkish
- South Asian: Hindi, Bengali, Urdu
Performance degrades for lower-resource languages. For critical applications, test your specific language pair before deploying.
Common Issues and How to Fix Them
Building with real-time audio APIs surfaces a specific set of problems. Here’s what to watch for.
Echo and feedback loops: If your output audio plays through speakers near the microphone, the model will hear its own voice and try to translate it. Always use headphones during development, and implement acoustic echo cancellation in production (available via getUserMedia constraints).
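In the AudioCapture class from Step 3, that is a one-line constraint change:

// Ask the browser for acoustic echo cancellation and noise suppression
this.mediaStream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true }
});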
Audio buffer underruns: Choppy playback often means you’re not receiving audio chunks fast enough to play them continuously. Buffer at least 300–500ms of audio before starting playback.
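One way to do that is to gate playNext on queue depth, sketched here against the queueAudio method from Step 4 (the 400ms threshold is an illustrative midpoint of that range):

queueAudio(base64Delta) {
  // ...decode the base64 delta into an ArrayBuffer as before, then:
  this.audioQueue.push(buffer);

  // 24kHz * 2 bytes per sample ≈ 48 bytes/ms, so 400ms ≈ 19,200 bytes
  const queuedBytes = this.audioQueue.reduce((n, b) => n + b.byteLength, 0);
  if (!this.isPlaying && queuedBytes >= 19200) this.playNext();
}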
Transcription mismatch: If input_audio_transcription is enabled, the transcription happens separately from the translation. The transcript reflects what was actually spoken (in the source language), while the audio response is the translation. This is expected behavior.
Session expiration: Realtime sessions have a maximum duration (typically 30 minutes for the standard tier). Implement reconnection logic with session handoff for long-running applications.
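A minimal version of that logic, replacing the simple onclose handler from Step 4 (it assumes the agent saved its language pair when connect was called; the one-second backoff is illustrative):

// Auto-reconnect: open a fresh session whenever the old one closes
this.ws.onclose = async () => {
  console.log('Session closed; reconnecting...');
  await new Promise(resolve => setTimeout(resolve, 1000)); // fixed backoff
  await this.connect(this.sourceLang, this.targetLang);
};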
Rate limits: The Realtime API has separate rate limits from the standard API. Check your tier’s concurrent session limits before deploying at scale.
Where MindStudio Fits Into This
Building the core WebSocket loop requires code. But wrapping that translation agent in a deployable workflow — routing calls, logging transcripts, sending summaries to CRMs, triggering follow-ups — is where a lot of teams waste time.
MindStudio lets you build the surrounding automation without writing that layer from scratch. You can connect a real-time translation workflow to downstream tools — logging calls to a Google Sheet, summarizing translated conversations and sending them to Slack, or routing transcripts into HubSpot deals — using MindStudio’s visual builder and its 1,000+ integrations.
For teams building multilingual support agents, a practical pattern looks like this:
- The voice agent (built using the approach in this article) handles the real-time translation
- A MindStudio workflow receives the transcript webhook after each call
- The workflow routes the summary to the appropriate CRM contact, sends a translated follow-up email, and logs the interaction to Airtable
MindStudio supports GPT-4o and 200+ other models out of the box, so you can test translation quality across models without managing API keys separately. The average workflow takes under an hour to build.
You can try MindStudio free at mindstudio.ai — no credit card required to start.
Frequently Asked Questions
What languages does GPT Realtime 2 support?
The GPT-4o Realtime API supports more than 70 languages for speech recognition and translation, with coverage for major European, East Asian, Middle Eastern, and South Asian languages. Performance is strongest for high-resource languages like Spanish, French, German, Japanese, and Mandarin. OpenAI continues expanding language support, so check the official documentation for the current list.
How much latency should I expect from a real-time translation voice agent?
In practice, end-to-end latency (from finished speech to hearing the translation) typically falls between 300ms and 800ms on a stable connection. This includes audio processing, model inference, and streaming playback startup. For natural conversation, anything under 1 second is generally acceptable. Network conditions and server load affect this.
Can the GPT Realtime API handle both parties of a bilingual conversation?
Yes, but it requires some architecture work. The simplest setup is two separate WebSocket sessions — one translating in each direction — with audio routing logic to decide which session receives input at any given moment. For push-to-talk applications, this is straightforward. For fully free-flowing conversations, you’ll need robust speaker diarization or physical separation (separate microphones per speaker).
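A push-to-talk sketch of the two-session pattern, reusing the TranslationAgent class from this guide (talkButtonA is an assumed DOM element; the B-side wiring mirrors it):

// Two one-directional sessions: Spanish→English and English→Spanish
const agentAtoB = new TranslationAgent();
const agentBtoA = new TranslationAgent();
await agentAtoB.connect('Spanish', 'English');
await agentBtoA.connect('English', 'Spanish');

// Route microphone audio to exactly one session at a time
talkButtonA.onmousedown = () => agentAtoB.startAudio();
talkButtonA.onmouseup = () => agentAtoB.audioCapture.stop();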
What’s the difference between the Realtime API and using Whisper + GPT-4o + TTS separately?
The chained approach (Whisper transcription → GPT-4o translation → TTS generation) works, but introduces 2–4 seconds of latency per turn because each step runs sequentially. The Realtime API processes audio end-to-end in a single model pass, which is how it achieves sub-second response times. The tradeoff is that the Realtime API is more expensive per minute than the chained approach and has less flexibility for custom post-processing.
Is the GPT Realtime API suitable for production customer-facing applications?
It depends on your scale and reliability requirements. The API is generally available and supported, but real-time WebSocket connections require more careful infrastructure planning than REST calls — you need to handle reconnects, session limits, and concurrent connection caps. For low-to-medium volume (hundreds of concurrent sessions), it’s production-ready. For high-scale deployments, plan for connection pooling and fallback logic.
How do I prevent the model from translating its own audio output?
Use acoustic echo cancellation at the audio capture level. In the browser, pass { audio: { echoCancellation: true, noiseSuppression: true } } as constraints to getUserMedia. In native apps, use the platform’s built-in AEC processing. You can also implement software-level gating — mute the microphone input while output audio is playing — as a simpler fallback.
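The gating fallback is a few lines on top of the startAudio handler from Step 4, using the isPlaying flag the class already tracks:

// Software gate: drop microphone chunks while translated audio is playing
this.audioCapture.onAudioData = (base64) => {
  if (this.isPlaying) return; // mute input during playback
  if (this.ws?.readyState === WebSocket.OPEN) {
    this.ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64 }));
  }
};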
Key Takeaways
Building a real-time translation voice agent with GPT Realtime 2 is more accessible than it has ever been, but it still requires working with WebSockets, audio encoding, and session management. Here’s what matters most:
- Use server-side VAD for natural turn detection — it handles pauses and interruptions better than manual commit logic
- Run a server-side proxy to keep your API key off the client; ephemeral tokens are the right pattern for browser-based apps
- Tune silence detection for your specific language pair — translation workflows often need slightly longer pause thresholds than general conversation
- Handle interruptions explicitly — cancel in-progress responses when new speech starts for a natural feel
- Separate real-time audio logic from business workflow logic — the latter is where tools like MindStudio save significant time
The core voice translation loop described here can be running in an afternoon. The work beyond that — deployment, logging, CRM integration, multilingual routing — is where teams typically spend more time than they expect.