How to Build a Voice Agent with Real-Time Translation Using OpenAI GPT Realtime 2
OpenAI GPT Realtime 2 supports live translation across 70 languages. Learn how to build a real-time translation voice agent using the API and agentic tools.
What Makes Real-Time Voice Translation Different From Everything Before It
Language barriers cost businesses real money. Missed deals, frustrated customers, slow support — the problem is well-documented. And while machine translation has existed for decades, the pipeline has always been clunky: speak, wait, transcribe, translate, wait again, hear a response. The latency alone makes conversations feel broken.
GPT Realtime 2 changes the equation. OpenAI’s Realtime API now supports speech-to-speech processing across more than 70 languages, with low enough latency to feel like a natural conversation. No chunked transcription. No separate translation step. Audio goes in, translated audio comes out — in near real time.
This guide walks through how to build a real-time translation voice agent using GPT Realtime 2, from WebSocket setup to turn detection to handling multilingual routing. Whether you’re building a customer support tool, a conference interpreter, or a multilingual sales assistant, the same core architecture applies.
Understanding the GPT Realtime 2 API
OpenAI’s Realtime API is built around a persistent WebSocket connection. Unlike the standard Chat Completions API, where you send a request and wait for a full response, the Realtime API streams audio bidirectionally in real time.
The model being used here is gpt-4o-realtime-preview — OpenAI’s most capable real-time model at the time of writing. The “Realtime 2” designation refers to the updated 2024/2025 version of this API, which introduced expanded language support, improved turn detection, and more stable multilingual output.
How the audio pipeline works
The basic flow looks like this:
- Your app captures audio from a microphone
- That audio is encoded (typically PCM16 at 24kHz) and sent over WebSocket as base64 chunks
- The model processes the incoming audio stream in real time
- The model responds with audio output (also streamed back as base64)
- Your app decodes and plays that audio
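In code, that loop comes down to a few JSON events. A condensed sketch (Node.js with the ws package; chunk and playbackQueue stand in for your capture and playback plumbing):

// Send one microphone chunk: raw PCM16 at 24kHz, base64-encoded
ws.send(JSON.stringify({
  type: 'input_audio_buffer.append',
  audio: chunk.toString('base64')
}));

// Receive translated audio as it streams back
ws.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'response.audio.delta') {
    playbackQueue.push(Buffer.from(event.delta, 'base64')); // PCM16 out
  }
});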
For a translation agent specifically, you configure the model’s instructions to receive input in one language and respond in another. The model handles both the comprehension and the generation — there’s no separate translation layer.
Session configuration
Before audio flows, you establish a session with a configuration object. This is where you set the model’s behavior:
{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "instructions": "You are a real-time interpreter. The user will speak in Spanish. Translate everything they say into English and respond only in English. Do not add commentary. Translate only.",
    "voice": "alloy",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "whisper-1"
    },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    }
  }
}
The instructions field is where your translation logic lives. You specify the source and target languages, the persona, and any behavioral constraints.
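Once the WebSocket is open, this object is sent as a single session.update event. A minimal sketch (Node.js, ws package assumed; sessionConfig is the JSON object above):

// Apply the translation configuration as soon as the socket opens
ws.on('open', () => {
  ws.send(JSON.stringify(sessionConfig)); // the session.update object above
});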
Prerequisites Before You Build
Before writing a line of code, make sure you have the following in place.
Access and credentials:
- An OpenAI account with Realtime API access (currently requires a paid tier)
- An API key stored as an environment variable (OPENAI_API_KEY)
- Node.js 18+ or Python 3.10+ (this guide uses Node.js for the WebSocket layer)
Understanding of the audio formats:
- The Realtime API uses PCM16 audio at 24kHz sample rate
- Input and output are sent as base64-encoded chunks
- Most browser MediaRecorder implementations will need conversion
Familiarity with WebSockets:
- The connection URL is wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview
- Auth is handled via a header: Authorization: Bearer YOUR_API_KEY
If you’re building for the browser, OpenAI provides an ephemeral token mechanism so you don’t expose your API key on the client side. We’ll cover that in the architecture section.
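For a quick server-side smoke test, where the key never leaves your machine, the connection looks like this (a sketch using the ws package; the OpenAI-Beta header matches the beta protocol version):

import WebSocket from 'ws';

// Server-side connection: the API key travels in a header, never to a client
const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
  {
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'OpenAI-Beta': 'realtime=v1'
    }
  }
);

ws.on('open', () => console.log('Realtime session open'));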
Step-by-Step: Building the Translation Voice Agent
Step 1 — Set up the project structure
Create a new Node.js project and install dependencies:
mkdir voice-translator
cd voice-translator
npm init -y
npm pkg set type=module   # server.js below uses ES module imports
npm install ws dotenv express
Your project structure should look like this:
voice-translator/
├── server.js          # WebSocket proxy + session management
├── public/
│   ├── index.html     # Client UI
│   └── audio.js       # Browser audio capture and playback
├── .env
└── package.json
Step 2 — Build the server-side WebSocket proxy
You don’t want your OpenAI API key exposed in the browser. The standard pattern is to run a lightweight server that creates a session token (or proxies the WebSocket connection) on behalf of the client.
// server.js
import express from 'express';
import dotenv from 'dotenv';
import { createServer } from 'http';

dotenv.config();

const app = express();
app.use(express.static('public'));
app.use(express.json());

// Ephemeral token endpoint: mints a short-lived client secret so the
// browser can connect to OpenAI directly without seeing the real API key
app.post('/session', async (req, res) => {
  const { sourceLang, targetLang } = req.body;
  const response = await fetch('https://api.openai.com/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'gpt-4o-realtime-preview',
      voice: 'alloy',
      instructions: buildTranslationPrompt(sourceLang, targetLang)
    })
  });
  const data = await response.json();
  res.json({ client_secret: data.client_secret });
});

function buildTranslationPrompt(sourceLang, targetLang) {
  return `You are a professional real-time interpreter.
The user will speak in ${sourceLang}.
Translate everything they say into ${targetLang} accurately and immediately.
Preserve the speaker's tone and intent.
Do not add explanations, caveats, or commentary.
Translate only. Respond only in ${targetLang}.`;
}

const server = createServer(app);
server.listen(3000, () => console.log('Server running on port 3000'));
Step 3 — Handle browser audio capture
The browser’s MediaRecorder API doesn’t output PCM16 by default. You’ll need to work with the Web Audio API to capture raw PCM data.
// public/audio.js
class AudioCapture {
  constructor() {
    this.audioContext = null;
    this.mediaStream = null;
    this.processor = null;
    this.onAudioData = null;
  }

  async start() {
    // Match the Realtime API's expected sample rate up front
    this.audioContext = new AudioContext({ sampleRate: 24000 });
    this.mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const source = this.audioContext.createMediaStreamSource(this.mediaStream);

    // ScriptProcessorNode is deprecated but still widely supported;
    // an AudioWorklet is the modern replacement for production use
    this.processor = this.audioContext.createScriptProcessor(4096, 1, 1);
    this.processor.onaudioprocess = (e) => {
      const inputData = e.inputBuffer.getChannelData(0);
      const pcm16 = this.floatToPCM16(inputData);
      const base64 = this.arrayBufferToBase64(pcm16.buffer);
      if (this.onAudioData) {
        this.onAudioData(base64);
      }
    };

    source.connect(this.processor);
    this.processor.connect(this.audioContext.destination);
  }

  // Convert Web Audio float samples (-1..1) to 16-bit signed PCM
  floatToPCM16(floatArray) {
    const pcm = new Int16Array(floatArray.length);
    for (let i = 0; i < floatArray.length; i++) {
      const s = Math.max(-1, Math.min(1, floatArray[i]));
      pcm[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }
    return pcm;
  }

  arrayBufferToBase64(buffer) {
    const bytes = new Uint8Array(buffer);
    let binary = '';
    for (let i = 0; i < bytes.length; i++) {
      binary += String.fromCharCode(bytes[i]);
    }
    return btoa(binary);
  }

  stop() {
    this.processor?.disconnect();
    this.mediaStream?.getTracks().forEach(t => t.stop());
    this.audioContext?.close();
  }
}
Step 4 — Connect to the Realtime API from the browser
With a session token in hand, the browser can connect directly to OpenAI’s WebSocket endpoint:
// public/audio.js (continued)
class TranslationAgent {
  constructor() {
    this.ws = null;
    this.audioCapture = new AudioCapture();
    this.audioQueue = [];
    this.isPlaying = false;
    this.playbackContext = null; // shared output context, created lazily
  }

  async connect(sourceLang, targetLang) {
    // Get an ephemeral token from your server
    const { client_secret } = await fetch('/session', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ sourceLang, targetLang })
    }).then(r => r.json());

    // Browser WebSockets can't set headers, so auth rides in the
    // subprotocol list using the ephemeral client secret
    this.ws = new WebSocket(
      'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
      ['realtime', `openai-insecure-api-key.${client_secret.value}`, 'openai-beta.realtime-v1']
    );

    this.ws.onopen = () => {
      console.log('Connected to GPT Realtime');
      this.startAudio();
    };
    this.ws.onmessage = (event) => {
      const message = JSON.parse(event.data);
      this.handleMessage(message);
    };
    this.ws.onerror = (err) => console.error('WebSocket error:', err);
    this.ws.onclose = () => console.log('Connection closed');
  }

  async startAudio() {
    this.audioCapture.onAudioData = (base64) => {
      if (this.ws?.readyState === WebSocket.OPEN) {
        this.ws.send(JSON.stringify({
          type: 'input_audio_buffer.append',
          audio: base64
        }));
      }
    };
    await this.audioCapture.start();
  }

  handleMessage(message) {
    switch (message.type) {
      case 'response.audio.delta':
        this.queueAudio(message.delta);
        break;
      case 'response.audio.done':
        console.log('Response complete');
        break;
      case 'conversation.item.input_audio_transcription.completed':
        console.log('Transcribed:', message.transcript);
        break;
      case 'error':
        console.error('API error:', message.error);
        break;
    }
  }

  queueAudio(base64Delta) {
    // Decode the base64 delta into raw PCM16 bytes and queue for playback
    const binary = atob(base64Delta);
    const buffer = new ArrayBuffer(binary.length);
    const view = new Uint8Array(buffer);
    for (let i = 0; i < binary.length; i++) {
      view[i] = binary.charCodeAt(i);
    }
    this.audioQueue.push(buffer);
    if (!this.isPlaying) this.playNext();
  }

  async playNext() {
    if (this.audioQueue.length === 0) {
      this.isPlaying = false;
      return;
    }
    this.isPlaying = true;

    // Reuse one output AudioContext rather than creating one per chunk
    this.playbackContext ??= new AudioContext({ sampleRate: 24000 });
    const buffer = this.audioQueue.shift();

    // Convert PCM16 back to float32 for the Web Audio API
    const pcm16 = new Int16Array(buffer);
    const float32 = new Float32Array(pcm16.length);
    for (let i = 0; i < pcm16.length; i++) {
      float32[i] = pcm16[i] / 32768;
    }

    const audioBuffer = this.playbackContext.createBuffer(1, float32.length, 24000);
    audioBuffer.copyToChannel(float32, 0);
    const source = this.playbackContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(this.playbackContext.destination);
    source.onended = () => this.playNext();
    source.start();
  }
}
Step 5 — Build the language selector UI
A minimal HTML interface lets users choose the source and target languages before starting a session:
<!-- public/index.html -->
<!DOCTYPE html>
<html>
<head>
  <title>Real-Time Translator</title>
</head>
<body>
  <h2>Voice Translator</h2>

  <label>Speaking in:</label>
  <select id="sourceLang">
    <option value="Spanish">Spanish</option>
    <option value="French">French</option>
    <option value="German">German</option>
    <option value="Japanese">Japanese</option>
    <option value="Mandarin Chinese">Mandarin Chinese</option>
    <option value="Arabic">Arabic</option>
    <option value="Portuguese">Portuguese</option>
  </select>

  <label>Translate to:</label>
  <select id="targetLang">
    <option value="English">English</option>
    <option value="Spanish">Spanish</option>
    <option value="French">French</option>
    <option value="German">German</option>
  </select>

  <button id="startBtn">Start Translation</button>
  <button id="stopBtn" disabled>Stop</button>
  <div id="status"></div>
  <div id="transcript"></div>

  <script src="audio.js"></script>
  <script>
    const agent = new TranslationAgent();

    document.getElementById('startBtn').onclick = async () => {
      const source = document.getElementById('sourceLang').value;
      const target = document.getElementById('targetLang').value;
      await agent.connect(source, target);
      document.getElementById('startBtn').disabled = true;
      document.getElementById('stopBtn').disabled = false;
      document.getElementById('status').textContent = `Translating ${source} → ${target}`;
    };

    document.getElementById('stopBtn').onclick = () => {
      agent.audioCapture.stop();
      agent.ws?.close();
      document.getElementById('startBtn').disabled = false;
      document.getElementById('stopBtn').disabled = true;
    };
  </script>
</body>
</html>
Handling Turn Detection and Interruptions
One of the strongest features of the GPT Realtime 2 API is server-side Voice Activity Detection (VAD). The model automatically detects when a user starts and stops speaking — no manual button-press required.
How server VAD works
With turn_detection.type set to server_vad, the API:
- Monitors incoming audio for speech activity
- Automatically commits the audio buffer when silence is detected
- Triggers a response without you needing to send an input_audio_buffer.commit event
- Handles natural pauses without cutting off mid-sentence
The key parameters to tune:
| Parameter | Default | Effect |
|---|---|---|
| threshold | 0.5 | Sensitivity to speech (0–1). Lower = more sensitive |
| prefix_padding_ms | 300 | Audio kept before speech starts |
| silence_duration_ms | 500 | How long silence must last before committing |
For translation use cases, a slightly higher silence_duration_ms (600–800ms) tends to work better — speakers often pause mid-sentence across languages, and you don’t want the model cutting off an incomplete thought.
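The adjustment is a single session.update sent after the session is live. The 700ms value below is an illustrative starting point, not an official recommendation:

// Give speakers more room to pause before the model takes its turn
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    turn_detection: {
      type: 'server_vad',
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 700
    }
  }
}));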
Handling interruptions
The Realtime API supports mid-response interruptions. If a user speaks while the model is generating audio, you can detect this and cancel the current response:
// Listen for when the user starts speaking again (inside handleMessage)
if (message.type === 'input_audio_buffer.speech_started') {
  // Cancel the in-progress response on the server
  this.ws.send(JSON.stringify({ type: 'response.cancel' }));
  // Clear locally queued audio; a chunk already playing will finish
  // unless you also stop the active AudioBufferSourceNode
  this.audioQueue = [];
}
This creates a much more natural conversation flow — the translated voice doesn’t keep talking when the original speaker interjects.
Supporting Multiple Language Pairs Dynamically
A production translation agent usually needs to handle more than one fixed language pair. There are two patterns for this.
Pattern 1 — New session per language switch
The simplest approach: when a user switches languages, close the WebSocket and open a fresh session with new instructions. Sessions are cheap to create, and this approach keeps the model’s context clean.
This works well for call center or support scenarios where each caller speaks one language throughout the interaction.
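A sketch of the pattern, built on the TranslationAgent class from Step 4 (switchLanguages is an illustrative helper, not an SDK method):

// Tear down the current session and start a fresh one with new instructions
async function switchLanguages(agent, sourceLang, targetLang) {
  agent.audioCapture.stop();
  agent.ws?.close();
  await agent.connect(sourceLang, targetLang); // mints a new ephemeral token
}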
Pattern 2 — Auto-detect with dynamic instructions
For scenarios where the input language is unknown (a live conference, for example), you can prompt the model to detect and translate:
instructions: `You are a professional interpreter.
Listen to what the user says and automatically detect the language.
Translate everything into English.
If the user speaks English, repeat back clearly.
Never respond in anything other than English.`
This works surprisingly well for major world languages. The model handles detection implicitly — you don’t need a separate language identification step.
Language coverage
GPT-4o’s multilingual capabilities span the 70+ languages the Realtime API supports, with the best performance on:
- European languages: Spanish, French, German, Italian, Portuguese, Dutch, Polish
- East Asian languages: Japanese, Korean, Mandarin Chinese
- Middle Eastern: Arabic, Hebrew, Turkish
- South Asian: Hindi, Bengali, Urdu
Performance degrades for lower-resource languages. For critical applications, test your specific language pair before deploying.
Common Issues and How to Fix Them
Building with real-time audio APIs surfaces a specific set of problems. Here’s what to watch for.
Echo and feedback loops: If your output audio plays through speakers near the microphone, the model will hear its own voice and try to translate it. Always use headphones during development, and implement acoustic echo cancellation in production (available via getUserMedia constraints).
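In the AudioCapture class from Step 3, that is a one-line constraint change:

// Ask the browser for acoustic echo cancellation and noise suppression
this.mediaStream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true }
});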
Audio buffer underruns: Choppy playback often means you’re not receiving audio chunks fast enough to play them continuously. Buffer at least 300–500ms of audio before starting playback.
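One way to do that is to gate playNext on queue depth, sketched here against the queueAudio method from Step 4 (the 400ms threshold is an illustrative midpoint of that range):

queueAudio(base64Delta) {
  // ...decode the base64 delta into an ArrayBuffer as before, then:
  this.audioQueue.push(buffer);

  // 24kHz * 2 bytes per sample ≈ 48 bytes/ms, so 400ms ≈ 19,200 bytes
  const queuedBytes = this.audioQueue.reduce((n, b) => n + b.byteLength, 0);
  if (!this.isPlaying && queuedBytes >= 19200) this.playNext();
}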
Transcription mismatch: If input_audio_transcription is enabled, the transcription happens separately from the translation. The transcript reflects what was actually spoken (in the source language), while the audio response is the translation. This is expected behavior.
Session expiration: Realtime sessions have a maximum duration (typically 30 minutes for the standard tier). Implement reconnection logic with session handoff for long-running applications.
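A minimal version of that logic, replacing the simple onclose handler from Step 4 (it assumes the agent saved its language pair when connect was called; the one-second backoff is illustrative):

// Auto-reconnect: open a fresh session whenever the old one closes
this.ws.onclose = async () => {
  console.log('Session closed; reconnecting...');
  await new Promise(resolve => setTimeout(resolve, 1000)); // fixed backoff
  await this.connect(this.sourceLang, this.targetLang);
};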
Rate limits: The Realtime API has separate rate limits from the standard API. Check your tier’s concurrent session limits before deploying at scale.
Where MindStudio Fits Into This
Building the core WebSocket loop requires code. But wrapping that translation agent in a deployable workflow — routing calls, logging transcripts, sending summaries to CRMs, triggering follow-ups — is where a lot of teams waste time.
MindStudio lets you build the surrounding automation without writing that layer from scratch. You can connect a real-time translation workflow to downstream tools — logging calls to a Google Sheet, summarizing translated conversations and sending them to Slack, or routing transcripts into HubSpot deals — using MindStudio’s visual builder and its 1,000+ integrations.
For teams building multilingual support agents, a practical pattern looks like this:
- The voice agent (built using the approach in this article) handles the real-time translation
- A MindStudio workflow receives the transcript webhook after each call
- The workflow routes the summary to the appropriate CRM contact, sends a translated follow-up email, and logs the interaction to Airtable
MindStudio supports GPT-4o and 200+ other models out of the box, so you can test translation quality across models without managing API keys separately. The average workflow takes under an hour to build.
You can try MindStudio free at mindstudio.ai — no credit card required to start.
Frequently Asked Questions
What languages does GPT Realtime 2 support?
The GPT-4o Realtime API supports more than 70 languages for speech recognition and translation, with coverage for major European, East Asian, Middle Eastern, and South Asian languages. Performance is strongest for high-resource languages like Spanish, French, German, Japanese, and Mandarin. OpenAI continues expanding language support, so check the official documentation for the current list.
How much latency should I expect from a real-time translation voice agent?
In practice, end-to-end latency (from finished speech to hearing the translation) typically falls between 300ms and 800ms on a stable connection. This includes audio processing, model inference, and streaming playback startup. For natural conversation, anything under 1 second is generally acceptable. Network conditions and server load affect this.
Can the GPT Realtime API handle both parties of a bilingual conversation?
Yes, but it requires some architecture work. The simplest setup is two separate WebSocket sessions — one translating in each direction — with audio routing logic to decide which session receives input at any given moment. For push-to-talk applications, this is straightforward. For fully free-flowing conversations, you’ll need robust speaker diarization or physical separation (separate microphones per speaker).
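A push-to-talk sketch of the two-session pattern, reusing the TranslationAgent class from this guide (talkButtonA is an assumed DOM element; the B-side wiring mirrors it):

// Two one-directional sessions: Spanish→English and English→Spanish
const agentAtoB = new TranslationAgent();
const agentBtoA = new TranslationAgent();
await agentAtoB.connect('Spanish', 'English');
await agentBtoA.connect('English', 'Spanish');

// Route microphone audio to exactly one session at a time
talkButtonA.onmousedown = () => agentAtoB.startAudio();
talkButtonA.onmouseup = () => agentAtoB.audioCapture.stop();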
What’s the difference between the Realtime API and using Whisper + GPT-4o + TTS separately?
The chained approach (Whisper transcription → GPT-4o translation → TTS generation) works, but introduces 2–4 seconds of latency per turn because each step runs sequentially. The Realtime API processes audio end-to-end in a single model pass, which is how it achieves sub-second response times. The tradeoff is that the Realtime API is more expensive per minute than the chained approach and has less flexibility for custom post-processing.
Is the GPT Realtime API suitable for production customer-facing applications?
It depends on your scale and reliability requirements. The API is generally available and supported, but real-time WebSocket connections require more careful infrastructure planning than REST calls — you need to handle reconnects, session limits, and concurrent connection caps. For low-to-medium volume (hundreds of concurrent sessions), it’s production-ready. For high-scale deployments, plan for connection pooling and fallback logic.
How do I prevent the model from translating its own audio output?
Use acoustic echo cancellation at the audio capture level. In the browser, pass { audio: { echoCancellation: true, noiseSuppression: true } } as constraints to getUserMedia. In native apps, use the platform’s built-in AEC processing. You can also implement software-level gating — mute the microphone input while output audio is playing — as a simpler fallback.
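The gating fallback is a few lines on top of the startAudio handler from Step 4, using the isPlaying flag the class already tracks:

// Software gate: drop microphone chunks while translated audio is playing
this.audioCapture.onAudioData = (base64) => {
  if (this.isPlaying) return; // mute input during playback
  if (this.ws?.readyState === WebSocket.OPEN) {
    this.ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64 }));
  }
};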
Key Takeaways
Building a real-time translation voice agent with GPT Realtime 2 is more accessible than it has ever been, but it still requires working with WebSockets, audio encoding, and session management. Here’s what matters most:
- Use server-side VAD for natural turn detection — it handles pauses and interruptions better than manual commit logic
- Run a server-side proxy to keep your API key off the client; ephemeral tokens are the right pattern for browser-based apps
- Tune silence detection for your specific language pair — translation workflows often need slightly longer pause thresholds than general conversation
- Handle interruptions explicitly — cancel in-progress responses when new speech starts for a natural feel
- Separate real-time audio logic from business workflow logic — the latter is where tools like MindStudio save significant time
The core voice translation loop described here can be running in an afternoon. The work beyond that — deployment, logging, CRM integration, multilingual routing — is where teams typically spend more time than they expect.