How to Build a Voice Agent with Real-Time Translation Using OpenAI's API
OpenAI's Realtime API supports 70+ input languages with live speech translation. Learn how to build a multilingual voice agent on top of it.
What Real-Time Voice Translation Actually Means Now
Building a multilingual voice agent used to require a chain of separate services: speech-to-text, a translation API, text-to-speech, and something to stitch them all together. Latency was a constant problem. Conversations felt robotic. Errors compounded at every handoff.
OpenAI’s Realtime API changes that picture significantly. By processing speech input and generating spoken output in a single model pass, it eliminates most of that multi-step friction. The result is a voice agent with real-time translation that feels close to natural conversation — not a slow, mechanical relay.
This guide covers how that API works, what you need to build a multilingual voice agent on top of it, and how to think through the architecture decisions that actually matter.
How OpenAI’s Realtime API Works
OpenAI released the Realtime API as part of the GPT-4o model family. Unlike the standard Chat Completions API, which returns text responses after a full round-trip, the Realtime API streams audio input and output over a persistent WebSocket connection.
Here’s what that means in practice:
- Low-latency interaction — The model starts responding before it finishes “hearing” the full input.
- Native audio understanding — The model processes raw audio directly, not a transcript. It can detect tone, pacing, and emotion.
- Voice Activity Detection (VAD) — Built-in turn detection so you don’t have to manually manage start/stop recording logic.
- Interruption handling — Users can speak over the agent and it will respond appropriately, like a real conversation.
For translation use cases, this architecture matters because it removes the transcription-then-translate-then-synthesize loop that introduced hundreds of milliseconds of lag in older approaches.
What the API Accepts and Returns
The Realtime API handles bidirectional audio — you send PCM audio chunks over the WebSocket and receive audio back. You can also send text events and receive text alongside audio, which is useful for displaying transcripts or logs.
Key event types include:
- conversation.item.create — Send a message (audio or text)
- response.create — Trigger a model response
- input_audio_buffer.append — Stream audio chunks in
- response.audio.delta — Receive audio chunks as they generate
- response.audio_transcript.delta — Receive text transcriptions in real time
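As a rough illustration of the event flow, here is a minimal sketch of queuing a text message and requesting a spoken response. It assumes an already-open connection named openaiWs (connection setup is covered in Step 2 below):
// Minimal sketch: queue a user message, then ask the model to respond.
// Assumes `openaiWs` is an open WebSocket to the Realtime API (see Step 2).
openaiWs.send(JSON.stringify({
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [{ type: "input_text", text: "Bonjour, comment allez-vous ?" }]
  }
}));
openaiWs.send(JSON.stringify({ type: "response.create" }));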
The model you’ll use is gpt-4o-realtime-preview. It supports over 70 input languages for speech recognition and can respond in the language you configure.
Architecture for a Multilingual Voice Agent
Before writing any code, it helps to map out what your agent actually needs to do. A real-time translation voice agent generally handles one of two scenarios:
Scenario A: Single-user, cross-language interface
A user speaks in their native language, and the agent understands and responds in a target language (or the same language, depending on config). Think: a customer service bot that accepts input in French but always responds in English.

Scenario B: Two-way interpreter mode
Two people speak different languages. The agent listens to each speaker, translates, and speaks the translation back so both parties can understand each other.
Scenario B is more complex — you need to track speakers, detect language dynamically, and manage turn-taking carefully. For most production use cases, start with Scenario A.
Core Components You Need
- Frontend audio capture — A browser-based or mobile client that records microphone input and streams it over WebSocket.
- WebSocket connection handler — A lightweight server (Node.js or Python) that manages the session with OpenAI’s Realtime API.
- Session configuration — System prompt and voice settings that define the agent’s behavior.
- Audio playback — Client-side logic to receive and play back the agent’s audio response.
- (Optional) Transcript display — A UI layer that shows what was said and what the agent responded.
Step-by-Step: Building the Voice Agent
Step 1: Set Up Your Environment
You’ll need:
- An OpenAI API key with access to gpt-4o-realtime-preview
- Node.js 18+ (for the server) or Python 3.9+
- A browser environment that supports the Web Audio API and WebSockets
Install the OpenAI SDK:
npm install openai ws
Or for Python:
pip install openai websockets
Step 2: Establish the WebSocket Session
OpenAI’s Realtime API uses a persistent WebSocket connection. You’ll open this from your server, not directly from the browser — this keeps your API key secure.
Here’s a basic Node.js server that opens the connection and proxies audio between the browser and OpenAI:
import WebSocket from "ws";

const OPENAI_API_KEY = process.env.OPENAI_API_KEY;

const openaiWs = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

openaiWs.on("open", () => {
  // Configure the session
  openaiWs.send(JSON.stringify({
    type: "session.update",
    session: {
      modalities: ["text", "audio"],
      instructions: "You are a real-time translator. The user will speak in any language. Translate what they say into English and respond in English.",
      voice: "alloy",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      input_audio_transcription: {
        model: "whisper-1"
      },
      turn_detection: {
        type: "server_vad",
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 500
      }
    }
  }));
});
The instructions field is where you control translation behavior. This is your system prompt, and it does most of the heavy lifting.
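The snippet above only covers the OpenAI side. To act as the proxy described earlier, the same server also needs to accept a connection from the browser and forward events in both directions. A minimal sketch of one way to do that (the port number and the single-session assumption are illustrative, not prescriptive):
import { WebSocketServer } from "ws";

// Accept a browser client and relay raw event JSON in both directions.
// This sketch assumes a single active session; see the concurrency note later.
const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (browserWs) => {
  // Browser -> OpenAI
  browserWs.on("message", (data) => {
    if (openaiWs.readyState === WebSocket.OPEN) {
      openaiWs.send(data.toString());
    }
  });

  // OpenAI -> Browser
  openaiWs.on("message", (data) => {
    if (browserWs.readyState === WebSocket.OPEN) {
      browserWs.send(data.toString());
    }
  });
});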
Step 3: Write an Effective System Prompt for Translation
The system prompt is what turns a general-purpose voice model into a translation agent. Be specific about behavior.
A solid starting point for a translation assistant:
You are a real-time voice translation assistant.
When the user speaks, detect the language they are using automatically.
Translate their speech into [TARGET LANGUAGE] and speak the translation clearly.
Do not add explanations, commentary, or filler phrases — only the translation.
If the input is already in [TARGET LANGUAGE], repeat it as-is without changes.
Maintain the tone and register of the original speech (formal, casual, technical).
For two-way interpretation mode, you’d add speaker identification logic and more complex routing rules. In practice, this is where most of the product design work happens — the API itself is the easier part.
Step 4: Stream Audio from the Browser
On the client side, you’ll capture microphone audio and send it to your server, which relays it to OpenAI.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 24000 });
const source = audioContext.createMediaStreamSource(stream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (event) => {
  const inputData = event.inputBuffer.getChannelData(0);
  const pcm16 = convertFloat32ToPCM16(inputData);
  const base64Audio = btoa(String.fromCharCode(...new Uint8Array(pcm16.buffer)));

  serverWs.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: base64Audio
  }));
};

source.connect(processor);
processor.connect(audioContext.destination);
OpenAI’s Realtime API expects 16-bit PCM audio at 24kHz, single channel. The convertFloat32ToPCM16 function converts from the browser’s native float32 format.
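The convertFloat32ToPCM16 helper isn't shown above; one simple way to write it (a sketch, not the only valid approach):
// Convert Web Audio float32 samples (range -1..1) to 16-bit signed PCM.
function convertFloat32ToPCM16(float32Array) {
  const pcm16 = new Int16Array(float32Array.length);
  for (let i = 0; i < float32Array.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm16;
}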
Step 5: Play Back the Agent’s Response
Audio comes back from the API as base64-encoded PCM chunks via response.audio.delta events, which your server relays to the browser. On the client, you'll need to decode and queue these for playback:
serverWs.addEventListener("message", (message) => {
  const event = JSON.parse(message.data);

  if (event.type === "response.audio.delta") {
    // Decode the base64 PCM chunk and queue it for playback
    const audioData = base64ToFloat32(event.delta);
    playAudioChunk(audioData);
  }

  if (event.type === "response.audio_transcript.delta") {
    // Update your UI with the transcript
    transcriptDisplay.textContent += event.delta;
  }
});
For smooth playback, use an audio queue and the Web Audio API’s AudioBufferSourceNode to schedule chunks in sequence without gaps.
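A sketch of what those pieces might look like. The names base64ToFloat32 and playAudioChunk come from the snippet above; the implementation below is one reasonable approach, not the only one:
// Decode a base64 string of 16-bit PCM into float32 samples for Web Audio.
function base64ToFloat32(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  const pcm16 = new Int16Array(bytes.buffer);
  const float32 = new Float32Array(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) float32[i] = pcm16[i] / 0x8000;
  return float32;
}

// Schedule each chunk to start exactly when the previous one ends.
const playbackCtx = new AudioContext({ sampleRate: 24000 });
let nextStartTime = 0;

function playAudioChunk(float32Samples) {
  const buffer = playbackCtx.createBuffer(1, float32Samples.length, 24000);
  buffer.copyToChannel(float32Samples, 0);
  const source = playbackCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(playbackCtx.destination);
  nextStartTime = Math.max(nextStartTime, playbackCtx.currentTime);
  source.start(nextStartTime);
  nextStartTime += buffer.duration;
}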
Step 6: Handle Edge Cases and Errors
Real-time audio pipelines have predictable failure modes. Plan for these:
- Connection drops — WebSocket connections can close unexpectedly. Implement reconnect logic with exponential backoff.
- Audio interruptions — If a user starts speaking mid-response, send a response.cancel event and flush the audio buffer before resuming (see the sketch after this list).
- Language detection failures — If the model can't detect the language (very short input, background noise), it may respond in English by default. Add a fallback instruction in your system prompt.
- Rate limits — The Realtime API has separate rate limits from the standard API. Monitor usage, especially for multi-user deployments.
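A minimal sketch of the interruption case. The event names (response.cancel, input_audio_buffer.clear) are the Realtime API's; serverWs, playbackCtx, and nextStartTime are assumed from the earlier sketches:
// When the user barges in (browser side): ask the model to stop the current
// response and clear any un-committed input audio, then drop queued playback.
function handleUserInterruption() {
  serverWs.send(JSON.stringify({ type: "response.cancel" }));
  serverWs.send(JSON.stringify({ type: "input_audio_buffer.clear" }));
  // Reset the playback queue so stale translated audio doesn't keep playing.
  nextStartTime = playbackCtx.currentTime;
}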
Supported Languages and What to Expect
The underlying speech recognition in the Realtime API is powered by Whisper, which supports over 70 languages. Language support varies in quality — European languages, Japanese, Chinese, and Korean are strong. Less commonly spoken languages may have higher error rates.
Languages with strong support include:
- English, Spanish, French, German, Italian, Portuguese
- Japanese, Mandarin Chinese, Korean
- Arabic, Hindi, Russian, Dutch, Polish
- Swedish, Norwegian, Danish, Finnish
- Turkish, Indonesian, Vietnamese, Thai
For production deployments handling specific language pairs, test your target languages thoroughly with real speakers and real audio conditions — not just clean recordings.
One important note: the model’s translation quality is tied to how well it understands the source language, not just whether it can transcribe it. Some language pairs are stronger than others, and domain-specific vocabulary (medical, legal, technical) may introduce errors even in well-supported languages.
Where MindStudio Fits Into This Stack
Building the core WebSocket integration is the technical foundation, but shipping a real product usually means wrapping that core in a broader workflow: logging conversations, routing users to human agents, connecting to CRMs, sending follow-up emails, or triggering actions based on what was said.
That’s where MindStudio becomes useful. MindStudio is a no-code platform for building AI agents and automated workflows, with native access to 200+ AI models — including OpenAI’s GPT-4o family — without needing to manage API keys or separate accounts.
If you’re building a voice translation agent for a business context, you can use MindStudio to build the surrounding intelligence layer:
- Post-call processing — After a translated conversation ends, pipe the transcript into a MindStudio workflow that summarizes the call, extracts action items, and logs them to Salesforce or HubSpot.
- Multi-step routing — Build agents that detect the topic of the conversation and route to different workflows (billing issue vs. technical support vs. sales inquiry).
- Notification and follow-up — Automatically send translated conversation summaries to the right team members via Slack or email.
- Prototype the whole agent fast — MindStudio’s visual builder lets you prototype and test voice-enabled AI workflows in a fraction of the time it takes to wire everything up from scratch.
The MindStudio workflow builder supports custom JavaScript and Python functions, so you can drop in your own Realtime API integration and connect it to the 1,000+ pre-built integrations for business tools — without rebuilding the plumbing every time.
You can try MindStudio free at mindstudio.ai.
Common Mistakes to Avoid
Using the Wrong Audio Format
The Realtime API is strict about audio format. It expects PCM16 at 24kHz, mono. Sending stereo audio, wrong sample rates, or MP3/AAC-encoded audio will produce silent or garbled responses. Check your audio pipeline carefully before debugging model behavior.
Over-Engineering the System Prompt
Translation tasks work best with clear, minimal instructions. Adding too many edge case rules to the system prompt often causes the model to over-think simple inputs. Start simple, test with real audio, then add complexity only where you see actual failures.
Ignoring Latency in UX Design
Even with the Realtime API’s low-latency architecture, there will be perceptible pauses — especially on slower connections or for longer utterances. Design your UI to handle this gracefully. A “listening” indicator and a “processing” state make the experience feel responsive even when there’s a brief delay.
Not Testing With Accented Speech
Whisper-based models perform well on a range of accents, but performance can vary. Test your agent with speakers who have regional accents in your target languages. This is especially important for deployments in multilingual regions where accent diversity is high.
Forgetting About Concurrent Users
A single WebSocket connection handles one conversation at a time. For multi-user deployments, you need to manage connection pools and ensure each user has their own session. Plan your server architecture for concurrency from the start, not as an afterthought.
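One common pattern is to open a dedicated Realtime connection per browser client rather than sharing one. A rough sketch, reusing the WebSocketServer idea from Step 2 (names and structure are illustrative):
// Give every browser client its own OpenAI Realtime session.
wss.on("connection", (browserWs) => {
  const session = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    {
      headers: {
        Authorization: `Bearer ${OPENAI_API_KEY}`,
        "OpenAI-Beta": "realtime=v1",
      },
    }
  );

  // Relay in both directions once the upstream session is open.
  browserWs.on("message", (data) => {
    if (session.readyState === WebSocket.OPEN) session.send(data.toString());
  });
  session.on("message", (data) => {
    if (browserWs.readyState === WebSocket.OPEN) browserWs.send(data.toString());
  });

  // Tear down the upstream session when the user disconnects.
  browserWs.on("close", () => session.close());
});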
Frequently Asked Questions
What languages does the OpenAI Realtime API support?
The Realtime API uses Whisper for speech recognition, which supports over 70 languages. Support quality varies — widely spoken languages like English, Spanish, French, Japanese, and Mandarin have the strongest performance. You can find the full list of supported languages in OpenAI’s Whisper documentation. For translation output, the model can respond in any language it was trained on, which broadly covers all major world languages.
How much does the OpenAI Realtime API cost?
Pricing for the Realtime API is based on audio input and output tokens, which are priced differently from text tokens. Audio input costs approximately $0.10 per minute, and audio output costs approximately $0.20 per minute (pricing subject to change — check OpenAI’s current pricing page for exact figures). For high-volume deployments, this can add up quickly, so monitoring usage from the start is important.
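As a rough worked example at those approximate rates, a 10-minute conversation split evenly between user speech and agent speech would cost in the neighborhood of 5 × $0.10 + 5 × $0.20 = $1.50, before any text tokens.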
What’s the difference between the Realtime API and using Whisper + GPT-4 + TTS separately?
The chained approach (Whisper → GPT-4 → TTS) introduces latency at every step, typically 2–4 seconds end-to-end. The Realtime API collapses those three steps into a single model that processes audio input and generates audio output in one pass, reducing latency to under a second in good conditions. It also gives the model access to audio-native features like tone and emotion detection, which are lost when you transcribe first.
Can the Realtime API handle two-way conversations between two languages?
Yes, but you need to handle the speaker detection and language routing yourself. The API doesn’t natively identify which speaker is which — that’s application logic. For a true interpreter use case (Person A speaks Spanish, Person B speaks English, and the agent translates each side), you’d need to build in push-to-talk controls or speaker identification to manage turn-taking. It’s achievable, but requires careful UX design.
Do I need a backend server, or can I connect directly from the browser?
For production use, you need a backend server. Connecting directly from the browser would expose your OpenAI API key to anyone who views your page source. Your server acts as a secure proxy: it holds the API key, manages the WebSocket connection to OpenAI, and forwards audio between the browser and the API. For prototyping only, some developers connect directly, but this is never appropriate for public-facing apps.
How do I handle audio quality issues in real-world environments?
Background noise, poor microphones, and codec compression all affect accuracy. A few practical steps: set a reasonable VAD threshold (0.5–0.7 works for most environments), add noise suppression on the client side using the browser’s noiseSuppression constraint in getUserMedia, and design your system prompt to ask the model to handle unclear input gracefully (e.g., “If you cannot understand the input clearly, ask the user to repeat”).
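For the client-side noise suppression mentioned above, the standard getUserMedia constraints look roughly like this:
// Ask the browser for mono audio with built-in noise suppression and echo cancellation.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    noiseSuppression: true,
    echoCancellation: true,
    channelCount: 1,
  },
});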
Key Takeaways
- OpenAI’s Realtime API enables low-latency voice agents by processing audio in a single model pass, eliminating the chained STT → LLM → TTS lag.
- The API supports 70+ input languages via Whisper-based recognition and can respond in any language the model covers.
- Building a translation agent requires a WebSocket server (not a direct browser connection), careful audio format handling, and a well-crafted system prompt.
- Real-world deployments need to account for accent diversity, concurrent connections, connection resilience, and UX design for perceptible latency.
- For production use, the core API integration is just the beginning — surrounding workflows (logging, routing, CRM integration, notifications) often require as much thought as the core voice functionality.
- Tools like MindStudio make it faster to build and connect the surrounding automation layer without rebuilding common infrastructure from scratch.
If you’re ready to build a working AI agent without wiring up every piece manually, MindStudio’s no-code builder is a good place to start — you can have a working prototype running in under an hour.