GPT Realtime Voice Models Explained: GPT Realtime 2, Translate, and Whisper
OpenAI released three new realtime voice models via API. Here's what GPT Realtime 2, Realtime Translate, and Realtime Whisper do and when to use each.
Three New Voice Models, One API — What OpenAI Just Changed
OpenAI shipped three distinct realtime voice models via its API, and each one solves a different problem. If you’ve been trying to figure out which one to use — GPT Realtime 2, Realtime Translate, or the updated Whisper transcription models — this breakdown covers exactly that.
The release matters because voice AI has historically involved stitching together multiple components: a speech-to-text model, a language model, and a text-to-speech engine. That pipeline introduced latency, fragmentation, and complexity. OpenAI’s realtime models collapse some of that, but they don’t all work the same way. Knowing what each does — and what it doesn’t — is what makes the difference between choosing the right model and burning money on the wrong one.
What These Models Actually Are (and Why They’re Different)
Before getting into each model individually, it helps to understand the underlying distinction between them. OpenAI now offers two architectures for voice:
Audio-in / audio-out (end-to-end): The model receives raw audio and responds with raw audio. There’s no intermediate text step. This is how the Realtime API works. It preserves tone, pacing, and emotional nuance because nothing is being converted to text and back again.
Speech-to-text (transcription-first): Audio goes in, text comes out. You can then pass that text to a language model or store it as a transcript. This is the Whisper path.
GPT Realtime 2 and Realtime Translate live in the first category. The new GPT-4o-based transcription models (often grouped under the “Whisper” family in documentation) live in the second.
They’re not interchangeable. They’re for different workflows.
GPT Realtime 2: Updated Low-Latency Voice Conversations
GPT Realtime 2 refers to the updated generation of OpenAI’s gpt-4o-realtime-preview model, which powers the Realtime API. This is the model behind genuinely interactive, spoken conversations — the kind where latency matters because a half-second delay feels wrong.
How It Works
The Realtime API maintains a persistent WebSocket connection between the client and the model. Audio streams in continuously, the model processes it in near real-time, and audio streams back out. Interruptions are handled gracefully — if someone starts talking while the model is responding, it stops and adjusts, similar to how a real conversation works.
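To make the shape of this concrete, here's a minimal sketch of a Realtime session in Python, assuming the documented WebSocket endpoint and event names. The model identifier and audio file are placeholders, so check the Realtime API reference for current values.

```python
# Minimal Realtime API sketch: open a WebSocket session, configure it, stream
# one chunk of audio in, and print the event types that come back.
# Assumes the `websockets` package; the model name below is a placeholder.
import asyncio, base64, json, os
import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # `additional_headers` is the current keyword; older websockets releases use `extra_headers`
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # Configure the session: voice, plus server-side turn detection (VAD)
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": "alloy", "turn_detection": {"type": "server_vad"}},
        }))

        # Stream a chunk of 16-bit PCM audio (in practice this comes from a mic loop)
        with open("user_audio.pcm", "rb") as f:
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(f.read()).decode(),
            }))

        # With server VAD the model commits the buffer and responds on its own;
        # just watch the events come back.
        async for message in ws:
            event = json.loads(message)
            print(event["type"])  # e.g. response.audio.delta, response.done
            if event["type"] == "response.done":
                break

asyncio.run(main())
```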
GPT Realtime 2 improves on the first generation in a few ways:
- Lower latency — Response times are faster, particularly in noisy environments or when audio input isn’t clean
- Better turn-taking — The model is more accurate at detecting when someone has finished speaking versus just pausing
- Improved audio quality — Output audio sounds more natural, with better prosody
- Broader language support — More languages are now supported in the end-to-end audio pipeline
When to Use It
GPT Realtime 2 is the right choice when the experience itself is conversational. Think:
- Voice agents and AI phone systems
- Real-time customer service bots
- Interactive voice response (IVR) replacements
- Live AI tutors or coaching tools
- Any application where latency above ~300ms would feel awkward
The tradeoff is cost and complexity. Realtime API pricing is higher than standard completions because audio tokens are expensive to process. And since the conversation is stateful (maintained over a WebSocket), you need to manage session lifecycle carefully.
GPT Realtime Translate: Live Speech-to-Speech Translation
GPT Realtime Translate is a dedicated variant of the Realtime API optimized specifically for speech translation — converting spoken audio in one language directly to spoken audio in another, in real time.
This is different from what you might build yourself by chaining Whisper (transcription) + GPT-4o (translation) + TTS (text-to-speech). That pipeline works, but it introduces 1–3 seconds of latency per exchange. Realtime Translate collapses that into a single, continuous operation.
How It Works
Like GPT Realtime 2, this model runs over a persistent WebSocket. You stream audio in, specify a target language, and receive translated audio back. The model handles:
- Detecting the source language automatically (or you can specify it)
- Translating at the semantic level, not word-by-word
- Generating natural-sounding output in the target language
- Preserving approximate speaking rhythm and pacing
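The dedicated translate variant exposes its own session parameters, which aren't reproduced here. As a rough illustration of the general pattern, you can approximate spoken translation over the standard Realtime API with a session instruction like the sketch below; the model, voice, and wording are assumptions, not the translate-specific schema.

```python
# Hypothetical session setup for speech-to-speech translation over the Realtime API.
# The instruction text and voice are illustrative; the dedicated translate variant
# has its own parameters documented in OpenAI's Realtime API reference.
import json

session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "Translate everything the user says into Spanish. "
            "Respond only with the translation, spoken naturally."
        ),
        "voice": "alloy",
        "turn_detection": {"type": "server_vad"},
    },
}
# await ws.send(json.dumps(session_update))  # reuse the WebSocket pattern shown earlier
```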
When to Use It
Realtime Translate is designed for scenarios where you need fast, spoken translation and the source language is variable or unknown:
- Live multilingual support agents (a rep speaks English, a customer hears Spanish)
- Real-time conference or meeting translation tools
- International customer service automation
- Language learning applications with spoken feedback
- Accessibility tools for multilingual environments
One thing worth noting: the model is optimized for fluency and low latency, not verbatim accuracy. If you need a precise legal or medical transcript in a different language, a slower, more deliberate transcription-then-translation pipeline will serve you better. Realtime Translate is for natural conversation flow, not precision documentation.
The New Whisper Models: GPT-4o Transcribe and GPT-4o Mini Transcribe
The “Whisper” side of this release refers to two new speech-to-text models: gpt-4o-transcribe and gpt-4o-mini-transcribe. These replace — or more accurately, supplement — the existing Whisper v2 and v3 models that have been available via the API.
The key difference is that these aren’t the original Whisper architecture. They’re built on the GPT-4o model family, which means they benefit from GPT-4o’s language understanding capabilities during transcription — not just pattern-matching audio to text, but actually comprehending what’s being said.
What’s Improved Over Whisper v3
Word error rate: Both gpt-4o-transcribe and gpt-4o-mini-transcribe show meaningfully lower word error rates, particularly on:
- Accented speech
- Technical vocabulary and jargon
- Overlapping speakers
- Low-quality or noisy audio
Context awareness: Because these models draw on GPT-4o’s language model backbone, they handle ambiguous words and phrases better. “Their, there, they’re” gets resolved correctly more often based on context.
Punctuation and formatting: Output is cleaner and more consistently formatted out of the box, which reduces post-processing work.
Speed: gpt-4o-mini-transcribe is specifically designed for high-throughput, cost-efficient transcription where you’re processing large volumes of audio and don’t need the highest possible accuracy.
When to Use Each Transcription Model
| Model | Best for |
|---|---|
| gpt-4o-transcribe | High-stakes transcription — medical notes, legal recordings, customer calls where accuracy is critical |
| gpt-4o-mini-transcribe | Bulk transcription jobs, real-time captions where speed and cost matter more than perfection |
| Whisper v3 (existing) | Legacy integration compatibility, well-established benchmarks, local deployment via open-source builds |
The gpt-4o-mini-transcribe model is particularly interesting for developers building tools that need fast captions or live note-taking. It won’t catch every nuance, but it’s fast and significantly cheaper per audio minute than the full model.
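For reference, a transcription call through the official openai Python SDK looks like the sketch below; the file name is a placeholder, and the model string should match whatever identifier OpenAI currently lists.

```python
# Basic transcription call using the official `openai` Python SDK.
# Swap the model for gpt-4o-transcribe when accuracy matters more than cost.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("support_call.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
    )

print(transcript.text)
```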
Comparing All Three: Which One Do You Actually Need?
Here’s a practical side-by-side view of the three model types:
| | GPT Realtime 2 | Realtime Translate | GPT-4o Transcribe |
|---|---|---|---|
| Input | Audio (streaming) | Audio (streaming) | Audio (file or stream) |
| Output | Audio (streaming) | Audio (streaming, different language) | Text transcript |
| Architecture | End-to-end audio | End-to-end audio | Speech-to-text |
| Latency | Very low (~200–400ms) | Very low (~200–400ms) | Moderate (batch) or low (streaming) |
| Use case | Conversational AI | Multilingual voice | Transcription, notes, captions |
| State management | WebSocket (stateful) | WebSocket (stateful) | Stateless (file upload or stream) |
| Cost tier | High | High | Moderate / Low (mini) |
The clearest rule of thumb: if you need spoken responses, use the Realtime API models. If you need text output from audio, use the transcription models.
Don’t over-engineer it. Some developers reach for the Realtime API when they actually just need transcription. The Realtime API is more expensive and requires managing a persistent connection — that complexity is only worth it when the experience demands low-latency voice interaction.
Common Integration Patterns
Voice Agent with Interruption Handling
The most common pattern for GPT Realtime 2 is a voice agent that can be interrupted. A user asks a question, the model starts responding, the user interjects — and the model handles it gracefully.
This requires:
- A WebSocket connection to the Realtime API
- Continuous audio input streaming
- Server-side voice activity detection (VAD) or client-side VAD
- Handling `input_audio_buffer.speech_started` events to trigger interruption logic
OpenAI’s Realtime API documentation covers the event lifecycle in detail. The main thing to get right is interruption: cutting off audio output cleanly when the user speaks again is what separates a good voice agent from an annoying one.
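A rough sketch of that interruption branch, reusing the event names above; the playback object is a stand-in for whatever audio-output layer you use.

```python
# Inside the Realtime API event loop: when the server detects new user speech,
# cancel the in-flight response and stop local playback immediately.
# `playback` is a stand-in for your audio-output layer.
import base64, json

async def handle_event(ws, event, playback):
    etype = event["type"]

    if etype == "input_audio_buffer.speech_started":
        # User started talking over the model: cancel the current response...
        await ws.send(json.dumps({"type": "response.cancel"}))
        # ...and flush whatever audio is still queued on the client side.
        playback.stop()

    elif etype == "response.audio.delta":
        # Normal case: decode and queue the next chunk of model speech.
        playback.play(base64.b64decode(event["delta"]))
```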
Live Transcription Pipeline
For use cases like meeting transcription or customer call logging, the pattern is simpler:
- Stream or upload audio to `gpt-4o-transcribe`
- Receive segmented text output with timestamps
- Post-process for speaker diarization if needed (this isn’t built in — you’ll need a separate diarization step or a third-party tool)
If you need speaker labels, you’ll still want to layer in a diarization solution. The transcription models output text, not speaker attribution.
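If you do need speaker labels, one common approach is to run pyannote.audio over the same file and match its speaker turns against your transcript's segment timestamps. A rough sketch, assuming pyannote's pretrained diarization pipeline and a Hugging Face access token:

```python
# Rough sketch: layer pyannote.audio speaker diarization on top of a transcript.
# Requires the `pyannote.audio` package and a Hugging Face token for the pretrained pipeline.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # placeholder token
)

diarization = pipeline("support_call.wav")

# Each turn gives a time range and a speaker label; match these ranges against
# your transcript segment timestamps to attribute text to speakers.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:5.1f}s - {turn.end:5.1f}s  {speaker}")
```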
Multilingual Support Bot
The most interesting pattern for Realtime Translate is a multilingual support workflow where:
- A customer calls or messages in their native language
- The agent receives their audio via the Realtime API with translation enabled
- The agent responds in English (or any specified language)
- The customer hears a translated response in their language
This eliminates the need for human interpreters on routine support calls and dramatically expands the reach of an existing support team.
How MindStudio Fits Into Realtime Voice Workflows
Building directly on the OpenAI Realtime API requires managing WebSocket connections, handling event streams, and wiring up audio I/O — all of which takes real engineering time. For teams that want to use these models without building that infrastructure from scratch, MindStudio is worth looking at.
MindStudio’s no-code builder gives you access to the latest OpenAI audio and voice models without needing to configure API keys or manage session state manually. You can build AI agents on MindStudio that incorporate transcription, voice response, and workflow logic — in the same visual builder, without juggling multiple services.
A practical example: say you want a voice-driven customer support agent that transcribes calls, routes issues based on content, and logs everything to your CRM. In a raw API setup, that’s Realtime API + Whisper + GPT-4o + your CRM integration — four different systems to coordinate. In MindStudio, that’s a single workflow with those capabilities wired together, plus pre-built integrations with tools like HubSpot and Salesforce already available.
If you’re a developer who prefers working in code, MindStudio’s Agent Skills Plugin (the @mindstudio-ai/agent npm SDK) lets you call MindStudio workflows from within agent frameworks like LangChain or CrewAI. So you can keep your custom logic in code while delegating the voice processing and integration plumbing to MindStudio.
You can try MindStudio free at mindstudio.ai.
Access, Pricing, and Availability
All three model types are available through the OpenAI API, though access tiers and pricing differ.
GPT Realtime 2 / Realtime API:
- Available to developers with API access
- Priced per audio token (input and output), with text tokens billed separately
- Audio input is currently priced at $100 per million tokens; audio output at $200 per million tokens (check OpenAI’s pricing page for current rates — these change)
- The Realtime API requires a persistent WebSocket connection — it’s not a standard REST endpoint
Realtime Translate:
- Available as a mode within the Realtime API
- Same pricing structure as GPT Realtime 2
- Target language is specified via session parameters
GPT-4o Transcribe / Mini Transcribe:
- Available via the `/audio/transcriptions` endpoint
- Priced per minute of audio, significantly lower than Realtime API costs
- `gpt-4o-mini-transcribe` is the cheaper option for high-volume jobs
For most developers, the transcription models are the accessible, affordable starting point. The Realtime API models are more specialized and more expensive — reserved for applications where the real-time voice experience justifies the cost.
Frequently Asked Questions
What is GPT Realtime 2 and how is it different from the original Realtime API?
GPT Realtime 2 is the updated generation of OpenAI’s gpt-4o-realtime-preview model, available through the Realtime API. It offers lower latency, better turn-detection, improved audio output quality, and broader language support compared to the original release. It still uses the same WebSocket-based architecture — the improvements are under the hood in model performance.
Can GPT Realtime Translate replace a human interpreter?
For routine, conversational use cases — customer support, basic information exchange, casual conversation — it’s capable enough to handle a large percentage of interactions without human review. For high-stakes scenarios like medical consultations, legal proceedings, or diplomatic contexts, it shouldn’t replace human interpreters. The model is optimized for fluency and speed, not verbatim precision or domain-specific accuracy.
What’s the difference between Whisper v3 and gpt-4o-transcribe?
Whisper v3 is a standalone speech-to-text model based on OpenAI’s original Whisper architecture. gpt-4o-transcribe is built on the GPT-4o model family, which gives it stronger language understanding during transcription. In practice, gpt-4o-transcribe shows lower word error rates — especially on accented speech, technical vocabulary, and noisy audio — but it costs more per minute than Whisper v3. Whisper v3 is also available as an open-source model you can run locally; gpt-4o-transcribe is API-only.
When should I use the Realtime API versus a transcription + LLM pipeline?
Use the Realtime API when you need genuine real-time spoken conversation with latency under ~500ms — voice agents, phone systems, live interactive tools. Use a transcription + LLM pipeline when you’re processing recorded audio, don’t need a spoken response, or want to reduce cost and complexity. Most use cases that involve analyzing or responding to audio content (meeting notes, call summaries, support ticket generation) are better served by the transcription path.
Do these models support speaker diarization (identifying who said what)?
Not natively. The GPT-4o transcription models output text transcripts but don’t label individual speakers. For speaker diarization, you’ll need to use a separate service — options include Pyannote, AssemblyAI, or Deepgram — and layer that on top of your transcription output. This is a common pattern for meeting transcription tools.
How do I get access to these models through the API?
You need an OpenAI API account with billing enabled. The Realtime API (gpt-4o-realtime-preview) requires a WebSocket connection rather than a standard HTTP request — most API client libraries have been updated to support this. The transcription models (gpt-4o-transcribe, gpt-4o-mini-transcribe) are available through the standard /audio/transcriptions REST endpoint and can be called the same way as existing Whisper endpoints. Check OpenAI’s documentation for the current model identifiers, as these change with new preview releases.
Key Takeaways
- GPT Realtime 2 is for low-latency spoken conversation — voice agents, phone systems, and interactive tools where latency matters
- GPT Realtime Translate handles real-time speech-to-speech translation within a single streaming session, eliminating the need to chain transcription + translation + TTS
- GPT-4o Transcribe and Mini Transcribe are improved speech-to-text models with better accuracy than Whisper v3, especially on accented speech and technical language
- The Realtime API uses a persistent WebSocket connection; the transcription models use standard REST endpoints — pick the right architecture for your use case
- Cost is a meaningful variable: Realtime API audio tokens are significantly more expensive than transcription-only models — don’t use the Realtime API when transcription is all you need
- If you want to build voice-powered workflows without managing the API infrastructure yourself, MindStudio gives you access to these models in a no-code builder with integrations already wired in — and you can build an AI agent in under an hour