Whisper
General-purpose speech recognition model
Multilingual speech recognition and audio translation
Whisper is a general-purpose speech recognition model developed by OpenAI and made available via the OpenAI API under the model ID whisper-1. It was trained on a large dataset of diverse audio, enabling it to handle a wide range of accents, background noise conditions, and technical vocabulary. What distinguishes Whisper is its multitask design: it can perform not only speech-to-text transcription but also speech translation into English and automatic language identification within a single model.
Whisper is well suited for developers building transcription pipelines, subtitle generation tools, voice interfaces, or any application that requires converting spoken audio into structured text. It supports multilingual input, making it useful for global applications where audio may arrive in different languages. The model accepts common audio formats and returns transcriptions or translations as plain text or with optional timestamps.
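To make the request shape concrete, here is a minimal sketch of calling the transcription endpoint using only the Python standard library. The endpoint URL and the `model`/`file` form fields follow the OpenAI API; the multipart encoder, the `transcribe` helper name, and the environment-variable handling are illustrative choices, not part of any official SDK.

```python
import json
import os
import urllib.request
import uuid

# Transcription endpoint per the OpenAI API; translations use
# /v1/audio/translations with the same form fields.
API_URL = "https://api.openai.com/v1/audio/transcriptions"


def encode_multipart(fields, file_field, filename, data):
    """Build a multipart/form-data body and its matching Content-Type header."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", value]
    lines += [f"--{boundary}",
              f'Content-Disposition: form-data; name="{file_field}"; '
              f'filename="{filename}"',
              "Content-Type: application/octet-stream", ""]
    body = ("\r\n".join(lines).encode() + b"\r\n" + data + b"\r\n"
            + f"--{boundary}--\r\n".encode())
    return body, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str) -> str:
    """Send one audio file to whisper-1 (network call; needs OPENAI_API_KEY)."""
    with open(path, "rb") as f:
        body, content_type = encode_multipart(
            {"model": "whisper-1"}, "file", os.path.basename(path), f.read()
        )
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": content_type,
        },
    )
    with urllib.request.urlopen(req) as resp:
        # The default json response format wraps the result as {"text": ...}.
        return json.loads(resp.read())["text"]
```

In practice most callers would use the official `openai` SDK instead; the hand-rolled multipart body above just makes explicit what that SDK sends, e.g. `transcribe("speech.mp3")` for a local file.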
What Whisper supports
Speech Transcription
Converts spoken audio into written text, supporting a wide range of languages, accents, and audio quality levels.
Speech Translation
Translates spoken audio from supported non-English languages directly into English text in a single pass.
Language Identification
Automatically detects the language spoken in an audio file without requiring the caller to specify it in advance.
Timestamp Output
Optionally returns word- or segment-level timestamps alongside transcribed text, useful for subtitle and caption generation.
Audio Format Support
Accepts multiple common audio formats including mp3, mp4, mpeg, mpga, m4a, wav, and webm via the API.
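The timestamp output above is what makes subtitle generation straightforward. As a sketch, the helpers below turn a list of segment dicts with `start`, `end`, and `text` keys (the shape of segment-level timestamps in a verbose JSON response) into SubRip (SRT) text; the sample input in the usage note is illustrative, not real API output.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a duration in seconds as an SRT timestamp, e.g. 3.5 -> 00:00:03,500."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render a list of {'start', 'end', 'text'} segments as SRT subtitle blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> "
            f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

For example, `segments_to_srt([{"start": 0.0, "end": 2.0, "text": "Hello"}])` produces a numbered block with the `00:00:00,000 --> 00:00:02,000` timing line that subtitle players expect.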
Common questions about Whisper
What is the maximum audio file size Whisper accepts via the API?
The OpenAI API enforces a 25 MB file size limit per audio file submitted to the Whisper endpoint.
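Because oversized uploads are rejected by the endpoint, it is cheaper to check file size locally first. A minimal sketch, assuming the 25 MB limit means 25 × 1024 × 1024 bytes (the exact byte count used by the API is an assumption here):

```python
import os

# 25 MB per-file limit; whether the API counts 25 * 1024 * 1024 or
# 25 * 10**6 bytes is an assumption worth verifying against current docs.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def check_upload_size(path: str) -> int:
    """Return the file size in bytes, or raise if it exceeds the upload limit."""
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        raise ValueError(
            f"{path} is {size} bytes; the Whisper endpoint accepts "
            f"at most {MAX_UPLOAD_BYTES} bytes per file."
        )
    return size
```

Files over the limit are typically split into chunks, or compressed to a smaller format such as mp3, before upload.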
Does Whisper have a context window like text models?
Whisper is an audio model, not a text model, so it does not have a token-based context window. Audio inputs are processed internally in roughly 30-second windows, so long recordings are handled without a caller-facing length limit beyond the file size cap.
What languages does Whisper support for transcription?
Whisper supports transcription in dozens of languages. It was trained on a large multilingual audio dataset covering roughly 100 languages, and accuracy is strongest for widely spoken languages that are well represented in that training data.
Can Whisper translate languages other than English into English?
Yes. Whisper's translation capability converts spoken audio in supported non-English languages into English text. Translation into languages other than English is not supported by the model.
How is Whisper priced on the OpenAI API?
Whisper is billed per minute of audio processed. Pricing details are published on OpenAI's pricing page and may change over time.
What people think about Whisper
Community discussion of Whisper in the threads reviewed here is limited: only one directly relevant post covers EasyWhisperUI, an open-source GUI that adds cross-platform GPU support for running Whisper locally on Windows and Mac. That thread attracted modest engagement, suggesting a niche but active audience of developers interested in self-hosted transcription workflows.
The other thread found is unrelated to Whisper and concerns a different model entirely. No significant community concerns or limitations specific to Whisper were surfaced in these threads.
Claude Code is a Beast – Tips from 6 Months of Hardcore Use
EasyWhisperUI - Open-Source Easy UI for OpenAI’s Whisper model with cross platform GPU support (Windows/Mac)
Documentation & links
Explore similar models
Start building with Whisper
No API keys required. Create AI-powered workflows with Whisper in minutes — free.