Whisper
General-purpose speech recognition model
Multilingual speech recognition and audio translation
Whisper is a general-purpose speech recognition model developed by OpenAI and made available via the OpenAI API under the model ID whisper-1. It was trained on a large dataset of diverse audio, enabling it to handle a wide range of accents, background noise conditions, and technical vocabulary. What distinguishes Whisper is its multitask design: it can perform not only speech-to-text transcription but also speech translation into English and automatic language identification within a single model.
Whisper is well suited for developers building transcription pipelines, subtitle generation tools, voice interfaces, or any application that requires converting spoken audio into structured text. It supports multilingual input, making it useful for global applications where audio may arrive in different languages. The model accepts common audio formats and returns transcriptions or translations as plain text or with optional timestamps.
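To make the request shape concrete, here is a minimal sketch of calling the transcription endpoint using only the Python standard library. The endpoint URL and the `model`/`file` form fields follow the OpenAI API; the multipart encoder, the `transcribe` helper name, and the environment-variable handling are illustrative choices, not part of any official SDK.

```python
import json
import os
import urllib.request
import uuid

# Transcription endpoint per the OpenAI API; translations use
# /v1/audio/translations with the same form fields.
API_URL = "https://api.openai.com/v1/audio/transcriptions"


def encode_multipart(fields, file_field, filename, data):
    """Build a multipart/form-data body and its matching Content-Type header."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", value]
    lines += [f"--{boundary}",
              f'Content-Disposition: form-data; name="{file_field}"; '
              f'filename="{filename}"',
              "Content-Type: application/octet-stream", ""]
    body = ("\r\n".join(lines).encode() + b"\r\n" + data + b"\r\n"
            + f"--{boundary}--\r\n".encode())
    return body, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str) -> str:
    """Send one audio file to whisper-1 (network call; needs OPENAI_API_KEY)."""
    with open(path, "rb") as f:
        body, content_type = encode_multipart(
            {"model": "whisper-1"}, "file", os.path.basename(path), f.read()
        )
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": content_type,
        },
    )
    with urllib.request.urlopen(req) as resp:
        # The default json response format wraps the result as {"text": ...}.
        return json.loads(resp.read())["text"]
```

In practice most callers would use the official `openai` SDK instead; the hand-rolled multipart body above just makes explicit what that SDK sends, e.g. `transcribe("speech.mp3")` for a local file.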
What Whisper supports
Speech Transcription
Converts spoken audio into written text, supporting a wide range of languages, accents, and audio quality levels.
Speech Translation
Translates spoken audio from supported non-English languages directly into English text in a single pass.
Language Identification
Automatically detects the language spoken in an audio file without requiring the caller to specify it in advance.
Timestamp Output
Optionally returns word- or segment-level timestamps alongside transcribed text, useful for subtitle and caption generation.
Audio Format Support
Accepts multiple common audio formats including mp3, mp4, mpeg, mpga, m4a, wav, and webm via the API.
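The timestamp output above is what makes subtitle generation straightforward. As a sketch, the helpers below turn a list of segment dicts with `start`, `end`, and `text` keys (the shape of segment-level timestamps in a verbose JSON response) into SubRip (SRT) text; the sample input in the usage note is illustrative, not real API output.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a duration in seconds as an SRT timestamp, e.g. 3.5 -> 00:00:03,500."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render a list of {'start', 'end', 'text'} segments as SRT subtitle blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> "
            f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

For example, `segments_to_srt([{"start": 0.0, "end": 2.0, "text": "Hello"}])` produces a numbered block with the `00:00:00,000 --> 00:00:02,000` timing line that subtitle players expect.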
Common questions about Whisper
What is the maximum audio file size Whisper accepts via the API?
The OpenAI API enforces a 25 MB file size limit per audio file submitted to the Whisper endpoint.
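Because oversized uploads are rejected by the endpoint, it is cheaper to check file size locally first. A minimal sketch, assuming the 25 MB limit means 25 × 1024 × 1024 bytes (the exact byte count used by the API is an assumption here):

```python
import os

# 25 MB per-file limit; whether the API counts 25 * 1024 * 1024 or
# 25 * 10**6 bytes is an assumption worth verifying against current docs.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def check_upload_size(path: str) -> int:
    """Return the file size in bytes, or raise if it exceeds the upload limit."""
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        raise ValueError(
            f"{path} is {size} bytes; the Whisper endpoint accepts "
            f"at most {MAX_UPLOAD_BYTES} bytes per file."
        )
    return size
```

Files over the limit are typically split into chunks, or compressed to a smaller format such as mp3, before upload.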
Does Whisper have a context window like text models?
Whisper is an audio model, not a text model, so it does not have a token-based context window. Audio inputs are processed internally in roughly 30-second windows, so long recordings are handled without a caller-facing length limit beyond the file size cap.
What languages does Whisper support for transcription?
Whisper supports transcription in dozens of languages. It was trained on a large multilingual audio dataset covering roughly 100 languages, and accuracy is strongest for widely spoken languages that are well represented in that training data.
Can Whisper translate languages other than English into English?
Yes. Whisper's translation capability converts spoken audio in supported non-English languages into English text. Translation into languages other than English is not supported by the model.
How is Whisper priced on the OpenAI API?
Whisper is billed per minute of audio processed. Pricing details are published on OpenAI's pricing page and may change over time.
What people think about Whisper
Community discussion of Whisper in the threads reviewed here is limited: only one directly relevant post covers EasyWhisperUI, an open-source GUI that adds cross-platform GPU support for running Whisper locally on Windows and Mac. That thread attracted modest engagement, suggesting a niche but active audience of developers interested in self-hosted transcription workflows.
The other thread found is unrelated to Whisper and concerns a different model entirely. No significant community concerns or limitations specific to Whisper were surfaced in these threads.
Claude Code is a Beast – Tips from 6 Months of Hardcore Use
EasyWhisperUI - Open-Source Easy UI for OpenAI’s Whisper model with cross platform GPU support (Windows/Mac)
Documentation & links
Explore similar models
Start building with Whisper
No API keys required. Create AI-powered workflows with Whisper in minutes — free.