Speech to Text Model

Scribe v2

ElevenLabs' state-of-the-art speech recognition model, delivering highly accurate transcription across 90+ languages with advanced features like speaker diarization, entity detection, and precise word-level timestamps.

Publisher ElevenLabs
Type Transcription
Price $0.01/min
LATEST

Multilingual speech transcription with speaker diarization

Scribe v2 is ElevenLabs' flagship speech-to-text model, built to transcribe audio accurately across more than 90 languages with automatic language detection. It supports speaker diarization for up to 32 speakers, word-level timestamps, and entity detection across 56 named entity types, making it one of the more feature-rich transcription models available through an API. Developers can also supply up to 100 custom keyterms to improve recognition of domain-specific vocabulary, names, or technical jargon.

Scribe v2 is well suited for applications where transcription accuracy and rich metadata matter — such as meeting summarization, podcast indexing, media subtitling, and legal or medical documentation workflows. Its dynamic audio tagging feature automatically labels non-speech events, which adds context beyond spoken words. The combination of precise timing data and speaker attribution makes it a practical choice for any pipeline where knowing who said what and when is a requirement.
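Combining the word-level timestamps with speaker labels is how a pipeline answers "who said what and when." The sketch below groups per-word results into speaker turns; the field names (`text`, `start`, `end`, `speaker_id`) are illustrative assumptions, not the documented ElevenLabs response schema.

```python
# Illustrative sketch only: the word-record fields below ("text", "start",
# "end", "speaker_id") are assumed, not taken from the official API docs.

def group_by_speaker(words):
    """Collapse word-level results into speaker turns (who said what, when)."""
    segments = []
    for w in words:
        if segments and segments[-1]["speaker"] == w["speaker_id"]:
            # Same speaker as the previous word: extend the current turn.
            seg = segments[-1]
            seg["text"] += " " + w["text"]
            seg["end"] = w["end"]
        else:
            # Speaker changed: open a new turn.
            segments.append({
                "speaker": w["speaker_id"],
                "text": w["text"],
                "start": w["start"],
                "end": w["end"],
            })
    return segments

words = [
    {"text": "Hello", "start": 0.0, "end": 0.4, "speaker_id": "speaker_0"},
    {"text": "there", "start": 0.5, "end": 0.8, "speaker_id": "speaker_0"},
    {"text": "Hi",    "start": 1.2, "end": 1.4, "speaker_id": "speaker_1"},
]
turns = group_by_speaker(words)
# turns[0] spans 0.0–0.8s for speaker_0; turns[1] starts speaker_1's reply.
```

The same grouping works for meeting summarization or legal documentation, where attributed, time-stamped turns are the unit of record.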

What Scribe v2 supports

Multilingual Transcription

Transcribes spoken audio in over 90 languages with automatic language detection, requiring no manual language configuration.

Speaker Diarization

Identifies and separates individual speakers within a single audio file, supporting up to 32 distinct speakers.

Word-Level Timestamps

Provides precise timing for every transcribed word, enabling accurate alignment with audio or video content.

Entity Detection

Automatically identifies and labels named entities within transcriptions, covering 56 named entity types.

Keyterm Prompting

Accepts up to 100 custom keyterms to guide the model toward accurate recognition of domain-specific vocabulary or proper nouns.

Audio Event Tagging

Detects and labels non-speech audio events dynamically, adding contextual metadata beyond spoken words.
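For the media-subtitling use case mentioned above, word-level timestamps map directly onto subtitle cues. This is a minimal sketch that chunks a word list into SubRip (SRT) cues; the input shape ({"text", "start", "end"} dicts) is an assumption for illustration, not the documented response format.

```python
# Sketch: turn word-level timestamps into SRT subtitle cues.
# Input word dicts are hypothetical, not the official response schema.

def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=7):
    """Chunk words into fixed-size cues and join them as one SRT string."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        idx = len(cues) + 1
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{idx}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues)
```

A fixed word count per cue keeps the sketch short; a production subtitler would also break cues on pauses and punctuation.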

Ready to build with Scribe v2?

Get Started Free

Common questions about Scribe v2

How many languages does Scribe v2 support?

Scribe v2 supports transcription in over 90 languages and can automatically detect the spoken language without requiring manual configuration.

Does Scribe v2 have a context window limit?

No context window is specified in the available metadata for Scribe v2, as it is a speech-to-text model rather than a text-based language model. Limits, if any, would apply to audio file length or size as defined by the ElevenLabs API.

How many speakers can Scribe v2 distinguish in a single file?

Scribe v2's speaker diarization feature can identify and separate up to 32 individual speakers within a single audio file.

Can I improve recognition of specialized terminology?

Yes. Scribe v2 supports keyterm prompting, which allows you to supply up to 100 custom terms — such as product names, technical jargon, or proper nouns — to guide the model toward more accurate recognition.
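Since the keyterm list is capped at 100 entries, it is worth cleaning it client-side before sending a request. A minimal sketch, assuming only the cap stated on this page; the request parameter the cleaned list is passed under depends on the ElevenLabs API documentation.

```python
# Sketch: tidy a keyterm list before attaching it to a transcription request.
# The 100-term cap comes from the page above; everything else is illustrative.

MAX_KEYTERMS = 100

def prepare_keyterms(terms):
    """Drop blanks, de-duplicate case-insensitively (keeping order), enforce the cap."""
    seen, cleaned = set(), []
    for t in terms:
        t = t.strip()
        if t and t.lower() not in seen:
            seen.add(t.lower())
            cleaned.append(t)
    if len(cleaned) > MAX_KEYTERMS:
        raise ValueError(f"Too many keyterms: {len(cleaned)} > {MAX_KEYTERMS}")
    return cleaned
```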

What types of named entities can Scribe v2 detect?

Scribe v2 can automatically identify and label 56 types of named entities within a transcription, such as people, organizations, and locations.

Parameters & options

Include Speakers (select)

Choose whether to include timing and speaker information in the transcription.

Options: Yes / No
Default: No

Start building with Scribe v2

No API keys required. Create AI-powered workflows with Scribe v2 in minutes — free.