Text to Speech Model

MiniMax Speech 2.8 HD

MiniMax Speech 2.8 HD is a studio-quality text-to-speech model that delivers broadcast-ready, emotionally expressive audio rivaling professional voice actors.

Publisher: MiniMax
Type: Text to Speech
Context Window: 50,000 tokens
Training Data: January 2026
Price: $0.10 per 1,000 characters
Provider: WaveSpeed
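
The per-character price makes it easy to budget a script before synthesizing it. The sketch below is only an illustration: `estimate_cost_usd` is a hypothetical helper, and it assumes each Python character counts as one billed character, which this page does not confirm (whitespace, punctuation, or interjection tags may be counted differently).

```python
from decimal import Decimal

# Listed price on this page: $0.10 per 1,000 characters.
PRICE_PER_1K_CHARS = Decimal("0.10")


def estimate_cost_usd(text: str) -> Decimal:
    """Estimate cost, ASSUMING one billed character per Python character."""
    return (Decimal(len(text)) / Decimal(1000)) * PRICE_PER_1K_CHARS


if __name__ == "__main__":
    script = "x" * 2500  # stand-in for a 2,500-character script
    print(f"Estimated cost: ${estimate_cost_usd(script)}")
```

`Decimal` is used instead of floats so that currency arithmetic does not pick up binary rounding noise.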

Studio-quality text-to-speech with emotional expression

MiniMax Speech 2.8 HD is a high-definition text-to-speech model developed by MiniMax, built on an autoregressive Transformer architecture with a Flow-VAE decoder. Instead of using traditional mel-spectrogram vocoders, it models speech in a learned latent space, which produces audio with natural cadence, proper intonation, and emotional depth. The model accepts up to 50,000 tokens of input text and was trained through January 2026.

The model offers 17 or more expressive voice presets spanning different genders, ages, and speaking styles, along with support for natural interjections such as laughs, sighs, and gasps embedded directly in text. Users can control emotion, speed, volume, pitch, sample rate, bitrate, channel configuration, and output format. These features make it well suited for audiobook production, video voiceovers, podcast creation, e-learning narration, accessibility applications, and game development.

What MiniMax Speech 2.8 HD supports

Voice Presets

Provides 17 or more built-in voice options spanning different genders, ages, and speaking styles, selectable via a dropdown input.

Emotion Control

Allows setting the emotional tone of synthesized speech — such as happy or calm — to match the intended content context.

Natural Interjections

Supports embedding over 20 human sounds like (laughs), (sighs), and (gasps) directly in input text for lifelike delivery.
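
Because interjections are written inline as parenthesized words, a script can be checked for them before it is submitted. The following is a minimal sketch, not part of the product: `find_interjections` is a hypothetical helper, and `KNOWN_TAGS` contains only the three tags named on this page, since the full set of 20+ sounds is not listed here. A tag reported as unrecognized may still be valid for the model.

```python
import re

# Only the three tags named on this page; the full supported set is not listed.
KNOWN_TAGS = {"laughs", "sighs", "gasps"}

# Matches a single lowercase word in parentheses, e.g. "(laughs)".
TAG_PATTERN = re.compile(r"\(([a-z]+)\)")


def find_interjections(text: str) -> tuple[list[str], list[str]]:
    """Return (recognized, unrecognized) parenthesized tags, in order of appearance."""
    recognized, unrecognized = [], []
    for tag in TAG_PATTERN.findall(text):
        (recognized if tag in KNOWN_TAGS else unrecognized).append(tag)
    return recognized, unrecognized


if __name__ == "__main__":
    line = "I can't believe it worked (laughs). Well (sighs), back to work (yawns)."
    print(find_interjections(line))
```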

Audio Format Control

Exposes configurable parameters for sample rate, bitrate, channel configuration, and output format through dedicated select inputs.

Speech Rate & Pitch

Accepts numeric inputs to adjust playback speed, volume level, and pitch independently for fine-grained audio tuning.

Custom Pronunciation

Supports a custom pronunciation dictionary to handle brand names, acronyms, and specialized terminology with precise phonetic control.
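
This page does not show the format of the pronunciation dictionary, so the sketch below does not attempt to reproduce it. It illustrates the same idea on the client side: respelling difficult terms before the text is sent. The `RESPELLINGS` entries and the `apply_respellings` helper are hypothetical.

```python
import re

# Hypothetical term -> respelling map; entries are illustrative only.
RESPELLINGS = {"SQL": "sequel", "nginx": "engine x"}


def apply_respellings(text: str, respellings: dict[str, str]) -> str:
    """Replace whole-word, case-sensitive occurrences of each term with its respelling."""
    if not respellings:
        return text
    # Longest terms first so a longer term wins over a shorter one it contains.
    terms = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: respellings[m.group(1)], text)


if __name__ == "__main__":
    print(apply_respellings("Restart nginx, then check the SQL logs.", RESPELLINGS))
```

Where the built-in dictionary is available, it is likely the better choice, since it offers phonetic control rather than approximate respelling.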

Large Text Input

Accepts up to 50,000 tokens of input text in a single request, enabling long-form content like full audiobook chapters.


Common questions about MiniMax Speech 2.8 HD

What is the maximum input length for MiniMax Speech 2.8 HD?

The model supports a context window of 50,000 tokens, which allows for long-form content such as full chapters or extended scripts in a single request.
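
For manuscripts longer than the limit, the text has to be split across requests. The sketch below packs whole paragraphs into chunks under a budget. It rests on a loud assumption: the model's tokenizer is not described on this page, so `rough_token_count` (a whitespace word count) is only a crude stand-in, and a conservative budget well below 50,000 would be prudent. All function names are hypothetical.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def chunk_paragraphs(
    text: str,
    budget: int,
    count: Callable[[str], int] = rough_token_count,
) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks within the budget.

    A single paragraph larger than the budget is emitted as its own oversized
    chunk rather than being split mid-paragraph.
    """
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        size = count(para)
        if current and used + size > budget:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    sample = "a b c\n\nd e\n\nf g h i"
    for i, chunk in enumerate(chunk_paragraphs(sample, budget=5), start=1):
        print(i, repr(chunk))
```

Splitting on paragraph boundaries keeps each request ending at a natural pause, which should make the joins between audio segments less noticeable.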

What audio output formats and quality settings are available?

Users can configure sample rate, bitrate, channel (mono or stereo), and output format through dedicated select inputs, giving full control over the final audio file.
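
These settings largely determine file size. As a rough worked example, the sketch below estimates output size from the listed options. It assumes the bitrate setting yields constant-bitrate encoded audio and that PCM output uses 16-bit samples; neither is stated on this page, and container overhead is ignored.

```python
def compressed_size_bytes(bitrate_bps: int, seconds: float) -> float:
    """Approximate constant-bitrate size: bits per second times duration, divided by 8."""
    return bitrate_bps * seconds / 8


def pcm_size_bytes(
    sample_rate_hz: int,
    channels: int,
    seconds: float,
    bytes_per_sample: int = 2,  # ASSUMPTION: 16-bit samples
) -> float:
    """Raw PCM size: samples per second x channels x bytes per sample x duration."""
    return sample_rate_hz * channels * bytes_per_sample * seconds


if __name__ == "__main__":
    ten_minutes = 600
    # 128,000 bps and 44,100 Hz are the listed defaults; mono is chosen for illustration.
    print(compressed_size_bytes(128_000, ten_minutes) / 1_000_000, "MB encoded")
    print(pcm_size_bytes(44_100, 1, ten_minutes) / 1_000_000, "MB raw PCM")
```

Under these assumptions, ten minutes of audio comes to about 9.6 MB encoded at 128,000 bps, versus about 52.9 MB as raw mono PCM at 44,100 Hz.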

Can I control how the voice sounds beyond just selecting a preset?

Yes. In addition to choosing from 17 or more voice presets, you can adjust speed, volume, pitch, and emotional tone, and embed natural interjections like (laughs) or (sighs) directly in the input text.

What is the training data cutoff for this model?

The model's training data cutoff is listed as January 2026.

What types of applications is MiniMax Speech 2.8 HD best suited for?

The model is designed for use cases that require high-fidelity, human-sounding audio, including audiobook production, video voiceovers, podcast creation, e-learning narration, accessibility tools, and game development.

Parameters & options

Voice (Select)

Voice preset to use for speech synthesis.

Default: Friendly_Person
Options: Wise Woman, Friendly Person, Inspirational Girl, Deep Voice Man, Calm Woman, Casual Guy, Lively Girl, Patient Man, Young Knight, Determined Man, Lovely Girl, Decent Boy, Imposing Manner, Elegant Man, Abbess, Sweet Girl 2, Exuberant Girl
Speed (Number)

Speech speed multiplier.

Default: 1
Volume (Number)

Volume level.

Default: 1
Pitch (Number)

Pitch adjustment.

Default: 0
Emotion (Select)

Emotional tone of the speech delivery.

Options: Happy, Calm
Sample Rate (Select)

Audio sample rate in Hz.

Default: 44100
Options: 16000, 24000, 32000, 44100
Bitrate (Select)

Audio bitrate in bits per second.

Default: 128000
Options: 32000, 64000, 128000, 256000
Channel (Select)

Audio channel configuration.

Options: Mono, Stereo
Format (Select)

Output audio format.

Options: MP3, WAV, FLAC, OGG, PCM
Language Boost (Select)

Boost recognition for a specific language.

Options: Auto, Afrikaans, Arabic, Bulgarian, Catalan, Chinese, Chinese (Yue), Croatian, Czech, Danish, Dutch, English, Filipino, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malay, Norwegian, Nynorsk, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, Vietnamese
English Normalization (Toggle Group)

Improves number-reading performance in English text (dates, currencies, etc.).
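
The options above can be checked before a request is made. The sketch below is a hypothetical client-side validator, not the platform's API: the key names (`voice`, `sample_rate`, and so on) are invented for illustration, while the allowed values and defaults are the ones listed on this page. Fields for which the page shows no default (emotion, channel, format) are left unset.

```python
# Allowed values as listed on this page; key names are hypothetical.
ALLOWED = {
    "emotion": {"Happy", "Calm"},
    "sample_rate": {16000, 24000, 32000, 44100},
    "bitrate": {32000, 64000, 128000, 256000},
    "channel": {"Mono", "Stereo"},
    "format": {"MP3", "WAV", "FLAC", "OGG", "PCM"},
}

# Defaults shown on the page (pitch is displayed as 0).
DEFAULTS = {
    "voice": "Friendly_Person",
    "speed": 1,
    "volume": 1,
    "pitch": 0,
    "sample_rate": 44100,
    "bitrate": 128000,
}


def build_settings(**overrides) -> dict:
    """Merge overrides onto the listed defaults and reject values not listed as options."""
    settings = {**DEFAULTS, **overrides}
    for key, allowed in ALLOWED.items():
        if key in settings and settings[key] not in allowed:
            options = sorted(allowed, key=str)
            raise ValueError(f"{key}={settings[key]!r} is not one of {options}")
    return settings


if __name__ == "__main__":
    print(build_settings(emotion="Calm", format="WAV", channel="Mono"))
```

Numeric ranges for speed, volume, and pitch are not given on this page, so the sketch does not validate them.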

Start building with MiniMax Speech 2.8 HD

No API keys required. Create AI-powered workflows with MiniMax Speech 2.8 HD in minutes, free.