MiniMax Speech 2.8 HD
MiniMax Speech 2.8 HD is a studio-quality text-to-speech model that delivers broadcast-ready, emotionally expressive audio rivaling professional voice actors.
Studio-quality text-to-speech with emotional expression
MiniMax Speech 2.8 HD is a high-definition text-to-speech model developed by MiniMax, built on an autoregressive Transformer architecture with a Flow-VAE decoder. Instead of using traditional mel-spectrogram vocoders, it models speech in a learned latent space, which produces audio with natural cadence, proper intonation, and emotional depth. The model accepts up to 50,000 tokens of input text and was trained through January 2026.
The model offers 17 or more expressive voice presets spanning different genders, ages, and speaking styles, along with support for natural interjections such as laughs, sighs, and gasps embedded directly in text. Users can control emotion, speed, volume, pitch, sample rate, bitrate, channel configuration, and output format. These features make it well suited for audiobook production, video voiceovers, podcast creation, e-learning narration, accessibility applications, and game development.
What MiniMax Speech 2.8 HD supports
Voice Presets
Provides 17 or more built-in voice options spanning different genders, ages, and speaking styles, selectable via a dropdown input.
Emotion Control
Sets the emotional tone of the synthesized speech, such as happy or calm, to match the content.
Natural Interjections
Supports embedding over 20 human sounds like (laughs), (sighs), and (gasps) directly in input text for lifelike delivery.
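Because interjections are written inline as parenthesized words, it can help to scan a script before submitting it. The following is a minimal sketch, not part of any MiniMax tooling: `KNOWN_TAGS` holds only the three tags named above (the full list of 20+ sounds is not given here), and `find_interjections` is a hypothetical helper name.

```python
import re

# Only the three tags named in the text above; the complete supported
# list is not given on this page, so extend this set as needed.
KNOWN_TAGS = {"laughs", "sighs", "gasps"}

# Matches a single lowercase word wrapped in parentheses, e.g. "(laughs)".
_TAG_PATTERN = re.compile(r"\(([a-z]+)\)")


def find_interjections(text: str) -> list[tuple[str, bool]]:
    """Return (tag, is_known) for each parenthesized lowercase word in text."""
    return [
        (match.group(1), match.group(1) in KNOWN_TAGS)
        for match in _TAG_PATTERN.finditer(text)
    ]


script = "Well (laughs) that was unexpected. (sighs) Let's try again (cheers)."
for tag, known in find_interjections(script):
    print(f"({tag}): {'known' if known else 'not in the local list'}")
```

A tag flagged as unknown here is not necessarily unsupported by the model; it only means the local set does not list it.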
Audio Format Control
Exposes configurable parameters for sample rate, bitrate, channel configuration, and output format through dedicated select inputs.
Speech Rate & Pitch
Accepts numeric inputs to adjust playback speed, volume level, and pitch independently for fine-grained audio tuning.
Custom Pronunciation
Supports a custom pronunciation dictionary to handle brand names, acronyms, and specialized terminology with precise phonetic control.
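The format of the model's pronunciation dictionary is not described here. As a rough illustration of the idea only, the sketch below applies the same kind of term-to-spoken-form mapping on the client side, as plain text substitution before the text is submitted. The function name `apply_pronunciations` and the example entries are assumptions.

```python
import re


def apply_pronunciations(text: str, entries: dict[str, str]) -> str:
    """Replace whole-word occurrences of each key with its spoken form."""
    if not entries:
        return text
    # Longest keys first so a longer term wins over a shorter prefix.
    keys = sorted(entries, key=len, reverse=True)
    # Lookarounds keep matches from firing inside longer words (e.g. "SQLite").
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: entries[m.group(0)], text)


# Hypothetical entries: written form -> how it should be read aloud.
entries = {"SQL": "sequel", "MiniMax": "mini max"}
print(apply_pronunciations("MiniMax reads SQL, not SQLite.", entries))
```

Matching is case-sensitive and whole-word, so "SQLite" is left untouched while the standalone "SQL" is rewritten.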
Large Text Input
Accepts up to 50,000 tokens of input text in a single request, enabling long-form content like full audiobook chapters.
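Content longer than the limit has to be split across requests. The sketch below packs paragraphs greedily into chunks that stay under a budget. It assumes a whitespace word count as a stand-in for tokens, since the model's tokenizer is not described here; `rough_token_count` and `pack_paragraphs` are hypothetical names.

```python
def rough_token_count(text: str) -> int:
    """Approximate token count by whitespace-separated words (an assumption)."""
    return len(text.split())


def pack_paragraphs(text: str, budget: int = 50_000) -> list[str]:
    """Greedily group blank-line-separated paragraphs into chunks within budget.

    A single paragraph larger than the budget is emitted as its own chunk
    rather than being split mid-paragraph.
    """
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        cost = rough_token_count(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and used + cost > budget:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


chapter = "a b c\n\nd e\n\nf g h i"
print(pack_paragraphs(chapter, budget=5))
```

Splitting on paragraph boundaries keeps each request starting and ending at a natural pause, which tends to matter more for speech than for text.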
Common questions about MiniMax Speech 2.8 HD
What is the maximum input length for MiniMax Speech 2.8 HD?
The model accepts up to 50,000 tokens of input text in a single request, enough for long-form content such as full chapters or extended scripts.
What audio output formats and quality settings are available?
Users can configure sample rate, bitrate, channel (mono or stereo), and output format through dedicated select inputs, giving full control over the final audio file.
Can I control how the voice sounds beyond just selecting a preset?
Yes. In addition to choosing from 17 or more voice presets, you can adjust speed, volume, pitch, and emotional tone, and embed natural interjections like (laughs) or (sighs) directly in the input text.
What is the training data cutoff for this model?
The model's training date is listed as January 2026.
What types of applications is MiniMax Speech 2.8 HD best suited for?
The model is designed for use cases that require high-fidelity, human-sounding audio, including audiobook production, video voiceovers, podcast creation, e-learning narration, accessibility tools, and game development.
Parameters & options
Voice preset to use for speech synthesis.
Speech speed multiplier.
Volume level.
Pitch adjustment.
Emotional tone of the speech delivery.
Audio sample rate in Hz.
Audio bitrate in bits per second.
Audio channel configuration.
Output audio format.
Boost recognition for a specific language.
Improves number-reading performance in English text (dates, currencies, etc.).
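The options listed above can be gathered into a single settings object before a request is made. This is a sketch under stated assumptions: the page does not give the actual parameter names, so every field name below is hypothetical, and no value ranges are enforced apart from the mono/stereo channel choice mentioned in the FAQ.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SpeechOptions:
    """Hypothetical container for the options described above.

    Field names are illustrative only; they are not the model's real
    parameter names.
    """

    text: str
    voice: Optional[str] = None            # voice preset
    speed: Optional[float] = None          # speech speed multiplier
    volume: Optional[float] = None         # volume level
    pitch: Optional[float] = None          # pitch adjustment
    emotion: Optional[str] = None          # e.g. "happy" or "calm"
    sample_rate_hz: Optional[int] = None   # audio sample rate in Hz
    bitrate_bps: Optional[int] = None      # audio bitrate in bits per second
    channel: Optional[str] = None          # "mono" or "stereo"
    output_format: Optional[str] = None    # output audio format
    language_boost: Optional[str] = None   # language to boost recognition for
    english_normalization: Optional[bool] = None  # number reading in English

    def to_payload(self) -> dict:
        """Return only the options that were explicitly set."""
        if self.channel not in (None, "mono", "stereo"):
            raise ValueError("channel must be 'mono' or 'stereo'")
        return {k: v for k, v in asdict(self).items() if v is not None}


options = SpeechOptions(text="Hello there.", emotion="calm", speed=1.0, channel="mono")
print(options.to_payload())
```

Leaving unset options out of the payload lets whatever defaults the service applies take effect, rather than guessing at them on the client.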