Grok Imagine
X.ai's fast, native text-to-video and image-to-video generation model with built-in audio, multiple aspect ratios, and flexible creative modes.
Text and image to video with native audio
Grok Imagine Video is a video generation model developed by X.ai, capable of converting text prompts or static images into short video clips with synchronized audio. It launched in August 2025 and reached a major 1.0 release in February 2026. The model runs on X.ai's proprietary Aurora autoregressive engine, trained on 110,000 NVIDIA GB200 GPUs, and generates 720p video at 24 fps with clip lengths between 6 and 15 seconds.
What sets Grok Imagine Video apart is its built-in audio generation, which produces character dialogue, background music, and sound effects alongside the visuals without requiring separate post-production. It supports seven aspect ratios — including 16:9, 9:16, and 1:1 — and offers three creative modes: Normal, Fun, and Spicy. Generation typically completes in around 30 seconds, making it well suited for social media creators, marketers, and content teams that need fast turnaround on short-form video.
What Grok Imagine supports
Text-to-Video
Generates short video clips from a text prompt, producing 720p output at 24 fps with clip lengths ranging from 6 to 15 seconds.
Image-to-Video
Animates a static input image into a video clip, accepting image URLs as a direct input type.
Native Audio Generation
Automatically generates synchronized audio — including dialogue, background music, and sound effects — as part of the video output without separate editing.
Multiple Aspect Ratios
Supports seven aspect ratios (16:9, 9:16, 4:3, 3:4, 2:3, 3:2, and 1:1), chosen through a select-type input.
Creative Mode Selection
Offers three generation modes — Normal, Fun, and Spicy — allowing users to tune tone and content style per request.
Fast Generation Speed
Produces video clips in approximately 30 seconds per generation, enabling high-volume content workflows.
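As a rough back-of-the-envelope check on what "high-volume" could mean, the sketch below converts the approximate 30-second generation time into clips per hour. The function name and the concurrency parameter are illustrative assumptions, not part of any documented API, and real throughput depends on queueing and rate limits that are not described here.

```python
def clips_per_hour(seconds_per_clip: float = 30.0, concurrent_requests: int = 1) -> float:
    """Estimate clips per hour from an average generation time.

    Idealized: assumes every request takes the same time and that
    concurrent requests do not slow each other down.
    """
    if seconds_per_clip <= 0:
        raise ValueError("seconds_per_clip must be positive")
    if concurrent_requests < 1:
        raise ValueError("concurrent_requests must be at least 1")
    return 3600.0 / seconds_per_clip * concurrent_requests


print(clips_per_hour())         # 120.0 (one request at a time)
print(clips_per_hour(30.0, 4))  # 480.0 (four requests in parallel, hypothetical)
```

At one request at a time, the quoted figure works out to about 120 clips per hour; treat this as an upper bound for planning rather than a measured rate.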
Video URL Input
Accepts video URLs as a direct input type, enabling workflows that reference or build on existing video assets.
Common questions about Grok Imagine
What is the context window for Grok Imagine Video?
The model has a context window of 5,000 tokens, which governs the length and detail of text prompts it can process.
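A quick way to sanity-check a prompt against the 5,000-token window is sketched below. The model's tokenizer is not described here, so the estimate assumes roughly four characters per token, a common rule of thumb for English text; the helper names are hypothetical, and the real count may differ.

```python
import math

CONTEXT_WINDOW_TOKENS = 5_000  # figure quoted in the answer above
CHARS_PER_TOKEN = 4            # assumption: rough rule of thumb, not the real tokenizer


def estimate_tokens(prompt: str) -> int:
    """Crude token estimate based on character count."""
    return math.ceil(len(prompt) / CHARS_PER_TOKEN)


def fits_context(prompt: str, limit: int = CONTEXT_WINDOW_TOKENS) -> bool:
    """Return True if the estimated token count is within the limit."""
    return estimate_tokens(prompt) <= limit


prompt = "A paper boat drifting down a rain-soaked street at dusk, slow camera pan."
print(estimate_tokens(prompt), fits_context(prompt))
```

Under this heuristic the window corresponds to roughly 20,000 characters, far more than a typical video prompt needs.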
What resolution and frame rate does the model output?
Grok Imagine Video generates clips at 720p resolution and 24 frames per second. It does not currently support 1080p or 4K output.
How long are the video clips it produces?
Generated clips range from 6 to 15 seconds in length.
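Combined with the 24 fps frame rate, the clip-length range implies a frame count per clip. The sketch below does that arithmetic, assuming durations are requested in whole seconds (the page does not say which increments are allowed); the function name is illustrative.

```python
FPS = 24                            # frame rate stated for the model
MIN_SECONDS, MAX_SECONDS = 6, 15    # clip-length range stated above


def frame_count(duration_seconds: int, fps: int = FPS) -> int:
    """Number of frames in a clip of the given length."""
    if not MIN_SECONDS <= duration_seconds <= MAX_SECONDS:
        raise ValueError(
            f"duration must be between {MIN_SECONDS} and {MAX_SECONDS} seconds"
        )
    return duration_seconds * fps


print(frame_count(6))   # 144
print(frame_count(15))  # 360
```

So a clip contains between 144 and 360 frames, depending on the requested length.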
Where can I find pricing information for this model?
Pricing details are available on the X.ai models and pricing page at https://docs.x.ai/developers/models.
What is the training data cutoff for Grok Imagine Video?
The available metadata lists the model's training date as August 2025.
What input types does the model accept?
The model accepts image URLs, video URLs, select inputs (for options like aspect ratio and creative mode), and numeric inputs.
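To make these input types concrete, here is a minimal sketch of client-side validation before a request is sent. Every field name below (`aspect_ratio`, `mode`, `duration_seconds`, `image_url`, `video_url`) is hypothetical and chosen for illustration; it is also an assumption that the numeric input carries the clip duration. The allowed values come from the feature list on this page, not from an official schema.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse

ASPECT_RATIOS = {"16:9", "9:16", "4:3", "3:4", "2:3", "3:2", "1:1"}
MODES = {"normal", "fun", "spicy"}


def _is_http_url(value: str) -> bool:
    """True if the value parses as an http(s) URL with a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


@dataclass
class VideoRequest:
    # Hypothetical field names, for illustration only.
    prompt: str
    aspect_ratio: str = "16:9"
    mode: str = "normal"
    duration_seconds: int = 6          # assumed meaning of the numeric input
    image_url: Optional[str] = None
    video_url: Optional[str] = None

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the request looks valid."""
        problems: List[str] = []
        if self.aspect_ratio not in ASPECT_RATIOS:
            problems.append(f"unsupported aspect ratio: {self.aspect_ratio}")
        if self.mode.lower() not in MODES:
            problems.append(f"unknown creative mode: {self.mode}")
        if not 6 <= self.duration_seconds <= 15:
            problems.append("duration must be between 6 and 15 seconds")
        for name in ("image_url", "video_url"):
            url = getattr(self, name)
            if url is not None and not _is_http_url(url):
                problems.append(f"{name} is not a valid http(s) URL")
        return problems


request = VideoRequest(prompt="A lighthouse in a storm", aspect_ratio="9:16", mode="Fun")
print(request.validate())  # [] when every field passes the checks
```

Collecting all problems in a list, rather than raising on the first one, lets a form or workflow builder show every issue at once.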
What people think about Grok Imagine
Community discussion around Grok Imagine Video has been generally positive, with users describing its arrival in the public API as a milestone and discussing its placement on benchmark leaderboards.
Some commenters have focused on its speed and accessibility relative to other video generation tools, while others have raised questions about output quality and use cases for short-form content creation.