How to Generate AI Videos with Claude Code, HyperFrames, and ElevenLabs
This open-source workflow uses Claude Code, HyperFrames, and ElevenLabs to generate fully synced AI videos from a single topic prompt in under 15 minutes.
From Single Prompt to Finished Video in Under 15 Minutes
Most people who want to generate AI videos end up stitching together three or four separate tools manually — writing a script in one place, generating images in another, recording or synthesizing audio somewhere else, and then cutting it all together in a video editor. It works, but it’s slow and inconsistent.
This open-source workflow changes that. By combining Claude Code, HyperFrames, and ElevenLabs, you can go from a single topic prompt to a fully synced AI-generated video in under 15 minutes — without touching a video editor. The workflow handles scripting, frame generation, voiceover synthesis, and final assembly automatically.
This guide breaks down exactly how each piece works, how to set it up, and where things typically go wrong.
What Each Tool Actually Does
Before running the workflow, it helps to understand what role each component plays. These aren’t interchangeable — each one handles a distinct stage of video production.
Claude Code
Claude Code is Anthropic’s agentic coding environment. Unlike a standard chatbot, it can write code, execute it, iterate on errors, and manage files — all autonomously. In this workflow, Claude Code acts as the orchestrator.
It takes your topic prompt, writes a structured video script, breaks it into timed segments, coordinates calls to the other tools, and assembles the final output. You’re not writing a single line of code yourself — Claude Code handles it.
HyperFrames
HyperFrames is an open-source library for AI-driven frame and image sequence generation. It takes your script segments and generates the visual content for each scene — either as static frames, animated sequences, or short video clips depending on your configuration.
It supports multiple image and video generation backends, so you can route requests to models like FLUX, Stable Diffusion, or others depending on what you have access to. The key feature is that it maps script timing to frame output, which is what makes audio-visual sync possible later.
ElevenLabs
ElevenLabs handles the voiceover. It converts your script text to natural-sounding audio using its speech synthesis models. You can choose from a library of preset voices or clone a custom voice.
In this workflow, ElevenLabs doesn’t just generate audio — it also returns timing metadata (word-level timestamps) that the assembly step uses to sync narration to the correct frames.
Prerequisites Before You Start
You’ll need a few things in place before running the workflow:
- Claude Code — available through Anthropic’s API or the Claude.ai Pro plan with Code access enabled
- HyperFrames — install from the GitHub repository: npm install hyperframes or pip install hyperframes, depending on your stack
- ElevenLabs API key — a free tier account works for testing, but production use will require a paid plan for higher character limits
- ffmpeg — the workflow uses ffmpeg to merge audio and video tracks in the final assembly step; install it via your system package manager
- A working Node.js or Python environment (v18+ for Node, 3.10+ for Python)
If you’re planning to use local image generation models via HyperFrames, you’ll also want a GPU with at least 8GB VRAM for reasonable generation speeds.
How the Workflow Is Structured
The entire pipeline runs in five stages. Claude Code manages the transitions between each one automatically.
Stage 1: Topic Ingestion and Script Generation
You give Claude Code a prompt like: “Create a 90-second video explaining how neural networks learn from data.”
Claude Code parses the prompt, determines the appropriate video length, and writes a structured script. The script format matters here — it needs to be segmented with timing markers so that HyperFrames and ElevenLabs can work from the same source of truth.
A script segment looks roughly like this:
[00:00 - 00:12]
NARRATION: Neural networks learn by adjusting millions of tiny connections based on examples.
VISUAL: Abstract animation of nodes and weighted edges shifting
MOOD: technical, clean
Claude Code writes all of this programmatically. You don’t edit it unless you want to.
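As a rough illustration of how downstream stages could read that format, here is a minimal Python sketch of a parser. It assumes the layout shown above (a bracketed MM:SS range followed by KEY: value lines); the Segment class and parse_segments function are hypothetical names, not part of the workflow repository.

```python
import re
from dataclasses import dataclass

# Assumed header layout: "[MM:SS - MM:SS]" on its own line.
HEADER = re.compile(r"\[(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2})\]")


@dataclass
class Segment:
    start: int  # seconds from the start of the video
    end: int
    narration: str
    visual: str
    mood: str

    @property
    def duration(self) -> int:
        return self.end - self.start


def parse_segments(script: str) -> list[Segment]:
    """Turn the timed script text into a list of Segment objects."""
    raw_segments = []
    current = None
    for raw_line in script.splitlines():
        line = raw_line.strip()
        match = HEADER.fullmatch(line)
        if match:
            m1, s1, m2, s2 = (int(g) for g in match.groups())
            current = {"start": m1 * 60 + s1, "end": m2 * 60 + s2}
            raw_segments.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so narration may contain colons.
            key, value = line.split(":", 1)
            current[key.strip().lower()] = value.strip()
    return [
        Segment(s["start"], s["end"], s.get("narration", ""),
                s.get("visual", ""), s.get("mood", ""))
        for s in raw_segments
    ]


SAMPLE = """[00:00 - 00:12]
NARRATION: Neural networks learn by adjusting millions of tiny connections based on examples.
VISUAL: Abstract animation of nodes and weighted edges shifting
MOOD: technical, clean"""

segments = parse_segments(SAMPLE)
```

Keeping the timing, narration, and visual fields in one structure is what lets the frame and voiceover stages work from the same source of truth.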
Stage 2: Frame Generation via HyperFrames
With the script parsed, Claude Code calls HyperFrames for each segment. HyperFrames reads the visual description and mood tags, constructs a prompt for the configured image/video model, and generates the frames.
You can configure HyperFrames to output:
- Static frames — one image per segment, good for slide-style videos
- Animated loops — short looping clips (2–4 seconds) per segment
- Continuous clips — longer video clips stitched together per segment
Frame resolution, frame rate, and output format are all configurable. For most use cases, 1080p at 24fps is a reasonable default.
HyperFrames also handles batching. If your script has 10 segments, it does not generate them strictly one after another; it queues the requests so they can run concurrently, and the next stage begins only once every segment's output has returned.
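The queue-then-wait pattern can be sketched in Python with a thread pool. This is not HyperFrames' actual implementation: generate_frame is a hypothetical stand-in for whatever backend call is configured, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_frame(index: int, visual: str) -> str:
    """Hypothetical stand-in for a backend call; returns an output path."""
    return f"frames/segment_{index:02d}.png"


def generate_all(visuals: list[str], max_workers: int = 4) -> list[str]:
    """Run all segment requests concurrently and wait for every result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so outputs stay aligned
        # with the script segments even if requests finish out of order.
        return list(pool.map(generate_frame, range(len(visuals)), visuals))


paths = generate_all(["nodes and edges", "loss curve", "gradient arrows"])
```

Because the function only returns once the pool has drained, the caller gets the "all outputs before the next stage" behavior described above.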
Stage 3: Voiceover Synthesis via ElevenLabs
While frames are generating (or immediately after, depending on your setup), Claude Code sends the narration text from each segment to ElevenLabs.
The ElevenLabs API returns two things: the audio file and a JSON object with word-level timestamps. That timestamp data is what enables precise sync later.
You configure voice selection in the initial prompt or in a config file. ElevenLabs supports stability and similarity boost parameters that affect how consistent and expressive the voice sounds — the defaults work well for most content.
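To make the timestamp data concrete, here is a small Python sketch that groups timing data into per-word start and end times. It assumes the timing JSON arrives as parallel per-character arrays; the field names used below are an assumption about the response shape, so check them against the current ElevenLabs API reference before relying on them.

```python
def words_from_alignment(alignment: dict) -> list[tuple[str, float, float]]:
    """Group per-character timings into (word, start, end) tuples."""
    chars = alignment["characters"]
    starts = alignment["character_start_times_seconds"]
    ends = alignment["character_end_times_seconds"]

    words = []
    buffer, word_start, word_end = "", None, None
    for ch, start, end in zip(chars, starts, ends):
        if ch.isspace():
            # Whitespace closes the word collected so far, if any.
            if buffer:
                words.append((buffer, word_start, word_end))
            buffer, word_start = "", None
        else:
            if not buffer:
                word_start = start
            buffer += ch
            word_end = end
    if buffer:
        words.append((buffer, word_start, word_end))
    return words


# Hand-made example data, not a real API response.
example = {
    "characters": ["H", "i", " ", "y", "o"],
    "character_start_times_seconds": [0.0, 0.1, 0.2, 0.3, 0.4],
    "character_end_times_seconds": [0.1, 0.2, 0.3, 0.4, 0.5],
}
word_timings = words_from_alignment(example)
```

The end time of the last word gives the measured narration length for the segment, which is the number the assembly step needs.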
Stage 4: Sync and Assembly
This is where the workflow earns its efficiency. Claude Code uses the ElevenLabs timestamp data alongside the HyperFrames output timing to construct an ffmpeg command that merges audio and video.
Each narration segment gets mapped to its corresponding visual segment. If ElevenLabs returns that the narration for segment 3 runs 14.2 seconds instead of the expected 12, Claude Code adjusts the frame duration for that segment to match — rather than letting audio and video drift out of sync.
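The adjustment logic can be sketched as follows in Python. It is a minimal illustration under stated assumptions, not the workflow's own code: the retime function and the 0.25-second tolerance are hypothetical, and the sketch simply lets measured audio length drive each segment's on-screen duration.

```python
def retime(planned: list[float], actual: list[float],
           tolerance: float = 0.25) -> tuple[list[tuple[float, float]], list[int]]:
    """Rebuild the timeline from measured audio lengths.

    Returns the (start, end) window for each segment plus the indexes of
    segments whose audio differed from the plan by more than `tolerance`.
    """
    if len(planned) != len(actual):
        raise ValueError("planned and actual must have the same length")

    timeline, adjusted = [], []
    cursor = 0.0
    for index, (plan, real) in enumerate(zip(planned, actual)):
        if abs(real - plan) > tolerance:
            adjusted.append(index)
        # Visual duration follows the audio, so drift cannot accumulate.
        timeline.append((round(cursor, 3), round(cursor + real, 3)))
        cursor += real
    return timeline, adjusted


# Third segment runs 14.2 s against a planned 12 s.
timeline, adjusted = retime([12, 12, 12], [12.0, 12.1, 14.2])
```

The adjusted list is the kind of information that could feed the timing-adjustment entries in the run log.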
The assembly step produces a single .mp4 file with the narration, visuals, and optional background music track (if configured).
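For a static-frame segment, the merge could look roughly like the Python sketch below. The command Claude Code actually generates is not shown in this guide, so treat this as one plausible form: it uses standard ffmpeg options (-loop for a still image input, -t for duration, libx264 and aac for encoding), while the function names and file paths are hypothetical.

```python
import subprocess


def build_segment_command(frame: str, audio: str, duration: float,
                          out: str, fps: int = 24) -> list[str]:
    """Build an ffmpeg command that holds one image for the audio's length."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", frame,   # repeat the still image as video input
        "-i", audio,                 # narration track
        "-t", f"{duration:.3f}",     # stop at the measured narration length
        "-r", str(fps),
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        out,
    ]


def run(command: list[str]) -> None:
    """Execute the command and raise if ffmpeg exits with an error."""
    subprocess.run(command, check=True)


command = build_segment_command(
    "frames/segment_03.png", "audio/segment_03.mp3", 14.2, "parts/segment_03.mp4"
)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.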
Stage 5: Output and Review
The finished video lands in your output directory. Claude Code also writes a brief log file that includes:
- Total runtime
- Model used for each stage
- Any segments where timing adjustments were made
- File sizes for audio, frames, and final output
Review the video. If something’s off — a segment looks wrong, a phrase sounds awkward — you can regenerate individual segments without re-running the whole workflow.
Setting Up the Workflow Step by Step
Here’s how to get from zero to your first generated video.
Step 1: Clone or download the workflow repository
The open-source workflow repository includes a workflow.js or workflow.py entry point, a config.yaml for model and voice settings, and example prompts.
Step 2: Add your API keys
Create a .env file in the root directory:
ELEVENLABS_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
HYPERFRAMES_MODEL_ENDPOINT=your_model_endpoint
If you’re routing image generation through a hosted API (like Replicate or a self-hosted ComfyUI instance), add those credentials too.
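A quick way to catch a missing key before a long run is a startup check like this Python sketch. It assumes the .env values have already been loaded into the process environment (Python does not read .env files on its own), and missing_keys is a hypothetical helper name.

```python
import os

# Variable names taken from the .env example above.
REQUIRED = (
    "ELEVENLABS_API_KEY",
    "ANTHROPIC_API_KEY",
    "HYPERFRAMES_MODEL_ENDPOINT",
)


def missing_keys(env=None) -> list[str]:
    """Return the required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
problems = missing_keys({"ELEVENLABS_API_KEY": "your_key_here"})
```

Failing fast here is cheaper than discovering a blank key after frame generation has already run.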
Step 3: Configure your defaults
Open config.yaml and set:
- voice_id: your preferred ElevenLabs voice
- output_resolution: default 1920x1080
- frame_style: static, animated, or clip
- video_length_target: in seconds; Claude Code will try to match this
- background_music: path to an optional audio file, or leave blank
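The settings above can be mirrored as a small validation step, sketched here in Python on an already-parsed dictionary. Only the 1920x1080 default comes from this guide; the other default values and the resolve_config name are assumptions for illustration.

```python
FRAME_STYLES = {"static", "animated", "clip"}

DEFAULTS = {
    "voice_id": None,
    "output_resolution": "1920x1080",  # default stated in this guide
    "frame_style": "static",           # assumed default
    "video_length_target": 90,         # assumed default, in seconds
    "background_music": None,
}


def resolve_config(user: dict) -> dict:
    """Merge user settings over defaults and validate the obvious fields."""
    config = {**DEFAULTS, **user}
    if config["frame_style"] not in FRAME_STYLES:
        raise ValueError(f"frame_style must be one of {sorted(FRAME_STYLES)}")
    width, height = (int(part) for part in config["output_resolution"].split("x"))
    if width <= 0 or height <= 0:
        raise ValueError("output_resolution must be positive, e.g. 1920x1080")
    config["width"], config["height"] = width, height
    return config


config = resolve_config({"frame_style": "clip", "video_length_target": 60})
```

Validating once at startup means a typo such as frame_style: clips surfaces immediately rather than midway through generation.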
Step 4: Run the workflow
node workflow.js --prompt "Your topic here"
Or for Python:
python workflow.py --prompt "Your topic here"
Claude Code takes over from here. You’ll see terminal output logging each stage as it completes.
Step 5: Find your output
The finished .mp4 and generation log are written to /output/[timestamp]/. That’s it.
Common Issues and How to Fix Them
Even a well-structured workflow has rough edges. Here are the most frequent problems and their fixes.
Audio and video fall out of sync
This usually happens when ElevenLabs returns audio that’s significantly longer or shorter than the target segment duration. The workflow’s auto-adjustment should handle minor drift, but if a segment’s timing is way off, Claude Code may miscalculate.
Fix: Adjust the stability parameter in the ElevenLabs config. Lower stability gives more natural, expressive pacing but more variance between generations, so raise it if segment lengths are swinging unpredictably. Or manually cap segment narration length; shorter sentences give ElevenLabs less room to stretch.
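One way to cap narration length before any audio is generated is a word-count estimate, sketched below in Python. The 150 words-per-minute speaking rate and the 15% slack are assumptions, not values from the workflow, and real synthesized speech will vary by voice and settings.

```python
def estimated_seconds(text: str, words_per_minute: float = 150) -> float:
    """Estimate spoken duration from word count at an assumed speaking rate."""
    return len(text.split()) / words_per_minute * 60


def fits(text: str, target_seconds: float, slack: float = 0.15) -> bool:
    """True if the estimate is within the target plus a tolerance margin."""
    return estimated_seconds(text) <= target_seconds * (1 + slack)


narration = ("Neural networks learn by adjusting millions of tiny "
             "connections based on examples.")
estimate = estimated_seconds(narration)
```

A segment that fails this check is a candidate for trimming or splitting before it is sent to ElevenLabs.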
Frame generation is slow
If you’re using a hosted image model and generation is taking more than 3–4 minutes per segment, you’re likely hitting API rate limits or using a model that’s too slow for the default timeout.
Fix: Switch to a faster model (FLUX Schnell instead of FLUX Dev, for example), reduce resolution during testing, or increase the generation_timeout value in config.
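Alongside those fixes, a common complement (not something this guide says the workflow does) is to retry timed-out requests with exponential backoff. The Python sketch below shows the pattern; call_with_retry and its defaults are hypothetical.

```python
import time


def call_with_retry(fn, attempts: int = 4, base_delay: float = 1.0,
                    sleep=time.sleep):
    """Call fn(), retrying on TimeoutError with doubling delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake request that fails twice, then succeeds.
delays = []
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("backend too slow")
    return "frame.png"


result = call_with_retry(flaky_request, sleep=delays.append)
```

Injecting the sleep function keeps the example instant to run while still showing the delay schedule.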
Claude Code misinterprets the visual descriptions
HyperFrames passes visual prompts directly to your image model. If Claude Code writes vague or inconsistent visual descriptions, frame quality suffers.
Fix: Add a visual_style_guide field to your config. Something like “minimalist, dark background, clean typography, abstract shapes” gives Claude Code a consistent reference when writing visual descriptions for each segment.
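To show how a style guide could keep prompts consistent, here is a Python sketch that combines a segment's visual description, its mood tags, and a shared style string. How HyperFrames actually constructs its prompts is not documented in this guide, so build_image_prompt is purely illustrative.

```python
def build_image_prompt(visual: str, mood: str, style_guide: str = "") -> str:
    """Join the per-segment fields with the shared style guide."""
    parts = [visual.strip(), mood.strip(), style_guide.strip()]
    return ", ".join(part for part in parts if part)


prompt = build_image_prompt(
    "Abstract animation of nodes and weighted edges shifting",
    "technical, clean",
    "minimalist, dark background, clean typography, abstract shapes",
)
```

Because the same style string is appended to every segment, frames share a look even when the individual descriptions vary.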
ffmpeg errors during assembly
This is almost always a path or permissions issue. Make sure ffmpeg is installed and accessible in your system PATH (ffmpeg -version should work in your terminal). Also verify that the output directory exists and is writable.
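Both checks can be automated with the Python standard library. The sketch below is a hypothetical preflight helper, not part of the workflow; it reports problems instead of raising so that all issues show up in one pass.

```python
import os
import shutil


def preflight(output_dir: str) -> list[str]:
    """Return a list of human-readable problems; empty means ready."""
    problems = []
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    if not os.path.isdir(output_dir):
        problems.append(f"output directory missing: {output_dir}")
    elif not os.access(output_dir, os.W_OK):
        problems.append(f"output directory not writable: {output_dir}")
    return problems


issues = preflight("output")
for issue in issues:
    print("preflight:", issue)
```

Running this before Stage 1 avoids spending generation time on a run that cannot be assembled.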
Where MindStudio Fits Into This Kind of Workflow
Claude Code is powerful, but it runs locally and requires terminal access. That’s fine for developers — but if you want to make this workflow accessible to a broader team, or run it automatically on a schedule, you need something else.
That’s where MindStudio’s AI Media Workbench comes in. MindStudio is a no-code platform that lets you build and deploy AI agents and automated workflows — including media generation pipelines like this one.
Instead of running a terminal command, your team members can open a simple UI, type a topic prompt, and get a finished video back. MindStudio handles the orchestration layer — calling ElevenLabs, routing image generation, merging outputs — without anyone needing to manage API keys or run local scripts.
More specifically, MindStudio’s Agent Skills Plugin exposes typed method calls like agent.generateImage(), agent.runWorkflow(), and agent.generateAudio() that any AI agent — including Claude — can call directly. You can wire the same pipeline described in this article into a MindStudio workflow, then deploy it as a web app, an email-triggered agent, or a scheduled background job.
If your team is producing regular AI video content — weekly summaries, product demos, social clips — automating it through a platform like MindStudio removes the need for anyone to babysit a local script each time.
You can try MindStudio free at mindstudio.ai.
Customizing the Output for Different Use Cases
The default workflow produces a narrated, explainer-style video. But you can push it in different directions with config changes and prompt adjustments.
Short-form social content
Set video_length_target to 30–60 seconds and add aspect_ratio: 9:16 to config. This outputs vertical video suitable for TikTok, Instagram Reels, or YouTube Shorts. Adjust the visual_style_guide to favor bold, high-contrast frames that read well on mobile.
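As a worked example of what the aspect ratio setting implies for pixel dimensions, the Python sketch below derives a resolution from a ratio string. Whether the workflow computes it this way is an assumption; resolution_for and the 1920-pixel long side are illustrative choices.

```python
def resolution_for(aspect_ratio: str, long_side: int = 1920) -> tuple[int, int]:
    """Convert a ratio like '9:16' into (width, height) in pixels."""
    w, h = (int(part) for part in aspect_ratio.split(":"))
    if w >= h:
        width, height = long_side, round(long_side * h / w)
    else:
        width, height = round(long_side * w / h), long_side
    # H.264 with yuv420p needs even dimensions, so round down to even.
    return width - width % 2, height - height % 2


vertical = resolution_for("9:16")
landscape = resolution_for("16:9")
```

For 9:16 this yields 1080 by 1920, the usual vertical format for the platforms listed above.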
Product explainers
For product demos, add a product_context field to your prompt with a description of the product. Claude Code will incorporate this into the script and visual descriptions — referencing the product specifically rather than generating generic footage.
Multilingual content
ElevenLabs supports over 30 languages. Set voice_language in config and write your prompt in the target language. Claude Code will generate the script in that language, and ElevenLabs will synthesize it with a native-sounding voice.
Branded content
Add a brand_config block to config with your color palette, logo path, and font preferences. HyperFrames can apply these consistently across frames, and the assembly step can add a branded intro/outro if you provide those clips.
FAQ
What does HyperFrames actually generate — images or video?
HyperFrames generates whatever your configured backend model supports. In static mode, it outputs images (one per script segment) that are assembled into a video by ffmpeg. In animated or clip mode, it outputs short video files per segment. The choice affects both generation time and visual quality — static frames are faster, clips look more dynamic.
Do I need a powerful computer to run this workflow?
It depends on your image generation setup. If you’re routing image generation to a hosted API (Replicate, Stability AI’s API, etc.), your local machine just needs to run Node.js or Python, and even a basic laptop works. If you’re running HyperFrames with a local model via Ollama or ComfyUI, you’ll want a GPU. ElevenLabs synthesis and Claude’s model inference both happen over the API, so neither needs local compute beyond the terminal session that runs Claude Code.
How much does it cost to generate one video?
Costs vary by model choice and video length, but a rough estimate for a 90-second video:
- ElevenLabs: ~$0.10–$0.30 depending on character count and plan
- Image generation: $0.01–$0.05 per frame on hosted APIs
- Claude Code API calls: typically under $0.10 for script generation and orchestration
Total for a typical 90-second video: under $1.00 with hosted APIs. Local models bring the image generation cost to zero.
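The arithmetic behind that total can be checked with a short Python sketch. It uses the ranges quoted above and assumes static mode with eight frames, a hypothetical segment count chosen only for illustration.

```python
# Ranges quoted above for a 90-second video, in US dollars.
VOICE = (0.10, 0.30)          # ElevenLabs narration
PER_FRAME = (0.01, 0.05)      # hosted image generation, per frame
ORCHESTRATION = (0.0, 0.10)   # Claude Code calls ("under $0.10")


def estimate_cost(frames: int, local_images: bool = False) -> tuple[float, float]:
    """Return a (low, high) estimate for one video."""
    frame_low, frame_high = (0.0, 0.0) if local_images else PER_FRAME
    low = VOICE[0] + frames * frame_low + ORCHESTRATION[0]
    high = VOICE[1] + frames * frame_high + ORCHESTRATION[1]
    return round(low, 2), round(high, 2)


hosted = estimate_cost(8)                     # eight static frames, hosted API
local = estimate_cost(8, local_images=True)   # same run with a local model
```

With eight frames the hosted estimate spans roughly $0.18 to $0.80, which is consistent with the under-$1.00 figure; animated or clip modes that generate many more frames would push the image line higher.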
Can I use this for commercial content?
Check the terms of service for each tool you’re using. ElevenLabs allows commercial use on paid plans. Image model licensing varies — some open-source models have non-commercial restrictions. Claude’s API terms permit commercial use. Always verify before distributing content.
How do I improve the quality of the generated visuals?
Three things make the biggest difference: better visual prompts, a higher-quality image model, and consistent style guidance. Add a detailed visual_style_guide to your config, use a stronger model (FLUX Pro or similar), and avoid vague visual descriptions in your prompts. The more specific the visual description, the better the resulting frames tend to be.
Can I regenerate only one segment without re-running everything?
Yes. The workflow stores intermediate outputs (script, audio files, frames) in the session directory. It supports a --regenerate-segment [N] flag that reruns generation for a specific segment and reassembles the final video. This is significantly faster than a full re-run.
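For readers curious how an entry point might distinguish a full run from a single-segment rerun, here is a Python sketch using argparse. The two flag names come from this guide; everything else, including the plan function and its return strings, is hypothetical and not taken from the repository.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="workflow.py")
    parser.add_argument("--prompt", help="topic for a fresh run")
    parser.add_argument("--regenerate-segment", type=int, metavar="N",
                        help="rerun one segment using cached outputs")
    return parser


def plan(argv: list[str]) -> str:
    """Decide which path to take based on the command-line arguments."""
    args = build_parser().parse_args(argv)
    if args.regenerate_segment is not None:
        return f"regenerate segment {args.regenerate_segment}, then reassemble"
    if args.prompt:
        return "full run"
    raise SystemExit("provide --prompt or --regenerate-segment")


action = plan(["--regenerate-segment", "3"])
```

The rerun path only pays for one segment's frames and audio, which is why it finishes much sooner than a full run.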
Key Takeaways
- The Claude Code + HyperFrames + ElevenLabs workflow automates the full AI video pipeline: scripting, frame generation, voiceover synthesis, and final assembly
- The five-stage process (ingestion → frames → voiceover → sync → output) can produce a finished video in under 15 minutes from a single prompt
- ElevenLabs word-level timestamp data is what enables audio-visual sync — it’s not just a text-to-speech call
- Common issues (sync drift, slow generation, bad visuals) have straightforward fixes once you know where to look
- For teams who want this workflow without terminal access, MindStudio can wrap the same pipeline into a deployable no-code agent — accessible to anyone on your team
If you’re building AI content workflows and want a platform that handles orchestration, integrations, and deployment without requiring DevOps, MindStudio is worth a look.