How to Build an AI Video Generation Workflow with Claude Code and HyperFrames
Learn how to generate fully automated YouTube Shorts with audio, animation, and transitions using Claude Code, HyperFrames, and ElevenLabs.
What You’re Actually Building Here
Automated video generation has moved from novelty to practical territory. With the right combination of tools, you can build a pipeline that takes a text prompt and outputs a finished YouTube Short — complete with voiceover, animated visuals, and transitions — without touching a video editor.
This guide walks through an AI video generation workflow using Claude Code as the orchestration engine, HyperFrames for animated visual output, and ElevenLabs for voice synthesis. The result is a repeatable system you can run on demand or put on a schedule.
Whether you’re building a content automation system, an internal production tool, or just experimenting with agentic media workflows, this stack is a practical starting point.
What Each Tool Does in This Stack
Before wiring things together, it helps to understand what each tool actually contributes — and where its limits are.
Claude Code
Claude Code is Anthropic’s agentic CLI tool. It runs in your terminal and can write, execute, debug, and iterate on code autonomously. Unlike a standard coding assistant, Claude Code can take multi-step instructions, run commands, inspect outputs, and course-correct — making it well suited to orchestrating complex workflows where multiple APIs need to coordinate.
In this stack, Claude Code serves as the brain: it writes the script content, calls APIs in the right sequence, handles file management, and stitches the final output together.
HyperFrames
HyperFrames is a video generation framework built for programmatic animation. It works by defining visual scenes through structured data — keyframes, transitions, overlays — and rendering them to video. Rather than generating a single long video clip from a prompt (which models still struggle with for short-form content), HyperFrames lets you build a video scene-by-scene with predictable, composable outputs.
This matters because YouTube Shorts depend on pacing, text overlays, and rapid cuts — things that are hard to control with a single model call but straightforward to define programmatically.
ElevenLabs
ElevenLabs handles text-to-speech conversion. It produces natural-sounding voiceovers from a script, with support for different voices, speaking rates, and emotional tones. Its API is straightforward: send text, get back an audio file. The output quality is high enough that most viewers won’t distinguish it from a human recording.
Prerequisites Before You Start
This workflow assumes you have a few things in place. Set these up before running any code:
- Claude Code installed — Requires an Anthropic API key. Install it via npm install -g @anthropic-ai/claude-code.
- HyperFrames — Install via pip install hyperframes and have FFmpeg available on your system path.
- ElevenLabs API key — Sign up at ElevenLabs and generate an API key from your dashboard.
- Python 3.10+ — The workflow scripts run in Python.
- Basic terminal familiarity — You don’t need to write code from scratch, but you’ll be running commands and editing config files.
Set your API keys as environment variables so they’re available across sessions:
export ANTHROPIC_API_KEY="your-key-here"
export ELEVENLABS_API_KEY="your-key-here"
Step 1: Define the Workflow Architecture
The production pipeline runs in four sequential stages:
- Script generation — Claude writes a voiceover script based on your topic input
- Audio generation — ElevenLabs converts the script to an MP3
- Visual generation — HyperFrames renders a sequence of animated scenes timed to the audio
- Assembly — FFmpeg merges audio and video into a final MP4
Each stage writes its output to a temp directory, and the next stage reads from it. This makes debugging easier — if the audio is off, you can regenerate it without rerunning the visual stage.
Create your project structure:
/video-pipeline
/scripts ← generated scripts
/audio ← ElevenLabs MP3 output
/frames ← HyperFrames scene output
/output ← final assembled MP4s
pipeline.py ← main orchestration file
config.yaml ← topic, voice, style settings
Step 2: Generate the Script with Claude
Open your terminal in the project directory and launch Claude Code:
claude
Give it a task like this:
Write a Python function called generate_script(topic, duration_seconds) that:
- Uses the Anthropic API to generate a YouTube Shorts script on the given topic
- Targets the specified duration (assume ~2.5 words per second for speech)
- Returns a dict with keys: "title", "script", "scenes" where scenes is a list of
short visual cues (one per 5-8 seconds of content)
- Saves the output as JSON to ./scripts/{topic_slug}.json
Claude Code will write the function, test it, and fix any issues. A generated script object looks like this:
{
  "title": "Why Sleep Debt Is Real",
  "script": "Most people think you can catch up on sleep over the weekend. You can't. Sleep debt accumulates and affects memory, mood, and metabolism...",
  "scenes": [
    "Person staring at phone at 2am, blue light, dark room",
    "Split brain graphic — left side sharp, right side blurry",
    "Calendar showing weekend, zzz icons, still-tired face",
    "Data chart showing cumulative sleep deficit over a week"
  ]
}
The scenes array is what HyperFrames will use to render each visual segment.
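If you want a reference point for what Claude Code tends to produce here, a minimal sketch of generate_script follows. It uses the official anthropic Python SDK; the model name and the slug logic are illustrative, and the prompt asks Claude to reply with raw JSON so the function can parse it directly.

import json
import os
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

def generate_script(topic, duration_seconds):
    word_budget = int(duration_seconds * 2.5)  # ~2.5 spoken words per second
    prompt = (
        f"Write a YouTube Shorts voiceover script about: {topic}.\n"
        f"Target roughly {word_budget} words.\n"
        'Reply with JSON only, using keys "title", "script", and "scenes" '
        "(one short visual cue per 5-8 seconds of content)."
    )
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative; use any Claude model you have access to
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON; a production version should handle fenced output
    data = json.loads(message.content[0].text)
    slug = topic.lower().replace(" ", "-")  # simple slug; swap in a real slugifier if needed
    os.makedirs("./scripts", exist_ok=True)
    with open(f"./scripts/{slug}.json", "w") as f:
        json.dump(data, f, indent=2)
    return data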
Step 3: Convert the Script to Audio
With the script JSON saved, have Claude Code write the audio generation step:
Write a function called generate_audio(script_path, voice_id) that:
- Loads the JSON from script_path
- Sends the script text to the ElevenLabs API using the requests library
- Uses the voice_id parameter and a speaking_rate of 1.1
- Saves the MP3 to ./audio/{topic_slug}.mp3
- Returns the audio duration in seconds
Claude Code will handle the ElevenLabs API call. The key parameters to know:
- voice_id: Each voice in ElevenLabs has a unique ID. You can browse available voices in their dashboard and hardcode a preferred one in your config.yaml.
- model_id: Use eleven_multilingual_v2 for the best quality output.
- speaking_rate: Values between 0.9 and 1.2 work best for Shorts content — fast enough to feel energetic, slow enough to be understood.
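As a reference, here is a minimal sketch of generate_audio using the requests library. The endpoint and xi-api-key header match ElevenLabs' public REST API; the speaking-rate field inside voice_settings follows this article's description and is an assumption, so verify it against the current ElevenLabs schema. Duration is read with ffprobe, which ships with FFmpeg (already a prerequisite).

import json
import os
import subprocess
import requests

def generate_audio(script_path, voice_id):
    with open(script_path) as f:
        script = json.load(f)
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={
            "text": script["script"],
            "model_id": "eleven_multilingual_v2",
            # Rate control per this article; confirm the field name in the current API docs
            "voice_settings": {"speed": 1.1},
        },
        timeout=120,
    )
    response.raise_for_status()
    slug = os.path.splitext(os.path.basename(script_path))[0]
    audio_path = f"./audio/{slug}.mp3"
    with open(audio_path, "wb") as f:
        f.write(response.content)
    # ffprobe reports the container duration in seconds
    probe = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", audio_path],
        capture_output=True, text=True, check=True,
    )
    return float(probe.stdout.strip())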
After running this step, you’ll have an MP3 in your /audio directory. Play it back and check the pacing before moving on.
Step 4: Generate Animated Visuals with HyperFrames
This is where the bulk of the visual work happens. HyperFrames works through a scene definition model: you describe each visual segment, and it renders them to individual clips that can be concatenated.
Ask Claude Code to write the visual generation step:
Write a function called generate_visuals(script_path, audio_duration) that:
- Loads the scenes list from the script JSON
- Divides the audio_duration evenly across the number of scenes
- For each scene, creates a HyperFrames Scene object with:
- A text overlay from the scene description
- A background color or gradient (alternate between dark and light)
- A simple animation (fade-in, slide-up, or zoom-in, rotated through scenes)
- Duration matching the per-scene time slice
- Exports each scene as a separate MP4 to ./frames/
- Returns a list of frame file paths
HyperFrames scene definitions look roughly like this in Python:
from hyperframes import Scene, TextOverlay, Animation
scene = Scene(
    duration=6.5,
    background="#0f0f0f",
    animation=Animation.FADE_IN
)
scene.add_overlay(TextOverlay(
    text="Sleep debt accumulates.",
    font_size=48,
    position="center",
    color="#ffffff"
))
scene.render("./frames/scene_01.mp4")
For YouTube Shorts, set the output resolution to 1080x1920 (vertical). HyperFrames supports this via a resolution parameter on the Scene constructor.
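Putting those pieces together, a sketch of generate_visuals built only from the HyperFrames calls shown above might look like this. The Animation.SLIDE_UP and Animation.ZOOM_IN constants are assumed to exist alongside FADE_IN; check the framework's documentation for the exact names.

import json
from hyperframes import Scene, TextOverlay, Animation

# Rotate through the three animations the prompt asked for
ANIMATIONS = [Animation.FADE_IN, Animation.SLIDE_UP, Animation.ZOOM_IN]

def generate_visuals(script_path, audio_duration):
    with open(script_path) as f:
        scenes = json.load(f)["scenes"]
    per_scene = audio_duration / len(scenes)  # divide runtime evenly across scenes
    frame_paths = []
    for i, cue in enumerate(scenes):
        dark = i % 2 == 0  # alternate dark and light backgrounds
        scene = Scene(
            duration=per_scene,
            background="#0f0f0f" if dark else "#f5f5f5",
            animation=ANIMATIONS[i % len(ANIMATIONS)],
            resolution=(1080, 1920),  # vertical format for Shorts
        )
        scene.add_overlay(TextOverlay(
            text=cue,
            font_size=48,
            position="center",
            color="#ffffff" if dark else "#0f0f0f",
        ))
        path = f"./frames/scene_{i + 1:03d}.mp4"  # zero-padded so concat order sorts correctly
        scene.render(path)
        frame_paths.append(path)
    return frame_paths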
Adding Transitions Between Scenes
Smooth transitions matter for watch time. HyperFrames supports transition clips between scenes. Have Claude Code add a transition layer:
Modify the generate_visuals function to insert a 0.3-second crossfade transition
between each scene clip. Use HyperFrames' Transition.CROSSFADE type.
Save transition clips to ./frames/transitions/.
The final frame directory should have alternating scene and transition files, named sequentially so FFmpeg can concatenate them in order.
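The concat list FFmpeg reads in the next step is a plain text file in the concat demuxer's format, one clip per line in playback order:

file 'frames/scene_001.mp4'
file 'frames/transitions/transition_001.mp4'
file 'frames/scene_002.mp4'
file 'frames/transitions/transition_002.mp4'
file 'frames/scene_003.mp4'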
Step 5: Assemble the Final Video
With audio and visual clips ready, Claude Code handles the final assembly:
Write a function called assemble_video(frame_paths, audio_path, output_name) that:
- Creates an FFmpeg concat list from frame_paths in order
- Merges the concatenated video with the audio file
- Outputs to ./output/{output_name}.mp4
- Ensures the video is exactly as long as the audio (trim or pad if needed)
- Adds a subtitle track by extracting text from the scene overlays
The FFmpeg command Claude Code generates will look something like:
ffmpeg -f concat -safe 0 -i frames_list.txt -i audio/topic.mp3 \
-c:v libx264 -c:a aac -shortest output/topic.mp4
The -shortest flag ensures the video doesn’t extend beyond the audio track, which prevents awkward silent endings.
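A minimal sketch of assemble_video around that command, with the subtitle track omitted for brevity since that part depends on how Claude Code chooses to extract the overlay text:

import os
import subprocess

def assemble_video(frame_paths, audio_path, output_name):
    # Build the concat list the demuxer expects (see the format shown earlier)
    with open("frames_list.txt", "w") as f:
        for path in frame_paths:
            f.write(f"file '{path}'\n")
    os.makedirs("./output", exist_ok=True)
    output_path = f"./output/{output_name}.mp4"
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "frames_list.txt",
         "-i", audio_path, "-c:v", "libx264", "-c:a", "aac",
         "-shortest", output_path],  # -shortest trims video to the audio length
        check=True,
    )
    return output_path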
Step 6: Wire It All Together in a Single Pipeline
Now have Claude Code write the orchestration entry point:
Write a main() function in pipeline.py that:
- Reads topic, voice_id, and output_name from config.yaml
- Calls generate_script, generate_audio, generate_visuals, and assemble_video in sequence
- Logs progress at each stage
- Catches and reports errors without crashing the entire pipeline
- Prints the final output path when complete
Your config.yaml looks like this:
topic: "Why sleep debt is real"
voice_id: "21m00Tcm4TlvDq8ikWAM"
output_name: "sleep-debt-short"
duration_target: 55
style:
primary_color: "#0f0f0f"
accent_color: "#4ade80"
font: "Inter"
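A minimal version of that entry point, assuming the functions sketched in the earlier steps and the same topic-slug convention throughout:

import logging
import yaml  # pip install pyyaml

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("pipeline")

def main():
    with open("config.yaml") as f:
        config = yaml.safe_load(f)
    slug = config["topic"].lower().replace(" ", "-")
    try:
        log.info("Stage 1/4: script")
        generate_script(config["topic"], config["duration_target"])
        log.info("Stage 2/4: audio")
        duration = generate_audio(f"./scripts/{slug}.json", config["voice_id"])
        log.info("Stage 3/4: visuals")
        frames = generate_visuals(f"./scripts/{slug}.json", duration)
        log.info("Stage 4/4: assembly")
        output_path = assemble_video(frames, f"./audio/{slug}.mp3", config["output_name"])
        log.info("Final output: %s", output_path)
    except Exception:
        # Fail loudly but keep earlier stage outputs on disk for debugging
        log.exception("Pipeline failed")

if __name__ == "__main__":
    main()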
Run the full pipeline with:
python pipeline.py
A 55-second Short takes roughly 2–4 minutes to generate end-to-end on a standard machine, depending on API response times.
Common Problems and How to Fix Them
Audio and video are out of sync
This usually means the script was too long for the target duration. Fix it by tightening the prompt: add a word count constraint to generate_script based on duration_target * 2.5.
HyperFrames crashes on vertical resolution
Some older versions of HyperFrames default to landscape output. Make sure you’re on the latest version (pip install --upgrade hyperframes) and explicitly pass resolution=(1080, 1920) to each Scene constructor.
ElevenLabs returns a 429 error
You’ve hit the rate limit. The free tier allows a limited number of characters per month. Add a time.sleep(1) between API calls, or upgrade your ElevenLabs plan if you’re running in bulk.
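If you hit 429s regularly, a small retry wrapper with exponential backoff is more robust than a fixed sleep. A sketch (post_with_backoff is an illustrative helper, not part of any library):

import time
import requests

def post_with_backoff(url, attempts=4, **kwargs):
    # Retry only on 429; back off 1s, 2s, 4s... between attempts
    for attempt in range(attempts):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        time.sleep(2 ** attempt)
    response.raise_for_status()  # still rate-limited after all attempts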
Claude Code rewrites files it shouldn’t
Claude Code is aggressive about editing files. Protect generated outputs you want to keep across runs with permission rules — for example, deny rules for those paths in your project's .claude/settings.json.
FFmpeg concat order is wrong
This happens when frame filenames aren’t zero-padded (e.g., scene_1 sorts after scene_10). Rename files with zero-padded numbers (scene_001, scene_002) when saving them in the generate_visuals function.
How MindStudio Fits Into This Workflow
The pipeline described above works well as a developer project, but it has a ceiling: every new topic requires someone to run a terminal command, manage environment variables, and monitor the output.
If you want this running as a self-service tool or a scheduled content engine, that’s where MindStudio’s AI Media Workbench becomes useful.
MindStudio’s media workbench gives you access to image and video generation models — plus 24+ media tools including clip merging, subtitle generation, and audio overlay — all in one place, without managing infrastructure. You can chain those tools into a workflow that accepts a topic as input and delivers a finished video without any terminal access.
The difference from the Claude Code approach is deployment and access. The pipeline you build in this guide lives on your machine. A MindStudio workflow can be shared as a link, triggered via webhook, or run on a schedule — meaning your non-technical teammates can use it too.
The MindStudio Agent Skills Plugin also lets Claude Code call MindStudio capabilities directly as method calls (agent.generateVideo(), agent.mergeClips()), so you can keep Claude Code as the orchestration layer while offloading the heavy media work to MindStudio’s infrastructure.
You can try MindStudio free at mindstudio.ai.
Scaling and Extending the Workflow
Once the basic pipeline works, there are several directions you can take it:
Batch generation — Modify pipeline.py to read a list of topics from a CSV and process them sequentially (a sketch follows this list). Add a delay between runs to stay within API rate limits.
Custom voice cloning — ElevenLabs supports voice cloning from a short audio sample. If you’re building a branded content channel, clone a consistent voice and use that voice ID across all generated videos.
Dynamic visuals — Instead of generated text overlays, integrate an image generation model (FLUX or DALL-E) to create scene-specific images. HyperFrames can use image files as backgrounds instead of flat colors.
Automated upload — The YouTube Data API supports programmatic video upload. Claude Code can write that integration too — add an upload_to_youtube(output_path, title, description, tags) step to complete the end-to-end loop.
Analytics feedback — Pull YouTube Studio analytics for previous Shorts and feed retention data back into the script generation prompt. Topics with higher watch time get used as examples in future prompts.
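For the batch-generation idea, a minimal sketch. It assumes main() has been refactored into a hypothetical run_pipeline(topic, voice_id, output_name) function that takes its settings as arguments instead of reading config.yaml:

import csv
import time

def run_batch(csv_path, delay_seconds=30):
    # Expects CSV columns: topic, voice_id, output_name
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            run_pipeline(row["topic"], row["voice_id"], row["output_name"])
            time.sleep(delay_seconds)  # breathing room between runs for API rate limits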
Frequently Asked Questions
What is HyperFrames and how does it differ from other video generation tools?
HyperFrames is a programmatic video generation framework that builds video from structured scene definitions — keyframes, overlays, transitions — rather than generating video from a free-form prompt. This makes it more predictable and composable than diffusion-based video models for short-form content where timing, text placement, and scene pacing need to be precise.
Can Claude Code fully automate the video pipeline without human input?
Yes, once the pipeline is set up. Claude Code can write and execute all four stages — script, audio, visuals, assembly — in sequence. The main human input is the initial topic and config settings. After that, the pipeline runs autonomously to a finished MP4.
How much does it cost to generate a YouTube Short with this stack?
Rough per-video estimates at current pricing: Claude API (script generation) is a few cents per call. ElevenLabs (55 seconds of audio) runs around $0.01–$0.05 depending on your plan tier. HyperFrames is open source and free. FFmpeg is free. Total cost per Short is typically under $0.10 at moderate volume.
What’s the output quality like compared to human-produced content?
The voiceover quality is high — ElevenLabs outputs are consistently clear and natural-sounding. The visuals from HyperFrames are clean but design-forward rather than cinematic; they work well for educational or information-dense content. For content that requires photorealistic footage or character animation, you’d need to integrate a video generation model as a background layer.
Can this workflow run on a schedule without manual triggering?
The pipeline as written requires a manual python pipeline.py command. To run it on a schedule, you can use cron on Linux/Mac or Task Scheduler on Windows, or wrap it in a MindStudio agent configured to run on a daily or weekly trigger.
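For example, a crontab entry that generates a new Short every morning at 9:00 (adjust the path to wherever your project lives):

0 9 * * * cd /path/to/video-pipeline && python pipeline.py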
Is this approach viable for channels at scale, or just for experimentation?
It’s viable for real content operations, with caveats. The pipeline produces consistent output quality, but YouTube’s algorithm still rewards originality and engagement signals. Bulk-generated content can work for educational niches where information density matters more than production style. For entertainment or personality-driven content, the human element is harder to replicate.
Key Takeaways
- The Claude Code + HyperFrames + ElevenLabs stack creates a fully automated short-form video pipeline from a single topic input.
- Claude Code handles orchestration — writing, executing, and debugging each stage without manual code changes.
- HyperFrames gives you control over visual pacing, scene timing, and transitions that free-form video generation models don’t offer.
- ElevenLabs produces consistent, high-quality audio without recording equipment or voice talent.
- The pipeline costs under $0.10 per video and can be extended with batch processing, custom voices, AI-generated images, and automated YouTube upload.
- MindStudio’s AI Media Workbench can take this workflow from a local script to a shareable, schedulable content engine accessible to anyone on your team.