How to Generate AI Videos with Claude Code, HyperFrames, and ElevenLabs

A Different Way to Build AI Videos

Producing a polished AI video usually means juggling half a dozen browser tabs, wrestling with timelines, and paying per minute for tools that don’t talk to each other. There’s a better path: use Claude Code to write and orchestrate the whole thing, HyperFrames to render HTML-based scenes into video frames, and ElevenLabs to generate a natural-sounding voiceover — all from a single agentic workflow.

This guide walks through each step of that AI video generation process, from project setup to final export. It’s aimed at developers and technical builders who’d rather write a prompt than drag a slider.

Why These Three Tools Work Well Together

Each tool handles a distinct layer of the problem.

Claude Code is Anthropic’s agentic coding environment. It can write scripts, generate HTML and CSS, call APIs, run shell commands, and iterate based on output — all without you switching context. For video production, that means it can author the narrative, write every scene’s HTML animation, coordinate file I/O, and stitch together the pipeline.

HyperFrames is a programmatic video rendering library built on the idea that HTML and CSS are actually great for animation. Instead of a traditional video editor, you define scenes as web components — using CSS transitions, SVG, canvas, or JavaScript animations — and HyperFrames captures them frame by frame using a headless browser. The output is a sequence of image frames or a compiled video clip, depending on configuration.

ElevenLabs provides high-quality AI voice synthesis via API. You send it text, pick a voice, and get back an audio file. Response times are fast enough to fit inside an automated pipeline, and the output quality is good enough that many viewers can’t distinguish it from a human recording.

Together, the three tools cover script → visuals → audio, which is the full production stack for most short-form or explainer-style AI videos.

Prerequisites

Before starting, make sure you have:

Claude Code installed and authenticated (npm install -g @anthropic-ai/claude-code or via Anthropic’s site)
Node.js 18+ on your machine
HyperFrames installed (npm install hyperframes)
FFmpeg installed and available in your PATH (for merging audio and video)
An ElevenLabs API key (the free tier works for testing)
Basic familiarity with running shell commands and reading JavaScript

You don’t need deep frontend experience. Claude Code will write most of the HTML and CSS for you.

Step 1: Define Your Video Script with Claude Code

Start a Claude Code session in a new project directory. Your first task is to generate a structured script — not a prose document, but a machine-readable JSON format that drives the rest of the pipeline.

Open Claude Code and give it a prompt like:

Write a JSON script for a 60-second explainer video about how photosynthesis works. 
Each scene should have: a "scene_id", a "duration" in seconds, a "narration" string, 
and a "visual_description" string. Output only valid JSON.

Claude Code will return something like:

[
  {
    "scene_id": "intro",
    "duration": 8,
    "narration": "Every plant you see is a solar-powered factory...",
    "visual_description": "Animated sun with rays extending toward a green leaf"
  },
  {
    "scene_id": "light_absorption",
    "duration": 12,
    "narration": "Chlorophyll molecules inside the leaf capture light energy...",
    "visual_description": "Zoomed-in cell diagram showing chloroplasts glowing"
  }
]

Save this as script.json. This file becomes the single source of truth for every subsequent step.

Why JSON Over a Text Script

A text script is readable by humans. A JSON script is readable by your pipeline. Every downstream tool — HyperFrames, ElevenLabs, FFmpeg — can be driven from this same structure without manual reformatting.
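As a rough illustration of what "driven from the same structure" means, the sketch below derives every per-scene file path from a single script.json entry. The directory names follow the layout this guide uses; the helper name scenePaths is made up for this example.

```javascript
// Sketch: derive every per-scene file path from one script.json entry.
// scenePaths is a hypothetical helper; the folders match this guide's layout.
function scenePaths(scene) {
  const id = scene.scene_id;
  return {
    html: `scenes/${id}.html`,    // generated scene markup
    video: `rendered/${id}.mp4`,  // silent clip rendered from the HTML
    audio: `audio/${id}.mp3`,     // voiceover for the scene's narration
    merged: `merged/${id}.mp4`,   // clip with audio and video combined
  };
}

const example = scenePaths({ scene_id: 'intro', duration: 8 });
console.log(example.video); // rendered/intro.mp4
```

Keeping the path logic in one place means a renamed folder only has to change once, rather than in every script.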

Step 2: Generate HTML Scenes with Claude Code

For each scene in script.json, you need an HTML file that Claude Code will generate based on the visual_description field.

Ask Claude Code to write a Node.js script that loops through script.json and generates one HTML file per scene:

Write a Node.js script called generate-scenes.js. It should:
1. Read script.json
2. For each scene, call Claude's API to generate an HTML animation based on the scene's visual_description
3. Save each result to scenes/{scene_id}.html
4. The HTML should be self-contained (inline CSS and JS), 1920x1080, dark background

Claude Code will write generate-scenes.js. When you run it (node generate-scenes.js), it makes one API call per scene and writes the HTML files to a scenes/ folder.
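The script Claude Code produces will vary, but its core loop might look something like the sketch below. The model call is deliberately left as an injected function (generateHtml, a name invented here) so the loop can be exercised without network access; in a real run it would wrap a request to Claude's API. The prompt wording is also only an assumption based on the requirements listed above.

```javascript
const fs = require('fs/promises');
const path = require('path');

// Build the per-scene prompt from the scene's fields.
// The wording is illustrative; adjust it to taste.
function buildScenePrompt(scene) {
  return [
    'Create a self-contained HTML page (inline CSS and JS) at 1920x1080 with a dark background.',
    `Visual: ${scene.visual_description}`,
    'The animation should start immediately on page load, loop once, ' +
      `and complete within ${scene.duration} seconds.`,
    'Return only the HTML document.',
  ].join('\n');
}

// generateHtml is an injected async function: (prompt) => HTML string.
// Scenes are processed one at a time, so there is one call per scene.
async function generateScenes(script, generateHtml, outDir = 'scenes') {
  await fs.mkdir(outDir, { recursive: true });
  const written = [];
  for (const scene of script) {
    const html = await generateHtml(buildScenePrompt(scene));
    const file = path.join(outDir, `${scene.scene_id}.html`);
    await fs.writeFile(file, html, 'utf8');
    written.push(file);
  }
  return written;
}
```

Injecting the generator also makes it easy to swap in a cached or hand-written scene while you debug the rest of the pipeline.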

What Good Scene HTML Looks Like

Each generated HTML file should be a single, standalone page. Claude Code tends to produce clean output here — smooth CSS keyframe animations, SVG illustrations, or canvas-based motion. The key constraint is that the animation must be deterministic and not depend on user interaction, because HyperFrames will be capturing frames at a fixed interval.
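One way to think about "deterministic" is that the state of the scene at any frame should be computable from the frame number alone, with no dependence on wall-clock time or user input. The sketch below illustrates that principle with a simple slide across a 1920px canvas. It does not assume anything about how HyperFrames drives the page; the function names are invented for this example.

```javascript
// Progress through the animation at a given frame, clamped to [0, 1].
function progressAt(frame, fps, durationSeconds) {
  const t = frame / fps;
  return Math.min(Math.max(t / durationSeconds, 0), 1);
}

// Standard ease-in-out cubic curve.
function easeInOut(p) {
  return p < 0.5 ? 4 * p * p * p : 1 - Math.pow(-2 * p + 2, 3) / 2;
}

// Horizontal position (px) of an element sliding across a 1920px-wide canvas.
// The same frame number always yields the same position.
function xPosition(frame, fps, durationSeconds) {
  return Math.round(easeInOut(progressAt(frame, fps, durationSeconds)) * 1920);
}

console.log(xPosition(120, 30, 8)); // 960 (halfway through an 8-second scene)
```

CSS keyframe animations with fixed durations have the same property, which is part of why they tend to capture cleanly.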

If a scene looks static when you preview it in a browser, add this prompt to your generation step: The animation should start immediately on page load, loop once, and complete within {duration} seconds.

Step 3: Render Scenes to Video with HyperFrames

HyperFrames uses a headless Chromium instance (via Puppeteer under the hood) to open each HTML file and capture frames at a set frame rate. Those frames are compiled into a video clip.

Have Claude Code write a render-scenes.js script:

Write a Node.js script called render-scenes.js using the hyperframes library.
It should:
1. Read script.json to get scene IDs and durations
2. For each scene, render scenes/{scene_id}.html at 30fps for the scene's duration
3. Output a video file to rendered/{scene_id}.mp4
4. Use 1920x1080 resolution

The resulting script will look roughly like this (Claude Code will fill in the exact HyperFrames API calls):

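const fs = require('fs');
const path = require('path');
// NOTE: the constructor options and render() call below are illustrative;
// check them against the HyperFrames version you have installed.
const HyperFrames = require('hyperframes');
const script = require('./script.json');

async function renderAll() {
  // Make sure the output folder exists before the first render.
  fs.mkdirSync('rendered', { recursive: true });

  // Render scenes one at a time to keep CPU and memory use predictable.
  for (const scene of script) {
    const renderer = new HyperFrames({
      input: path.resolve(`scenes/${scene.scene_id}.html`),
      output: `rendered/${scene.scene_id}.mp4`,
      fps: 30,
      duration: scene.duration,
      width: 1920,
      height: 1080
    });
    await renderer.render();
    console.log(`Rendered: ${scene.scene_id}`);
  }
}

// Exit with a non-zero code on failure so later steps don't run on partial output.
renderAll().catch((err) => {
  console.error(err);
  process.exit(1);
});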

Run node render-scenes.js and you’ll get one .mp4 file per scene in the rendered/ folder. This step takes the longest: rendering is CPU-bound, and a 60-second video at 30fps means capturing 1,800 frames across its 6 scenes, which might take several minutes depending on animation complexity.
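To get a feel for how much work a render involves before you start it, you can count frames directly from script.json. The sketch below does that for the two sample scenes from Step 1 and compares two frame rates; frameCount is a hypothetical helper, not part of any library.

```javascript
// Total number of frames the renderer has to capture for a script.
function frameCount(script, fps) {
  return script.reduce((sum, scene) => sum + Math.round(scene.duration * fps), 0);
}

// The two sample scenes from Step 1 (8s + 12s = 20s of video).
const sampleScript = [
  { scene_id: 'intro', duration: 8 },
  { scene_id: 'light_absorption', duration: 12 },
];

console.log(frameCount(sampleScript, 30)); // 600
console.log(frameCount(sampleScript, 24)); // 480
```

Multiplying the frame count by a measured per-frame render time from one test scene gives a reasonable estimate for the whole video.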

Troubleshooting Render Issues

Blank frames: The HTML animation may start before Chromium finishes loading. Add a 500ms delay in HyperFrames’ start options.
Fonts missing: System fonts may not load in headless mode. Use Google Fonts with an @import in the CSS, or base64-encode fonts inline.
Slow renders: Reduce frame rate to 24fps for shorter render time. Most viewers won’t notice the difference in this type of content.

Step 4: Generate Voiceover Audio with ElevenLabs

With your scenes rendered, switch to audio. Have Claude Code write a generate-audio.js script that calls the ElevenLabs API for each scene’s narration:

Write a Node.js script called generate-audio.js that:
1. Reads script.json
2. For each scene, calls the ElevenLabs text-to-speech API using the narration field
3. Uses voice ID "21m00Tcm4TlvDq8ikWAM" (Rachel) and model "eleven_monolingual_v1"
4. Saves audio to audio/{scene_id}.mp3
5. Uses the ELEVENLABS_API_KEY environment variable for auth

Claude Code will produce a script using node-fetch or axios to hit ElevenLabs’ /v1/text-to-speech/{voice_id} endpoint. The response is a binary audio buffer — save it directly to disk.
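For reference, a minimal version of that request might look like the sketch below. It swaps node-fetch and axios for the fetch built into Node.js 18+, and splits request construction from the network call so the former can be checked offline. The helper names (buildTtsRequest, synthesizeScene) are invented here, and the exact script Claude Code writes will likely differ.

```javascript
const fs = require('fs/promises');
const path = require('path');

// Build the URL and fetch options for one text-to-speech request.
function buildTtsRequest(text, { apiKey, voiceId, modelId }) {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    options: {
      method: 'POST',
      headers: {
        'xi-api-key': apiKey,
        'Content-Type': 'application/json',
        Accept: 'audio/mpeg',
      },
      body: JSON.stringify({ text, model_id: modelId }),
    },
  };
}

// Request the audio for one scene and write the binary response to disk.
// fetchImpl defaults to the global fetch available in Node.js 18+.
async function synthesizeScene(scene, config, { fetchImpl = fetch, outDir = 'audio' } = {}) {
  const { url, options } = buildTtsRequest(scene.narration, config);
  const res = await fetchImpl(url, options);
  if (!res.ok) {
    throw new Error(`TTS failed for ${scene.scene_id}: HTTP ${res.status}`);
  }
  const audio = Buffer.from(await res.arrayBuffer());
  await fs.mkdir(outDir, { recursive: true });
  const file = path.join(outDir, `${scene.scene_id}.mp3`);
  await fs.writeFile(file, audio);
  return file;
}
```

Checking res.ok before writing matters here: without it, an error response would be saved as a broken .mp3 and only surface later at the merge step.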

Run ELEVENLABS_API_KEY=your_key node generate-audio.js and you’ll get one .mp3 per scene.

Choosing the Right Voice

ElevenLabs offers a large library of voices, each with different characteristics. For explainer-style videos:

Rachel — clear, neutral, good pacing
Adam — slightly warmer, works well for storytelling
Josh — more casual, better for social-first content

You can also clone a voice or use ElevenLabs’ voice design feature if you need something specific. Pass the voice ID as an environment variable so it’s easy to swap.
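A small sketch of that environment-variable approach is shown below. The variable name ELEVENLABS_VOICE_ID is an assumption made for this example; the fallback is the Rachel voice ID used in the Step 4 prompt.

```javascript
// Fallback voice: Rachel, the ID used in the Step 4 prompt.
const DEFAULT_VOICE_ID = '21m00Tcm4TlvDq8ikWAM';

// Read the voice ID from the environment, falling back to the default.
// ELEVENLABS_VOICE_ID is a hypothetical variable name.
function resolveVoiceId(env = process.env) {
  const id = (env.ELEVENLABS_VOICE_ID || '').trim();
  return id || DEFAULT_VOICE_ID;
}
```

With this in place, trying a different voice is a matter of setting ELEVENLABS_VOICE_ID alongside ELEVENLABS_API_KEY when you run generate-audio.js, with no code change.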

Step 5: Merge Audio and Video with FFmpeg

Each scene now exists as two files: a .mp4 video clip and an .mp3 audio clip. Claude Code can write the FFmpeg commands to merge them scene by scene, then concatenate all scenes into a final video.

Ask Claude Code:

Write a shell script called assemble.sh that:
1. For each scene in script.json, merges rendered/{scene_id}.mp4 with audio/{scene_id}.mp3 into merged/{scene_id}.mp4
2. Creates a concat.txt file listing all merged clips
3. Concatenates all clips into final_output.mp4 using FFmpeg's concat demuxer

The merge command per scene looks like:

ffmpeg -i rendered/intro.mp4 -i audio/intro.mp3 \
  -c:v copy -c:a aac -shortest merged/intro.mp4

The final concat step:

ffmpeg -f concat -safe 0 -i concat.txt -c copy final_output.mp4

Run bash assemble.sh and you’ll find final_output.mp4 ready to review.
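If you would rather keep the assembly step in Node.js alongside the other scripts, the same commands can be built from script.json. The sketch below only constructs the FFmpeg argument lists and the contents of concat.txt; the helper names are invented, and actually running FFmpeg (for example with child_process.execFileSync) is left out so the logic can be checked without it installed.

```javascript
// Arguments for merging one scene's video and audio, mirroring the command above.
function mergeArgs(sceneId) {
  return [
    '-i', `rendered/${sceneId}.mp4`,
    '-i', `audio/${sceneId}.mp3`,
    '-c:v', 'copy', '-c:a', 'aac', '-shortest',
    `merged/${sceneId}.mp4`,
  ];
}

// Contents of concat.txt: one "file '<path>'" line per clip, in script order.
function buildConcatList(script) {
  return script.map((s) => `file 'merged/${s.scene_id}.mp4'`).join('\n') + '\n';
}

// Arguments for the final concatenation step.
function concatArgs(listFile = 'concat.txt', output = 'final_output.mp4') {
  return ['-f', 'concat', '-safe', '0', '-i', listFile, '-c', 'copy', output];
}

// Example usage (requires FFmpeg on your PATH):
//   const { execFileSync } = require('child_process');
//   execFileSync('ffmpeg', mergeArgs('intro'), { stdio: 'inherit' });
```

Relative paths inside concat.txt are resolved against the location of the list file itself, so keep concat.txt in the project root if you use the paths shown here.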

Step 6: Iterate and Refine

The real advantage of this pipeline is that every layer is independently adjustable. You don’t need to re-render everything if one scene looks off.

Narration feels rushed? Edit the narration field in script.json, re-run generate-audio.js for that scene, and re-merge.
Visual looks wrong? Re-prompt Claude Code to rewrite that specific scene’s HTML, re-render just that scene with HyperFrames, and re-merge.
Timing is off? Adjust the duration field, re-render that scene, and re-merge. Regenerate the audio only if you also changed the narration text.

Because everything is driven from script.json and each step is idempotent, iteration is fast. Claude Code can also help you debug — paste an error message directly into the session and it will fix the relevant script.
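Re-running a single scene is easier if each script accepts a filter. The sketch below parses a hypothetical --scene flag (not something the scripts above define) and narrows the scene list accordingly, so a command like node render-scenes.js --scene intro would only touch that one scene.

```javascript
// Collect scene IDs passed as `--scene <id>` (the flag may be repeated).
function requestedSceneIds(argv) {
  const ids = [];
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === '--scene' && argv[i + 1]) ids.push(argv[++i]);
  }
  return ids;
}

// Return only the requested scenes, or every scene if no flag was given.
function selectScenes(script, argv) {
  const ids = requestedSceneIds(argv);
  if (ids.length === 0) return script;
  const unknown = ids.filter((id) => !script.some((s) => s.scene_id === id));
  if (unknown.length > 0) {
    throw new Error(`Unknown scene id(s): ${unknown.join(', ')}`);
  }
  return script.filter((s) => ids.includes(s.scene_id));
}

// In a script: const scenes = selectScenes(script, process.argv.slice(2));
```

Failing loudly on an unknown ID catches typos before they turn into a silent no-op.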

How MindStudio Fits Into This Workflow

The pipeline above works well as a one-time build, but what if you want to run it repeatedly — generating videos on a schedule, triggered by a form submission, or as part of a larger content operation?

That’s where MindStudio’s AI Media Workbench comes in. MindStudio provides a visual workflow builder that can wrap the same steps you built manually into a repeatable, automated agent. Instead of running scripts from your terminal, you configure a workflow that:

Accepts a topic or script as input (via form, API, or webhook)
Calls Claude to generate the structured JSON script
Generates HTML scenes, renders them, and synthesizes audio using connected services
Outputs the finished video — or sends it to a Slack channel, uploads it to Google Drive, or posts it to a CMS

MindStudio has native support for ElevenLabs voice generation and access to 200+ AI models, so you don’t need to manage API keys or stitch together separate connections. You can also add a human review step between script generation and rendering, which is useful if you’re producing branded content.

For Claude Code users who want to extend their agent with MindStudio capabilities, the Agent Skills Plugin (@mindstudio-ai/agent) exposes methods like agent.generateVideo() and agent.runWorkflow() as typed function calls. That means you can keep your Claude Code setup and add MindStudio’s media tools as a skill layer — no full migration required.

Try MindStudio free at mindstudio.ai to see how the workflow builder handles the orchestration layer.

Common Mistakes to Avoid

Over-Complicated HTML Scenes

Claude Code sometimes generates animations that are visually impressive but slow to render. If HyperFrames takes more than 10 seconds per frame, simplify the scene. Avoid heavy particle systems, large unoptimized SVGs, or DOM-heavy animations.

Mismatched Audio and Video Length

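ElevenLabs doesn’t guarantee that a given narration will fit exactly within the scene’s duration. Build in a 10–15% buffer on the duration field, and use FFmpeg’s -shortest flag when merging so each clip ends with the shorter stream instead of trailing on in silence. Keep in mind that -shortest also cuts off narration that runs longer than the video, which is why the buffer matters.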
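The buffer arithmetic is simple enough to automate once you know how long each audio file actually is. The sketch below is one possible way to do it; both helper names are invented, and the rounding to one decimal place is just a convenience.

```javascript
// Suggested scene duration for a narration clip, with a safety buffer.
// buffer is a fraction: 0.1 means 10%. Result is rounded to 0.1s.
function recommendedDuration(audioSeconds, buffer = 0.1) {
  return Math.round(audioSeconds * (1 + buffer) * 10) / 10;
}

// True if the narration finishes within the scene's video.
function narrationFits(audioSeconds, sceneDuration) {
  return audioSeconds <= sceneDuration;
}

console.log(recommendedDuration(10, 0.1));    // 11
console.log(recommendedDuration(10.4, 0.15)); // 12
console.log(narrationFits(12.6, 12));         // false
```

To measure a clip, ffprobe (which ships with FFmpeg) can print the duration in seconds: ffprobe -v error -show_entries format=duration -of csv=p=0 audio/intro.mp3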

Not Validating JSON Output

Claude Code occasionally returns JSON with trailing commas or comments, which will break require('./script.json'). Always run your JSON through a validator (or use JSON5 if you want to allow comments) before feeding it to the pipeline.
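A lightweight check at the start of each script can catch these problems early. The sketch below parses the file contents with JSON.parse and verifies the four fields from Step 1; parseScript is a hypothetical helper, and the rules it enforces (non-empty array, positive duration, unique scene_id) are assumptions about what the pipeline needs.

```javascript
// Expected type of each field in a scene object.
const REQUIRED_FIELDS = {
  scene_id: 'string',
  duration: 'number',
  narration: 'string',
  visual_description: 'string',
};

// Parse and validate the text of script.json; throws with a readable message.
function parseScript(text) {
  let data;
  try {
    data = JSON.parse(text); // rejects trailing commas and comments
  } catch (err) {
    throw new Error(`script.json is not valid JSON: ${err.message}`);
  }
  if (!Array.isArray(data) || data.length === 0) {
    throw new Error('script.json must be a non-empty array of scenes');
  }
  const seen = new Set();
  data.forEach((scene, i) => {
    if (scene === null || typeof scene !== 'object') {
      throw new Error(`Scene ${i} must be an object`);
    }
    for (const [field, type] of Object.entries(REQUIRED_FIELDS)) {
      if (typeof scene[field] !== type) {
        throw new Error(`Scene ${i}: "${field}" must be a ${type}`);
      }
    }
    if (!(scene.duration > 0)) {
      throw new Error(`Scene ${i}: "duration" must be greater than 0`);
    }
    if (seen.has(scene.scene_id)) {
      throw new Error(`Duplicate scene_id: ${scene.scene_id}`);
    }
    seen.add(scene.scene_id);
  });
  return data;
}

// In a script: const script = parseScript(require('fs').readFileSync('script.json', 'utf8'));
```

Checking for duplicate IDs is worth the extra lines, since two scenes with the same scene_id would silently overwrite each other's files.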

Skipping the Preview Step

Always preview each .html file in a browser before running HyperFrames on it. A broken animation renders as blank frames, and you won’t catch the issue until the video is assembled.

Frequently Asked Questions

What is HyperFrames and how does it work?

HyperFrames is a Node.js library for programmatic video rendering from HTML and CSS content. It launches a headless Chromium browser, opens your HTML file, and captures screenshots at a fixed frame rate (typically 24–60fps). Those screenshots are compiled into a video file using FFmpeg internally. It’s particularly useful for data visualizations, animated infographics, and UI recordings that are easier to build as web content than with traditional video editing tools.

Can I use Claude Code for video generation without HyperFrames?

Yes, but you’ll need a different rendering method. Alternatives include Puppeteer with manual frame capture, Remotion (a React-based video framework), or pre-built video APIs. HyperFrames simplifies the headless browser setup, but the core idea — render HTML scenes into video — can be implemented several ways. Claude Code can write the integration code for any of these.

How much does this workflow cost to run?

Costs depend on video length and the ElevenLabs plan you use. A 60-second video with 6 scenes requires roughly 6 Claude API calls for HTML generation (moderate cost), 6 ElevenLabs API calls for audio (low cost on most plans), and local compute for HyperFrames rendering (free, but time-intensive). For most short videos, the total API cost is under $1.

Is the video quality good enough for professional use?

That depends on the HTML animations Claude Code generates. For data-driven explainers, product walkthroughs, or educational content, the output is often publication-ready. For cinematic or live-action footage, this pipeline isn’t the right tool — look at dedicated AI video generation models like Sora or Veo for that use case.

Can I add music or sound effects to the final video?

Yes. FFmpeg supports mixing multiple audio tracks. Add a background music track to your assemble.sh script using FFmpeg’s -filter_complex amix option. Claude Code can write this for you if you describe what you want — for example, “add background.mp3 at 20% volume mixed under the voiceover.”
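As a rough idea of what that command involves, the sketch below builds the FFmpeg arguments for mixing a music track under an assembled video. The helper name and file names are placeholders. It lowers the music with the volume filter, then combines both tracks with amix, using duration=first so the output ends when the video's own audio does.

```javascript
// Build FFmpeg arguments that mix background music under a video's voiceover.
// musicVolume is a linear gain: 0.2 is 20% of the original level.
function musicMixArgs(videoIn, musicIn, output, musicVolume = 0.2) {
  const filter =
    `[1:a]volume=${musicVolume}[bg];` +            // quiet the music track
    '[0:a][bg]amix=inputs=2:duration=first[mix]';  // mix it under the voiceover
  return [
    '-i', videoIn,
    '-i', musicIn,
    '-filter_complex', filter,
    '-map', '0:v',     // keep the original video stream
    '-map', '[mix]',   // use the mixed audio
    '-c:v', 'copy',
    '-c:a', 'aac',
    output,
  ];
}

// Example usage (requires FFmpeg on your PATH):
//   const { execFileSync } = require('child_process');
//   execFileSync('ffmpeg', musicMixArgs('final_output.mp4', 'background.mp3', 'final_with_music.mp4'));
```

Depending on the FFmpeg version, amix may also scale the input levels when combining them, so check the final balance by ear and adjust the volume value if the voiceover sounds too quiet.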

How do I make the video longer or add more scenes?

Edit script.json to add new scene objects, then re-run the full pipeline. Because each step reads from the same JSON file, adding scenes is additive — existing rendered files won’t be overwritten unless you delete them. For very long videos (10+ minutes), consider batching scene generation to stay within API context limits.
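Whether existing files are skipped depends on how your scripts are written, so one option is to make that behavior explicit. The sketch below filters out scenes that already have a rendered clip and splits the remainder into batches; both helpers are hypothetical, and the batch size is something to tune for your own setup.

```javascript
const fs = require('fs');

// Scenes that do not yet have a rendered clip on disk.
// The exists parameter is injectable so the logic can be tested without files.
function pendingScenes(script, exists = fs.existsSync) {
  return script.filter((s) => !exists(`rendered/${s.scene_id}.mp4`));
}

// Split a list into consecutive batches of at most `size` items.
function batches(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// In a script: for (const group of batches(pendingScenes(script), 5)) { ... }
```

To force a scene to be regenerated under this scheme, delete its file in rendered/ before re-running.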

Key Takeaways

Claude Code is the orchestration layer — it writes the scripts, generates scene HTML, produces the assembly commands, and can debug the pipeline when things break.
HyperFrames converts web animations to video — any CSS, SVG, or canvas animation you can build in a browser can become a video clip.
ElevenLabs adds production-quality voiceover via a simple API call per scene.
FFmpeg handles the merge and concatenation — it’s the glue that turns individual clips into a finished video.
The whole pipeline is scriptable and repeatable — once it works, you can run it on any topic by editing a single JSON file.

For teams that want to run this workflow at scale or trigger it automatically, MindStudio’s AI Media Workbench wraps these capabilities into a visual workflow builder — no terminal required, and no API keys to manage separately.