How to Generate AI Videos with Claude Code, HyperFrames, and ElevenLabs: Full Workflow

Generate complete YouTube Shorts end-to-end with AI: scripting, audio, animation, and sync. Here's the full workflow using Claude Code and HyperFrames.

MindStudio Team

The Complete AI Video Production Workflow

Producing a YouTube Short used to mean juggling a script, a microphone, video editing software, and hours of your time. Now, the same output — a polished 60-second video with narration, animation, and synced audio — can be generated almost entirely by AI, with a human in the loop only to review and approve.

This guide walks through a full end-to-end workflow for generating AI videos using Claude Code for orchestration and scripting, ElevenLabs for voice synthesis, and HyperFrames for animated visuals. The result: a repeatable pipeline for creating YouTube Shorts (or similar short-form content) at scale without a production team.

Whether you’re automating content for a channel, building AI video tools for clients, or just exploring what’s possible, this workflow is a solid starting point.


What Each Tool Does in This Stack

Before getting into the steps, it helps to understand why these three tools work well together. Each one handles a distinct layer of the production process.

Claude Code

Claude Code is Anthropic’s agentic coding environment. It runs in your terminal and can write code, execute scripts, manage files, call APIs, and chain multi-step tasks together — all from natural language instructions.

In this workflow, Claude Code serves as the brain. It writes the video script, structures the narration into timed segments, generates prompts for the animation layer, and calls the ElevenLabs and HyperFrames APIs to trigger production. Think of it as the director coordinating the whole production pipeline.

ElevenLabs

ElevenLabs is an AI voice synthesis platform. It converts text to realistic audio using a library of pre-built voices — or custom cloned voices — with control over pacing, tone, and emotion. It offers an API that makes it easy to feed a script and get an audio file back in seconds.

In this workflow, ElevenLabs generates the narration track from the script Claude Code produces.

HyperFrames

HyperFrames is an AI animation and frame generation tool designed for producing video sequences from structured prompts. It takes scene descriptions and timing data, generates individual frames using image models, and compiles them into a coherent animated video. It’s especially well-suited for short-form content where you want stylized visuals without needing actual footage.

In this workflow, HyperFrames receives per-scene prompts from Claude Code and produces the visual layer that gets synced to the ElevenLabs audio.


Prerequisites and Setup

Before running anything, you’ll need accounts and API access for all three tools.

What you need:

  • Claude Code installed (available via Anthropic’s developer tools)
  • An ElevenLabs account with API key (free tier works for prototyping)
  • HyperFrames account and API access
  • FFmpeg installed locally (for final audio/video merge)
  • Python 3.10+ or Node.js depending on your scripting preference
  • A working directory where your project files will live

Set up a .env file in your project directory with your API keys:

ELEVENLABS_API_KEY=your_key_here
HYPERFRAMES_API_KEY=your_key_here

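Keep this file out of version control. The scripts below read the keys with os.getenv, so nothing needs to be hardcoded, but the values in .env must be loaded into the environment first (for example by exporting them in your shell or by using a loader such as python-dotenv).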
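If you would rather not add a dependency, a small loader is enough for a file like the one above. The following is a minimal sketch, not a full .env parser: the function name load_env_file is hypothetical, and it assumes simple KEY=value lines with optional quotes.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines and copy them into os.environ.

    Variables that are already set in the environment are left untouched.
    Returns the key/value pairs found in the file.
    """
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded
```

Calling load_env_file() at the top of each pipeline script makes os.getenv("ELEVENLABS_API_KEY") work without exporting anything by hand.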


Step 1: Script Generation with Claude Code

The first step is producing a script — the backbone of the entire video.

Open your terminal and launch Claude Code in your project directory. Give it a clear instruction about what you want:

Write a 60-second YouTube Shorts script about [your topic]. 
Structure it as JSON with the following fields per segment: 
id, narration_text, duration_seconds, visual_description, visual_style.

Claude Code will generate a structured JSON script that looks something like this:

[
  {
    "id": 1,
    "narration_text": "Most people waste two hours a day on tasks they could automate in five minutes.",
    "duration_seconds": 4,
    "visual_description": "A person staring at a screen surrounded by stacks of paper",
    "visual_style": "flat illustration, muted tones, motion blur"
  },
  {
    "id": 2,
    "narration_text": "Here's the difference between working hard and working smart.",
    "duration_seconds": 3,
    "visual_description": "Split screen: one person running in circles, one person sitting calmly with a laptop",
    "visual_style": "flat illustration, bold colors, clean lines"
  }
]

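The visual_description and visual_style fields will be used as prompts for HyperFrames. The narration_text will go to ElevenLabs. The duration_seconds fields let you calculate total runtime and sync timing later. Treat them as targets: the real length of each narration clip is only known once ElevenLabs has generated the audio.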
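Because every later step reads these fields, it can help to validate script.json before spending API credits. The sketch below checks the five fields named above and sums the runtime. The function name validate_script and the 60-second default are assumptions for illustration.

```python
REQUIRED_FIELDS = (
    "id",
    "narration_text",
    "duration_seconds",
    "visual_description",
    "visual_style",
)


def validate_script(segments: list[dict], max_seconds: float = 60.0) -> float:
    """Check each segment for the expected fields and return the total runtime."""
    if not segments:
        raise ValueError("script has no segments")
    seen_ids = set()
    total = 0.0
    for segment in segments:
        missing = [name for name in REQUIRED_FIELDS if name not in segment]
        if missing:
            raise ValueError(f"segment {segment.get('id')} is missing {missing}")
        if segment["id"] in seen_ids:
            raise ValueError(f"duplicate segment id {segment['id']}")
        seen_ids.add(segment["id"])
        if segment["duration_seconds"] <= 0:
            raise ValueError(f"segment {segment['id']} has a non-positive duration")
        total += segment["duration_seconds"]
    if total > max_seconds:
        raise ValueError(f"total runtime {total}s exceeds {max_seconds}s")
    return total
```

For the two sample segments shown above, the function returns 7.0 (4 seconds plus 3 seconds).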

Tips for Better Scripts

  • Be specific about topic, tone, and target audience in your prompt
  • Request a hook in the first 3 seconds — short-form content lives or dies by the first impression
  • Ask Claude Code to keep each narration segment to one or two sentences maximum
  • Specify a visual style that’s consistent across all segments (e.g., “flat 2D illustration, muted color palette”) so the final video feels cohesive

If the output doesn’t match what you want, iterate. Ask Claude Code to revise specific segments, tighten the pacing, or adjust the visual descriptions. This is fast — a revision cycle takes under a minute.
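One quick way to spot pacing problems before generating any audio is to estimate how long each narration line takes to read. This is a rough sketch only: the helper names are hypothetical, and the speaking rate of 2.5 words per second (about 150 words per minute) is an assumption that varies by voice and settings.

```python
def estimate_seconds(text: str, words_per_second: float = 2.5) -> float:
    """Estimate spoken length from word count (assumed speaking rate)."""
    return len(text.split()) / words_per_second


def tight_segments(segments: list[dict], words_per_second: float = 2.5) -> list:
    """Return ids of segments whose narration likely overruns duration_seconds."""
    return [
        segment["id"]
        for segment in segments
        if estimate_seconds(segment["narration_text"], words_per_second)
        > segment["duration_seconds"]
    ]
```

Under that assumed rate, the first sample segment above (15 words, 4 seconds) would be flagged, since 15 words take roughly 6 seconds to read. Any flagged id is a candidate for a shorter line or a longer duration_seconds value.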


Step 2: Audio Generation with ElevenLabs

Once your script JSON is saved (e.g., as script.json), the next step is generating the narration audio.

Ask Claude Code to write a script that:

  1. Reads script.json
  2. Loops through each segment
  3. Calls the ElevenLabs API with the narration_text for each segment
  4. Saves each audio file as audio_1.mp3, audio_2.mp3, etc.

Claude Code will produce something like this Python script and can execute it directly:

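import json
import os
import requests

# Read the key from the environment and fail early if it is missing.
API_KEY = os.getenv("ELEVENLABS_API_KEY")
if not API_KEY:
    raise SystemExit("ELEVENLABS_API_KEY is not set")
VOICE_ID = "your_chosen_voice_id"  # copy this from the ElevenLabs dashboard

with open("script.json") as f:
    segments = json.load(f)

for segment in segments:
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
        json={
            "text": segment["narration_text"],
            # Model IDs change over time; confirm this one in the ElevenLabs docs.
            "model_id": "eleven_monolingual_v1",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.75}
        },
        timeout=60,
    )
    # Stop on HTTP errors so an error message is never saved as an .mp3 file.
    response.raise_for_status()
    with open(f"audio_{segment['id']}.mp3", "wb") as audio_file:
        audio_file.write(response.content)
    print(f"Generated audio for segment {segment['id']}")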

Run this and you’ll have individual audio files for each segment in your project directory.

Choosing a Voice

ElevenLabs has dozens of pre-built voices and lets you browse them in the dashboard. For YouTube Shorts, voices that sound conversational and slightly upbeat tend to perform better than formal or flat ones. You can also clone a voice — useful if you’re building content for a brand with an established presenter.

Once you’ve picked a voice, grab its voice ID from the ElevenLabs dashboard and plug it into the script.


Step 3: Visual Generation with HyperFrames

With the audio files ready, it’s time to generate the visuals.

HyperFrames takes a prompt per scene along with timing data and produces a sequence of frames or a short animated clip. You’ll send it the visual_description and visual_style fields from your script JSON, plus the duration for each segment.

Ask Claude Code to generate and run a script that calls the HyperFrames API for each segment:

import json
import os
import requests

API_KEY = os.getenv("HYPERFRAMES_API_KEY")

with open("script.json") as f:
    segments = json.load(f)

for segment in segments:
    # Combine the scene description and style into a single prompt.
    prompt = f"{segment['visual_description']}. Style: {segment['visual_style']}"

    # Endpoint and field names are illustrative; match them to the current API docs.
    response = requests.post(
        "https://api.hyperframes.ai/v1/generate",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "prompt": prompt,
            "duration": segment["duration_seconds"],
            "format": "mp4",
            "resolution": "1080x1920"  # Vertical for YouTube Shorts
        },
        timeout=300,
    )
    # Stop on HTTP errors instead of failing later on a missing "clip_url".
    response.raise_for_status()

    data = response.json()
    # Download the generated clip
    clip_url = data["clip_url"]
    clip_response = requests.get(clip_url, timeout=300)
    clip_response.raise_for_status()

    with open(f"visual_{segment['id']}.mp4", "wb") as video_file:
        video_file.write(clip_response.content)

    print(f"Generated visual for segment {segment['id']}")

Note: The exact API schema will depend on the current HyperFrames API version. Claude Code can look up the docs and adjust the call accordingly — just tell it to check the HyperFrames API documentation before writing the script.

Getting Consistent Visuals

One challenge with multi-segment video is visual consistency. If each segment looks completely different, the video feels disjointed.

A few ways to address this:

  • Use the same visual_style string across all segments
  • Add a “seed” parameter if HyperFrames supports it (fixes certain random elements)
  • Use a consistent color palette in your prompts (e.g., “blue and orange color scheme throughout”)
  • Ask Claude Code to prepend a consistent style prefix to every prompt before sending to HyperFrames
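The style-prefix idea from the list above can be sketched in a few lines. The function name build_prompts and the example prefix are hypothetical. The prompt layout follows the "description. Style: style" format used in the generation script.

```python
def build_prompts(segments: list[dict], style_prefix: str) -> dict:
    """Map each segment id to a prompt that starts with a shared style prefix."""
    prompts = {}
    for segment in segments:
        # Drop a trailing period so the joined prompt has no doubled punctuation.
        description = segment["visual_description"].rstrip(". ")
        prompts[segment["id"]] = (
            f"{style_prefix}. {description}. Style: {segment['visual_style']}"
        )
    return prompts
```

For example, build_prompts(segments, "Blue and orange color scheme throughout") gives every scene the same opening phrase, so only the scene-specific description changes between calls.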

Step 4: Sync Audio and Video

Now you have individual audio clips (audio_1.mp3, audio_2.mp3, etc.) and visual clips (visual_1.mp4, visual_2.mp4, etc.). The next step is merging audio and video for each segment, then concatenating everything into the final video.

FFmpeg handles this cleanly. Ask Claude Code to write an FFmpeg pipeline:

import json
import subprocess

with open("script.json") as f:
    segments = json.load(f)

merged_clips = []

# Step 1: Merge audio and video for each segment
for segment in segments:
    seg_id = segment["id"]
    output = f"merged_{seg_id}.mp4"

    subprocess.run([
        "ffmpeg", "-y",
        "-i", f"visual_{seg_id}.mp4",
        "-i", f"audio_{seg_id}.mp3",
        "-c:v", "copy",   # keep the video stream as-is
        "-c:a", "aac",    # re-encode the MP3 narration to AAC for the MP4 container
        "-shortest",      # end the output when the shorter of the two streams ends
        output
    ], check=True)        # raise an error if FFmpeg fails

    merged_clips.append(output)
    print(f"Merged segment {seg_id}")

# Step 2: Concatenate all merged clips
with open("concat_list.txt", "w") as f:
    for clip in merged_clips:
        f.write(f"file '{clip}'\n")

# Stream copy only works when every clip shares the same codec settings.
subprocess.run([
    "ffmpeg", "-y",
    "-f", "concat",
    "-safe", "0",
    "-i", "concat_list.txt",
    "-c", "copy",
    "final_video.mp4"
], check=True)

print("Final video created: final_video.mp4")

Run this script and it will produce final_video.mp4, a complete short-form video with narration and animation, ready to review.
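One thing to watch with the -shortest flag: FFmpeg ends each merged clip when the shorter stream ends, so narration that runs longer than its visual clip gets cut off. A minimal sketch of a pre-merge check follows. The helper names are hypothetical, and probe_duration assumes ffprobe (shipped with FFmpeg) is on your PATH.

```python
import subprocess


def probe_duration(path: str) -> float:
    """Return a media file's duration in seconds, as reported by ffprobe."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def clipped_segments(durations: dict, tolerance: float = 0.05) -> list:
    """List segment ids whose audio is longer than the matching video.

    `durations` maps segment id -> (video_seconds, audio_seconds).
    """
    return [
        seg_id
        for seg_id, (video_s, audio_s) in durations.items()
        if audio_s > video_s + tolerance
    ]
```

To use it, build the durations mapping with probe_duration(f"visual_{seg_id}.mp4") and probe_duration(f"audio_{seg_id}.mp3") for each segment. Any id returned by clipped_segments needs a longer visual clip or a shorter narration line.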


Step 5: Review, Refine, and Export

The pipeline produces a draft, not a finished product. Spend a few minutes reviewing final_video.mp4 and note anything that needs adjustment.

Common issues to check:

  • Audio timing: Does the narration feel rushed or too slow in any segment?
  • Visual match: Does the animation match what the narrator is saying?
  • Transitions: Do the cuts between segments feel abrupt?
  • Consistency: Does the overall visual style feel cohesive?

For audio timing issues, ask Claude Code to adjust the duration_seconds values in script.json and re-run the relevant steps.

For visual improvements, refine the visual_description prompts for specific segments and re-run the HyperFrames generation for those segments only. You don’t need to regenerate everything from scratch.
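One simple way to regenerate only some segments is to delete the clips you are unhappy with and have the generation script skip any segment whose output file already exists. The sketch below assumes the visual_N.mp4 naming used earlier. The function name pending_segments is hypothetical.

```python
from pathlib import Path


def pending_segments(
    segments: list[dict],
    directory: str = ".",
    pattern: str = "visual_{id}.mp4",
) -> list[dict]:
    """Return the segments whose output file does not exist yet."""
    base = Path(directory)
    return [
        segment
        for segment in segments
        if not (base / pattern.format(id=segment["id"])).exists()
    ]
```

Looping over pending_segments(segments) instead of segments turns the generation script into a resumable one. The same helper works for audio with pattern="audio_{id}.mp3".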

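For transitions, FFmpeg supports fade and crossfade effects between clips (for example, the xfade filter). These require re-encoding the video, so they cannot be combined with the stream-copy concat used above. Ask Claude Code to replace the concat step with a filter-based version.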
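If you use FFmpeg's xfade filter for crossfades, each transition needs an offset: the time, measured in the video built so far, at which the fade begins. Because every crossfade overlaps two clips, the running total shrinks by one fade length per transition. The sketch below only computes those offsets. The function name is hypothetical, and building the full filter graph (plus a matching audio crossfade) is left out.

```python
def xfade_offsets(durations: list[float], fade: float) -> list[float]:
    """Compute the xfade offset for each transition in a chain of clips.

    durations: clip lengths in seconds, in playback order.
    fade: length of each crossfade in seconds.
    """
    if any(fade >= duration for duration in durations):
        raise ValueError("fade must be shorter than every clip")
    offsets = []
    elapsed = 0.0
    # The k-th transition starts at (sum of the first k clips) - k * fade.
    for index, duration in enumerate(durations[:-1], start=1):
        elapsed += duration
        offsets.append(round(elapsed - index * fade, 3))
    return offsets
```

For clips of 4, 3, and 5 seconds with a 0.5-second fade, the offsets are 3.5 and 6.0, and the finished video runs 11 seconds instead of 12.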

Once you’re satisfied, the video is ready to upload.


Scaling This Workflow

Running this once for a single video is useful. Running it repeatedly — for a content calendar, multiple channels, or client deliverables — is where the real value appears.

A few ways to scale:

Batch scripting: Give Claude Code a list of topics (e.g., from a spreadsheet or text file) and have it generate scripts for all of them in one pass. Each topic produces a separate script.json.

Template-based generation: Define a reusable script template (intro hook, three-point body, call to action) and have Claude Code populate it with different content each time.

Automated triggers: Set up a cron job or webhook that runs the full pipeline on a schedule or in response to an event (e.g., a new topic added to a spreadsheet).

Voice variation: Use different ElevenLabs voices for different series or brands, all managed from the same pipeline with a variable swap.
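For batch runs, it helps to give every topic its own folder so that script.json, the audio files, and the clips from different videos never overwrite each other. This is a minimal sketch of that layout: the function names and the videos/ root folder are assumptions, not part of any tool's API.

```python
import re
from pathlib import Path


def slugify(topic: str) -> str:
    """Turn a topic into a lowercase, hyphen-separated folder name."""
    return re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")


def plan_batch(topics: list[str], root: str = "videos") -> list[dict]:
    """Assign each non-empty topic its own folder and script.json path."""
    plan = []
    for topic in topics:
        topic = topic.strip()
        if not topic:
            continue
        folder = Path(root) / slugify(topic)
        plan.append(
            {"topic": topic, "folder": folder, "script_path": folder / "script.json"}
        )
    return plan
```

A topics file with one topic per line can be fed in with plan_batch(Path("topics.txt").read_text().splitlines()). Each entry then tells Claude Code where to write that video's script and assets.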


How MindStudio Fits Into This Workflow

The pipeline above works well from a terminal. But if you want to share it with a team, trigger it without opening a terminal, or chain it into a larger content operation, that’s where MindStudio becomes relevant.

MindStudio’s AI Media Workbench gives you access to the major image and video generation models in one place, with 24+ built-in media tools for tasks like subtitle generation, clip merging, upscaling, and background removal. Instead of wiring up API calls manually, you can build the same audio/video pipeline as a no-code workflow that anyone on your team can run with a click.

More directly: MindStudio’s Agent Skills Plugin (@mindstudio-ai/agent) lets Claude Code call MindStudio’s capabilities as simple method calls. That means your Claude Code agent can call agent.generateVideo() or agent.runWorkflow() to trigger production steps that would otherwise require managing multiple APIs separately.

For teams producing content at volume — or developers building AI video tools for clients — this can significantly reduce the infrastructure overhead. You can try MindStudio free at mindstudio.ai.


Frequently Asked Questions

What is Claude Code and how does it differ from the Claude chatbot?

Claude Code is an agentic development tool that runs in your terminal. Unlike the Claude chat interface, it can write and execute code, read and modify files, and make API calls autonomously. It’s designed for multi-step, action-oriented tasks — making it well-suited for orchestrating a production pipeline like the one described here.

Do I need coding experience to use this workflow?

Some familiarity with running terminal commands and editing code helps, but Claude Code does most of the writing. The main requirements are setting up the tools, providing clear instructions, and reviewing the output. If you can follow a tutorial and run a script, you can manage this workflow.

How much does this cost to run?

Costs vary based on usage:

  • ElevenLabs: Free tier includes a limited number of characters per month. Paid plans start at around $5/month.
  • HyperFrames: Pricing depends on the number of generations and clip duration. Check their current pricing page for details.
  • Claude Code: Billed per token through Anthropic’s API. A typical script generation plus orchestration pass costs a few cents.
  • FFmpeg: Free and open source.

For a single 60-second YouTube Short, total API costs are typically under $1 at current pricing.
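To see how much of a character allowance one script would use, you can count the narration text before calling the API. This sketch assumes usage is metered roughly by character count, as the free-tier description above suggests. The exact accounting depends on your plan and model, and the function names are hypothetical.

```python
def narration_characters(segments: list[dict]) -> int:
    """Total number of characters that would be sent for synthesis."""
    return sum(len(segment["narration_text"]) for segment in segments)


def fits_quota(segments: list[dict], remaining_characters: int) -> bool:
    """Check whether the whole script fits in the remaining allowance."""
    return narration_characters(segments) <= remaining_characters
```

Running fits_quota(segments, remaining_characters) before the audio step avoids a half-generated narration track when the allowance runs out partway through.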

Can I use a custom voice with ElevenLabs?

Yes. ElevenLabs supports voice cloning from a short audio sample. Once a custom voice is created, you use its voice ID in the API call just like any built-in voice. This is useful for brand consistency or for creators who want their own voice in AI-generated content without recording each script manually.

How long does the full pipeline take to run?

For a 60-second video with 10–15 segments, expect:

  • Script generation: 30–60 seconds
  • Audio generation: 1–2 minutes (parallel calls speed this up)
  • Visual generation: 3–8 minutes (the most time-intensive step)
  • Final merge: under a minute

Total: roughly 5–12 minutes from prompt to finished video, depending on API response times and video complexity.
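The parallel calls mentioned above can be sketched with Python's standard thread pool. The worker is passed in as a function, so the same helper can wrap the audio request or the visual request. The name run_in_parallel and the default of four workers are assumptions, and the right concurrency depends on each API's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_in_parallel(
    segments: list[dict],
    worker: Callable[[dict], object],
    max_workers: int = 4,
) -> dict:
    """Run worker(segment) for every segment concurrently.

    Returns a dict mapping segment id -> worker result. Any exception
    raised inside a worker is re-raised here when results are collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input segments.
        results = list(pool.map(worker, segments))
    return {segment["id"]: result for segment, result in zip(segments, results)}
```

Threads suit this job because each worker spends most of its time waiting on a network response rather than using the CPU.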

Can this workflow produce videos longer than 60 seconds?

Yes. The pipeline is segment-based, so longer videos just mean more segments. Keep in mind that longer scripts mean more API calls and higher generation times. For content longer than 3–4 minutes, it’s worth building in a review checkpoint after scripting before triggering the full audio and visual generation.


Key Takeaways

  • Claude Code acts as the orchestrator — it writes the script, structures the data, and calls the APIs that produce audio and visuals.
  • ElevenLabs converts narration text to realistic audio using a simple REST API call per segment.
  • HyperFrames generates the animated visual layer from structured prompts derived from the script.
  • FFmpeg handles the final merge and concat, producing a single polished video file.
  • The workflow is repeatable and scalable — once the pipeline is set up, generating a new video means changing the input topic, not rebuilding from scratch.
  • MindStudio can extend the pipeline for teams that need a no-code interface, built-in media tools, or deeper workflow automation without managing multiple APIs.

If you’re looking to build a more complete AI content operation — beyond just video generation — MindStudio’s AI Media Workbench is worth exploring. It connects image, video, and audio generation with broader workflow automation in one place.
