How to Build an AI Video Generation Workflow with Claude Code and HyperFrames

What This Stack Actually Does (And Why It’s Interesting)

Building an AI video generation workflow that produces polished, multi-scene content with voiceover and transitions used to require a lot of manual stitching. You’d generate images, animate them separately, record or synthesize audio, then spend time in an editor combining everything.

This guide covers a different approach: using Claude Code, HyperFrames, ElevenLabs, and Archon together as a coordinated pipeline that can take a concept and produce a finished video automatically. The result is a repeatable, automatable process for producing video content with minimal manual steps.

This isn’t a no-code tutorial. It’s aimed at developers and technical builders who want to understand how these tools fit together, what each one does, and how to wire them into something that actually runs end to end.

Understanding the Stack Before You Build

Each tool in this pipeline has a specific role. Getting clear on that before writing any code saves a lot of debugging later.

Claude Code: Your Workflow Brain

Claude Code is Anthropic’s agentic coding environment. It runs in your terminal and can read files, write code, execute commands, and iterate on its own output. In this pipeline, Claude Code acts as the intelligent layer that:

Parses your creative brief and breaks it into structured scene data
Generates prompts for each visual frame based on context
Coordinates calls to other tools in the right sequence
Handles error recovery when individual steps fail

Think of Claude Code less as a code editor and more as a collaborative agent that understands your intent and can translate it into working pipeline logic.

HyperFrames: Visual Composition and Animation

HyperFrames handles the visual layer of the pipeline. It takes structured frame data — scene descriptions, camera motion parameters, transition types, duration — and uses AI image and video models to produce animated clips.

The key advantage of HyperFrames in this context is its frame-level API. Rather than just submitting a single text prompt for an entire video, you can define individual frames with precise parameters: what’s in the shot, how the camera moves, how it transitions to the next scene. This gives you much more control over pacing and visual continuity than prompt-to-video tools that treat the whole clip as one generation job.

HyperFrames outputs individual clips that can then be assembled into a final video.

ElevenLabs: Voiceover and Audio

ElevenLabs provides the audio layer — specifically, text-to-speech synthesis with high-quality voice models. In this pipeline, ElevenLabs:

Generates a narration track from a script Claude Code produces
Provides the narration whose length the pipeline reconciles with scene duration data
Outputs an audio file that gets merged with the visual output

ElevenLabs has an API that makes it straightforward to call programmatically with text, voice ID, and output format parameters.

Archon: Orchestration

Archon is an open-source meta-agent framework built on LangGraph. Its role here is orchestration — coordinating the execution order of the other components, managing state across the pipeline, and handling retries and handoffs between steps.

Rather than writing a linear script that calls each tool in sequence, Archon lets you define the pipeline as a graph of nodes with conditional edges. This means the workflow can branch (for example, regenerating a frame if quality checks fail), run steps in parallel (audio generation and visual generation can happen simultaneously), and pass structured state between components.
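
The branching described above can be sketched as a plain routing function. In LangGraph, a function of this shape is what gets passed to add_conditional_edges. The node name retry_visuals, the retry_count field, and the MAX_RETRIES limit are assumptions made for illustration and are not part of Archon or of the state schema used later in this guide.

```python
from typing import Any, Dict

MAX_RETRIES = 2  # hypothetical limit; tune it to your budget


def route_after_visuals(state: Dict[str, Any]) -> str:
    """Return the name of the next node based on which clips exist."""
    missing = [s["index"] for s in state["scenes"] if not s.get("clip_path")]
    if missing and state.get("retry_count", 0) < MAX_RETRIES:
        return "retry_visuals"  # branch back and regenerate the failed scenes
    return "assemble_video"     # continue, possibly with a partial video


example_state = {
    "scenes": [
        {"index": 0, "clip_path": "output/scene_000.mp4"},
        {"index": 1, "clip_path": None},
    ],
    "retry_count": 0,
}
print(route_after_visuals(example_state))
```

Keeping the decision in a small pure function makes the branch easy to test without running any generation jobs.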

Prerequisites

Before building, make sure you have the following:

Python 3.11+ installed
Node.js 18+ for any JavaScript components
Claude Code installed via npm (npm install -g @anthropic-ai/claude-code)
An Anthropic API key with access to Claude 3.5 Sonnet or higher
A HyperFrames account and API key
An ElevenLabs API key
Archon cloned from its GitHub repository and dependencies installed
Basic familiarity with LangGraph concepts (nodes, state, edges)

Set your API keys as environment variables before starting:

export ANTHROPIC_API_KEY=your_key_here
export HYPERFRAMES_API_KEY=your_key_here
export ELEVENLABS_API_KEY=your_key_here
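
A missing key usually surfaces as a confusing failure halfway through a run, so a preflight check can help. This is a minimal sketch; the function name missing_api_keys is hypothetical, and the variable names are the three exported above.

```python
import os
from typing import List, Mapping

REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "HYPERFRAMES_API_KEY", "ELEVENLABS_API_KEY")


def missing_api_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {"ANTHROPIC_API_KEY": "sk-example", "ELEVENLABS_API_KEY": ""}
print(missing_api_keys(example_env))  # → ['HYPERFRAMES_API_KEY', 'ELEVENLABS_API_KEY']
```

Calling missing_api_keys() with no argument checks the real environment; exiting early when the list is non-empty avoids paying for a script generation call that cannot be followed by audio or visuals.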

Step 1: Define Your Pipeline State Schema

Archon manages state across all nodes in the graph. The first step is defining what that state looks like — what information gets passed from one step to the next.

Create a file called state.py:

from typing import TypedDict, List, Optional

class Scene(TypedDict):
    index: int
    description: str
    duration_seconds: float
    transition: str  # "cut", "fade", "dissolve"
    camera_motion: str  # "static", "pan_left", "zoom_in", etc.
    visual_prompt: str
    clip_path: Optional[str]

class VideoWorkflowState(TypedDict):
    brief: str
    title: str
    script: str
    voice_id: str
    scenes: List[Scene]
    audio_path: Optional[str]
    final_video_path: Optional[str]
    errors: List[str]

This schema is the contract between every node in your graph. Claude Code will populate scenes early in the pipeline, and each subsequent node reads from and writes to this shared state object.
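
Because the scene data comes from a model, it can be worth checking each scene against the contract before spending money on generation. The sketch below is one way to do that. The allowed transitions come from the comment in state.py, while the camera motion set is an assumption based on the values listed in the Step 2 prompt (the schema comment ends in "etc.", so the real set may be wider). validate_scene is a hypothetical helper.

```python
from typing import Any, Dict, List

ALLOWED_TRANSITIONS = {"cut", "fade", "dissolve"}
# Assumed set; extend it if your visual backend accepts more motions
ALLOWED_CAMERA_MOTIONS = {"static", "pan_left", "pan_right", "zoom_in", "zoom_out"}
REQUIRED_FIELDS = ("index", "description", "duration_seconds",
                   "transition", "camera_motion", "visual_prompt")


def validate_scene(scene: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the scene is usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in scene]
    if problems:
        return problems  # skip value checks when fields are absent
    if scene["transition"] not in ALLOWED_TRANSITIONS:
        problems.append(f"unknown transition: {scene['transition']}")
    if scene["camera_motion"] not in ALLOWED_CAMERA_MOTIONS:
        problems.append(f"unknown camera_motion: {scene['camera_motion']}")
    duration = scene["duration_seconds"]
    if not isinstance(duration, (int, float)) or duration <= 0:
        problems.append("duration_seconds must be a positive number")
    return problems


good_scene = {
    "index": 0,
    "description": "A coral polyp settles on bare rock.",
    "duration_seconds": 8.0,
    "transition": "fade",
    "camera_motion": "zoom_in",
    "visual_prompt": "Macro shot of a single coral polyp on volcanic rock",
}
print(validate_scene(good_scene))  # → []
```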

Step 2: Build the Script and Scene Generation Node

This node is where Claude Code does its main work. Create nodes/generate_script.py:

import anthropic
import json
from state import VideoWorkflowState

client = anthropic.Anthropic()

def generate_script_node(state: VideoWorkflowState) -> VideoWorkflowState:
    prompt = f"""
    You are a video production assistant. Given the following brief, generate:
    1. A full narration script (150-300 words)
    2. A list of visual scenes as structured JSON

    Each scene should have:
    - index (int)
    - description (string, 1-2 sentences)
    - duration_seconds (float, based on script pacing)
    - transition (one of: cut, fade, dissolve)
    - camera_motion (one of: static, pan_left, pan_right, zoom_in, zoom_out)
    - visual_prompt (detailed image generation prompt)

    Brief: {state['brief']}

    Return your response as JSON with keys: "script", "title", "scenes"
    """

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}]
    )

    parsed = json.loads(response.content[0].text)
    
    return {
        **state,
        "script": parsed["script"],
        "title": parsed["title"],
        "scenes": parsed["scenes"]
    }

This node sends the creative brief to Claude and gets back a fully structured scene list with all the parameters HyperFrames needs.

Step 3: Generate Audio with ElevenLabs

With the script ready, audio generation can run in parallel with visual generation. Create nodes/generate_audio.py:

import requests
import os
from state import VideoWorkflowState

ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]

def generate_audio_node(state: VideoWorkflowState) -> VideoWorkflowState:
    voice_id = state.get("voice_id", "21m00Tcm4TlvDq8ikWAM")  # default voice
    
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    
    headers = {
        "Accept": "audio/mpeg",
        "Content-Type": "application/json",
        "xi-api-key": ELEVENLABS_API_KEY
    }
    
    data = {
        "text": state["script"],
        "model_id": "eleven_monolingual_v1",
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.75
        }
    }
    
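    # stream=True lets iter_content write the audio in chunks instead of buffering it all
    response = requests.post(url, json=data, headers=headers, stream=True, timeout=120)
    response.raise_for_status()  # fail loudly instead of saving an error body as .mp3
    
    audio_path = f"output/{state['title'].replace(' ', '_')}_narration.mp3"
    os.makedirs("output", exist_ok=True)
    
    with open(audio_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
    
    # Return only the key this node changes: it runs in parallel with generate_visuals,
    # and LangGraph rejects two nodes writing the same key in one step
    return {"audio_path": audio_path}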

ElevenLabs handles the synthesis and returns an audio file. The path gets stored in state so the final assembly node can find it.
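
The node above builds the file name from the title by replacing spaces only. Since the title is model-generated, it may contain colons, slashes, or apostrophes that are awkward in paths. A small helper such as the hypothetical safe_filename below is one way to normalize it; this is a sketch, not part of the original pipeline.

```python
import re


def safe_filename(title: str, fallback: str = "untitled") -> str:
    """Reduce a title to letters, digits, and underscores for use in a file name."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return slug or fallback


print(safe_filename("How Coral Reefs Form: A Beginner's Guide"))
# → How_Coral_Reefs_Form_A_Beginner_s_Guide
```

Using the same helper in every node keeps the audio, clip, and final output paths consistent with each other.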

Step 4: Generate Video Clips with HyperFrames

This node iterates through the scenes list and generates an animated clip for each one. Create nodes/generate_visuals.py:

import requests
import os
import time
from state import VideoWorkflowState, Scene

HYPERFRAMES_API_KEY = os.environ["HYPERFRAMES_API_KEY"]
HYPERFRAMES_BASE_URL = "https://api.hyperframes.ai/v1"

def generate_clip_for_scene(scene: Scene, output_dir: str) -> str:
    headers = {
        "Authorization": f"Bearer {HYPERFRAMES_API_KEY}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "prompt": scene["visual_prompt"],
        "duration": scene["duration_seconds"],
        "camera_motion": scene["camera_motion"],
        "transition_out": scene["transition"],
        "output_format": "mp4"
    }
    
    response = requests.post(
        f"{HYPERFRAMES_BASE_URL}/generate",
        json=payload,
        headers=headers,
        timeout=30
    )
    response.raise_for_status()  # surface auth or validation errors before reading job_id
    
    job_id = response.json()["job_id"]
    
    # Poll for completion
    while True:
        status_response = requests.get(
            f"{HYPERFRAMES_BASE_URL}/jobs/{job_id}",
            headers=headers
        )
        status_data = status_response.json()
        
        if status_data["status"] == "completed":
            clip_url = status_data["output_url"]
            break
        elif status_data["status"] == "failed":
            raise Exception(f"HyperFrames job failed: {status_data.get('error')}")
        
        time.sleep(3)
    
    # Download the clip
    clip_path = f"{output_dir}/scene_{scene['index']:03d}.mp4"
    clip_data = requests.get(clip_url, timeout=120)
    clip_data.raise_for_status()  # avoid writing an error page to disk as an .mp4
    with open(clip_path, "wb") as f:
        f.write(clip_data.content)
    
    return clip_path

def generate_visuals_node(state: VideoWorkflowState) -> VideoWorkflowState:
    output_dir = f"output/{state['title'].replace(' ', '_')}_clips"
    os.makedirs(output_dir, exist_ok=True)
    
    updated_scenes = []
    errors = list(state.get("errors", []))
    
    for scene in state["scenes"]:
        try:
            clip_path = generate_clip_for_scene(scene, output_dir)
            updated_scenes.append({**scene, "clip_path": clip_path})
        except Exception as e:
            errors.append(f"Scene {scene['index']} failed: {str(e)}")
            updated_scenes.append(scene)
    
    # Return only the keys this node changes so the parallel audio node's update does not conflict
    return {"scenes": updated_scenes, "errors": errors}

Each scene runs as a separate HyperFrames job. The polling loop waits for completion before moving on to the next clip.

Step 5: Assemble the Final Video

Once all clips are generated and audio is ready, a final assembly node combines everything using ffmpeg. Create nodes/assemble_video.py:

import subprocess
import os
from state import VideoWorkflowState

def assemble_video_node(state: VideoWorkflowState) -> VideoWorkflowState:
    clips = [s["clip_path"] for s in state["scenes"] if s.get("clip_path")]
    
    if not clips:
        return {**state, "errors": state["errors"] + ["No clips to assemble"]}
    
    # Create concat file for ffmpeg
    concat_path = "output/concat_list.txt"
    with open(concat_path, "w") as f:
        for clip in clips:
            f.write(f"file '{os.path.abspath(clip)}'\n")
    
    merged_video_path = f"output/{state['title'].replace(' ', '_')}_video_only.mp4"
    
    # Concatenate clips (-y overwrites output from an earlier run instead of prompting)
    subprocess.run([
        "ffmpeg", "-y", "-f", "concat", "-safe", "0",
        "-i", concat_path,
        "-c", "copy",
        merged_video_path
    ], check=True)
    
    # Merge with audio
    final_output = f"output/{state['title'].replace(' ', '_')}_final.mp4"
    
    # -shortest stops at the shorter stream; -y overwrites an existing final file
    subprocess.run([
        "ffmpeg", "-y",
        "-i", merged_video_path,
        "-i", state["audio_path"],
        "-c:v", "copy",
        "-c:a", "aac",
        "-shortest",
        final_output
    ], check=True)
    
    return {**state, "final_video_path": final_output}

Step 6: Wire Everything Together with Archon

Now the pieces need to connect. Create workflow.py to define the Archon graph:

from langgraph.graph import StateGraph, END
from state import VideoWorkflowState
from nodes.generate_script import generate_script_node
from nodes.generate_audio import generate_audio_node
from nodes.generate_visuals import generate_visuals_node
from nodes.assemble_video import assemble_video_node

def build_video_workflow():
    graph = StateGraph(VideoWorkflowState)
    
    # Add nodes
    graph.add_node("generate_script", generate_script_node)
    graph.add_node("generate_audio", generate_audio_node)
    graph.add_node("generate_visuals", generate_visuals_node)
    graph.add_node("assemble_video", assemble_video_node)
    
    # Define edges
    graph.set_entry_point("generate_script")
    graph.add_edge("generate_script", "generate_audio")
    graph.add_edge("generate_script", "generate_visuals")
    graph.add_edge("generate_audio", "assemble_video")
    graph.add_edge("generate_visuals", "assemble_video")
    graph.add_edge("assemble_video", END)
    
    return graph.compile()

if __name__ == "__main__":
    workflow = build_video_workflow()
    
    initial_state = {
        "brief": "A 60-second explainer about how coral reefs are formed, aimed at middle school students.",
        "voice_id": "21m00Tcm4TlvDq8ikWAM",
        "scenes": [],
        "errors": []
    }
    
    result = workflow.invoke(initial_state)
    print(f"Final video: {result['final_video_path']}")
    if result["errors"]:
        print(f"Errors encountered: {result['errors']}")

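Note that generate_audio and generate_visuals both start after generate_script completes. Archon’s LangGraph foundation schedules both nodes in the same step and runs them concurrently, and assemble_video only triggers once both finish. For this to work, the two parallel nodes must not write the same state key in that step: LangGraph raises an error on conflicting updates unless the key declares a reducer, so each parallel node should return only the keys it changes.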
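
If two parallel nodes both need to append to a shared key such as errors, LangGraph lets the state declare a reducer for that key with typing.Annotated. The sketch below shows such a declaration and imitates the merge in plain Python to illustrate the semantics. merge_updates and ParallelSafeState are hypothetical stand-ins written for this example, not LangGraph's implementation.

```python
import operator
from typing import Annotated, Any, Dict, List, Optional, TypedDict, get_type_hints


class ParallelSafeState(TypedDict, total=False):
    audio_path: Optional[str]
    # The reducer (operator.add) says: concatenate lists written by different nodes
    errors: Annotated[List[str], operator.add]


def merge_updates(state: Dict[str, Any], updates: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Imitate how partial updates from one parallel step are combined."""
    hints = get_type_hints(ParallelSafeState, include_extras=True)
    merged = dict(state)
    written = set()
    for update in updates:
        for key, value in update.items():
            # Annotated types expose their extra arguments as __metadata__
            reducer = getattr(hints.get(key), "__metadata__", (None,))[0]
            if reducer is not None:
                merged[key] = reducer(merged.get(key, []), value)
            elif key in written:
                raise ValueError(f"two nodes wrote '{key}' in the same step")
            else:
                merged[key] = value
                written.add(key)
    return merged


audio_update = {"audio_path": "output/narration.mp3"}
visuals_update = {"errors": ["Scene 2 failed: timeout"]}
print(merge_updates({"errors": []}, [audio_update, visuals_update]))
```

With a reducer in place, a node should return only its new errors rather than the full accumulated list, otherwise earlier entries are appended twice.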

Common Problems and How to Fix Them

Audio and video lengths don’t match

The most common issue. ElevenLabs paces speech based on the text, so the audio might run longer or shorter than the combined clip duration. Fix this upstream by comparing the total scene duration in your state with the measured length of the narration, then adjusting one side: tighten or extend the script, or rescale the scene durations before clips are generated. The -shortest flag in ffmpeg trims the output to the shorter of the two streams, which works as a safety net, but it is better to match them upstream.
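
One way to match lengths is to scale every scene proportionally to the narration. The sketch below assumes the narration length (audio_seconds) has already been measured elsewhere, for example with ffprobe; both function names are hypothetical.

```python
from typing import Any, Dict, List


def total_scene_duration(scenes: List[Dict[str, Any]]) -> float:
    """Sum the planned duration of all scenes in seconds."""
    return sum(float(s["duration_seconds"]) for s in scenes)


def rescale_scene_durations(scenes: List[Dict[str, Any]], audio_seconds: float) -> List[Dict[str, Any]]:
    """Return copies of the scenes scaled so their total matches the narration length."""
    total = total_scene_duration(scenes)
    if total <= 0 or audio_seconds <= 0:
        raise ValueError("durations must be positive")
    factor = audio_seconds / total
    return [{**s, "duration_seconds": round(s["duration_seconds"] * factor, 2)} for s in scenes]


planned = [
    {"index": 0, "duration_seconds": 10.0},
    {"index": 1, "duration_seconds": 20.0},
    {"index": 2, "duration_seconds": 30.0},
]
adjusted = rescale_scene_durations(planned, audio_seconds=66.0)
print([s["duration_seconds"] for s in adjusted])  # → [11.0, 22.0, 33.0]
```

Rescaling only helps if the narration length is known before clips are generated, which means running the audio node ahead of the visuals node for that run rather than in parallel.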

HyperFrames jobs time out

Network timeouts happen with long generation jobs. Add exponential backoff to your polling loop and set a max retry count. If a scene fails after retries, log the error and continue — a partial video with a placeholder for the failed scene is better than a crashed pipeline.
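
A bounded polling loop with exponential backoff might look like the sketch below. The status strings "completed" and "failed" follow the snippet in Step 4; poll_until_done, its parameters, and the default limits are assumptions. The status fetch and the sleep function are injected so the loop can be tested without network calls.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    max_attempts: int = 8,
    base_delay: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status until the job completes, doubling the wait after each attempt."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status.get("status") == "completed":
            return status
        if status.get("status") == "failed":
            raise RuntimeError(f"job failed: {status.get('error')}")
        # Waits 2s, 4s, 8s, ... capped at max_delay
        sleep(min(base_delay * (2 ** attempt), max_delay))
    raise TimeoutError(f"job not finished after {max_attempts} attempts")


# Simulated job that completes on the third check
responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "completed", "output_url": "https://example.com/clip.mp4"},
])
delays = []
result = poll_until_done(lambda: next(responses), sleep=delays.append)
print(result["output_url"], delays)
```

In the visuals node, fetch_status would wrap the job status request, and the existing try/except around each scene would catch the TimeoutError so the run continues with the remaining scenes.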

Claude returns malformed JSON

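Claude is generally reliable with structured output, but it can occasionally wrap JSON in Markdown code blocks. Add a cleanup step that strips the code-fence markers (three backticks, optionally followed by json) before parsing. Where the model supports it, prefilling the assistant turn with an opening brace, or requesting the output through a tool schema, also helps enforce clean output.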
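
A cleanup step of this kind can be a few lines. The sketch below handles the common case where the whole reply is wrapped in a single code fence; parse_model_json is a hypothetical helper, and the fence string is built at runtime only to keep this example readable.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # the three-backtick Markdown fence marker
_FENCED = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n?{FENCE}\s*$", re.DOTALL)


def parse_model_json(text: str) -> Any:
    """Parse JSON from a model reply, tolerating one surrounding Markdown code fence."""
    cleaned = text.strip()
    match = _FENCED.match(cleaned)
    if match:
        cleaned = match.group(1)  # keep only the text between the fence markers
    return json.loads(cleaned)


wrapped = f'{FENCE}json\n{{"title": "Coral Reefs", "scenes": []}}\n{FENCE}'
print(parse_model_json(wrapped))  # → {'title': 'Coral Reefs', 'scenes': []}
```

If parsing still fails, json.loads raises json.JSONDecodeError, which the script node can catch in order to retry the request or record an entry in errors.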

Scenes feel visually disconnected

This is a prompt quality issue. Have Claude include visual consistency instructions in each scene prompt — recurring color palette, lighting style, character descriptions — rather than treating each scene as independent. You can inject a visual_style_guide field into your state and append it to every HyperFrames prompt.
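
Appending a shared style guide to every prompt can be sketched as follows. apply_style_guide is a hypothetical helper, and the "Style:" separator is an arbitrary choice; adjust the wording to whatever your visual model responds to best.

```python
from typing import Any, Dict, List


def apply_style_guide(scenes: List[Dict[str, Any]], visual_style_guide: str) -> List[Dict[str, Any]]:
    """Return copies of the scenes with the shared style text appended to each prompt."""
    guide = visual_style_guide.strip()
    if not guide:
        return [dict(s) for s in scenes]  # nothing to append; still return copies
    return [
        {**s, "visual_prompt": f"{s['visual_prompt'].rstrip()} Style: {guide}"}
        for s in scenes
    ]


scenes = [{"index": 0, "visual_prompt": "A coral polyp on a rock."}]
styled = apply_style_guide(scenes, "soft underwater light, teal and orange palette")
print(styled[0]["visual_prompt"])
# → A coral polyp on a rock. Style: soft underwater light, teal and orange palette
```

Applying the guide in one place, just before the visuals node runs, keeps the per-scene prompts from Claude short and makes the style easy to change for a whole video.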

How MindStudio Fits Into This Kind of Workflow

The pipeline above is powerful, but it requires local setup, environment variables, Python dependencies, and ongoing maintenance. If you want to expose this workflow to non-technical team members, run it on a schedule, or trigger it from a Slack message or form submission, you need infrastructure on top of the code.

That’s where MindStudio’s AI Media Workbench fits in. MindStudio includes direct access to video generation models, voice synthesis, and media processing tools — all in one place, without managing separate API accounts. You can build the same kind of multi-step video production workflow visually, connect it to triggers (a form, a webhook, a scheduled job), and let team members run it without touching the terminal.

For developers who’ve already built something like the pipeline above, MindStudio’s Agent Skills Plugin (@mindstudio-ai/agent) is worth knowing about. It’s an npm SDK that lets Claude Code and other agentic systems call MindStudio’s 120+ typed capabilities directly — including agent.generateImage(), agent.runWorkflow(), and media processing tools — as simple method calls. That means you can keep Claude Code as your orchestration layer while offloading the infrastructure concerns (rate limiting, retries, auth) to MindStudio.

You can try MindStudio free at mindstudio.ai.

FAQ

What is a video generation workflow?

A video generation workflow is an automated pipeline that takes an input — usually a text brief or script — and produces a video output by coordinating multiple AI tools. Each tool handles a specific part of the production: writing, visual generation, audio synthesis, and assembly. The workflow defines the order and logic for calling each tool.

Can Claude Code actually orchestrate multi-tool pipelines?

Yes, though it works best when paired with a proper orchestration layer like Archon or LangGraph. Claude Code is excellent at writing the logic for each node and helping you reason through the workflow design, but it benefits from an explicit state machine to handle parallel execution, retries, and conditional branching in production.

How long does this kind of pipeline take to run?

It depends heavily on video length and the number of scenes. A 60-second video with 6 scenes might take 5–10 minutes end to end. HyperFrames generation is typically the bottleneck. Running audio and visual generation in parallel (as shown in the Archon graph above) cuts total time compared to running them sequentially.

Do I need separate API accounts for every tool?

For the stack described here — Claude, HyperFrames, ElevenLabs — yes, you’ll need individual API keys. Platforms like MindStudio consolidate access to many of these capabilities under one account if you’d rather not manage multiple subscriptions and keys.

What makes HyperFrames different from other video generation tools?

HyperFrames offers frame-level parameterization — you specify camera motion, transition type, and duration per scene rather than prompting a single model for the whole video. This gives you more narrative control and makes it easier to maintain visual consistency across a longer piece. For automated workflows where you’re generating structured scene data programmatically, this API design is a natural fit.

Is this workflow suitable for production use?

The architecture is production-capable, but the code here is illustrative. Before running this in production, you’d want to add proper error handling throughout, logging, a queue system for managing multiple concurrent jobs, and cost controls to avoid runaway API spending on long videos or large batches.

Key Takeaways

Claude Code handles script generation and scene structuring — it turns a creative brief into structured data the rest of the pipeline can use.
HyperFrames gives you frame-level control over AI video generation, making it well suited for automated multi-scene production.
ElevenLabs handles voice synthesis via a simple REST API that fits cleanly into any pipeline.
Archon (LangGraph) enables proper orchestration: parallel execution, state management, and conditional branching across all nodes.
The biggest production concerns are audio/video length synchronization, HyperFrames timeout handling, and prompt consistency across scenes.
If you want to run this workflow without managing all the infrastructure yourself, MindStudio’s AI Media Workbench and Agent Skills Plugin offer a faster path to the same outcome.