
LatentSync

LatentSync is an end-to-end audio-conditioned lip-sync framework that generates precisely synchronized talking-head videos from a source video and a target audio track.

Publisher: ByteDance
Type: Lip Sync
Context Window: 50,000 tokens
Training Data: August 2025
Price: Free per generation
Provider: WaveSpeed
Modality: Audio + Video

Audio-conditioned lip sync via latent diffusion

LatentSync is an end-to-end lip-synchronization model developed by ByteDance that takes a source talking-head video and a target audio track as inputs and produces a new video with mouth movements precisely aligned to the provided speech. It is built on audio-conditioned latent diffusion, meaning it operates directly in the latent space rather than relying on 3D meshes or 2D facial landmarks. The model supports multiple languages and accents, works with diverse speakers and recording conditions, and handles both real people and stylized characters such as anime.
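In practice, a generation job needs only those two inputs: a source video and a target audio track. The sketch below shows one way such a job might be submitted over HTTP. The endpoint, field names, and response shape are assumptions for illustration only, not the documented MindStudio or WaveSpeed API.

```python
import requests

# Hypothetical endpoint and payload shape, for illustration only.
# The actual MindStudio / WaveSpeed integration may differ.
API_URL = "https://api.example.com/v1/latentsync/generate"

payload = {
    # Source talking-head video whose mouth movements will be re-timed.
    "video_url": "https://example.com/source-talking-head.mp4",
    # Target audio track the subject should appear to speak.
    "audio_url": "https://example.com/target-speech.wav",
}

response = requests.post(API_URL, json=payload, timeout=600)
response.raise_for_status()

# A typical response would reference the generated, lip-synced video;
# the "output_video_url" key is an assumption.
result = response.json()
print(result.get("output_video_url"))
```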

A key technical feature of LatentSync is Temporal REPresentation Alignment (TREPA), a technique designed to reduce flicker, jitter, and frame-to-frame artifacts, keeping head pose and lip motion stable across longer sequences. The model is recommended for use with high-resolution source videos — 720p, 1080p, or 4K — paired with clean audio recordings. It is well suited for workflows involving lip-syncing, audio dubbing, digital human creation, and video-audio alignment.

What LatentSync supports

Lip Synchronization

Aligns mouth movements in a source video to a target audio track end-to-end, without relying on 3D meshes or 2D facial landmarks.

Audio Input Processing

Accepts an audio URL as a conditioning input, supporting multiple languages and accents across diverse speakers and recording conditions.

Video Input Processing

Takes a source talking-head video URL as input and preserves the subject's identity, pose, background, and scene structure in the output.

Temporal Consistency (TREPA)

Uses Temporal REPresentation Alignment to reduce flicker, jitter, and frame-to-frame artifacts, keeping motion stable over long sequences.

High-Resolution Output

Supports source videos at 720p, 1080p, or 4K resolution, delivering sharp facial detail for both real people and stylized characters.
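Because the documentation recommends 720p or higher sources, it can be worth verifying a clip before submitting it. A minimal pre-flight sketch, assuming ffmpeg/ffprobe is installed and using a placeholder file name:

```python
import json
import subprocess

def source_resolution(path: str) -> tuple[int, int]:
    """Read a video's width and height with ffprobe (requires ffmpeg)."""
    probe = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "stream=width,height",
            "-of", "json", path,
        ],
        capture_output=True, text=True, check=True,
    )
    stream = json.loads(probe.stdout)["streams"][0]
    return stream["width"], stream["height"]

width, height = source_resolution("source-talking-head.mp4")
# The LatentSync docs recommend 720p, 1080p, or 4K sources.
if min(width, height) < 720:
    print(f"Warning: {width}x{height} is below the recommended 720p minimum.")
```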


Common questions about LatentSync

What inputs does LatentSync require?

LatentSync requires two inputs: a video URL pointing to a source talking-head video and an audio URL pointing to the target audio track you want the subject to appear to speak.

What video resolution is recommended for best results?

The model documentation recommends using source videos at 720p, 1080p, or 4K resolution, paired with clean, dry audio recordings for optimal output quality.

Does LatentSync support languages other than English?

Yes. LatentSync is multilingual: it supports multiple languages and accents and adapts to diverse speakers and recording conditions.

What is the context window for this model?

The model has a context window of 50,000 tokens, as listed in its metadata.

Does LatentSync use facial landmark detection or 3D meshes internally?

No. LatentSync operates via audio-conditioned latent diffusion and does not rely on 3D meshes or 2D facial landmarks to generate lip-synced output.

What is the training data cutoff for LatentSync?

The model metadata lists the training data cutoff as August 2025.

Parameters & options

Audio (Audio URL)

The target audio track to be synchronized to the video.

Video (Video URL)

The source talking-head video to be lip-synced.
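Since both parameters are passed as URLs, the referenced files must be publicly reachable. A small pre-flight sketch, where the URLs below are placeholders standing in for your hosted files:

```python
import requests

# Check that both documented inputs -- the Audio URL and the Video URL --
# are reachable before submitting a job. URLs are placeholders.
inputs = {
    "audio": "https://example.com/target-speech.wav",
    "video": "https://example.com/source-talking-head.mp4",
}

for name, url in inputs.items():
    head = requests.head(url, allow_redirects=True, timeout=30)
    print(f"{name}: {url} -> HTTP {head.status_code}")
```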

Start building with LatentSync

No API keys required. Create AI-powered workflows with LatentSync in minutes — free.