
LatentSync

LatentSync is an end-to-end audio-conditioned lip-sync framework that generates precisely synchronized talking-head videos from a source video and a target audio track.

Publisher: ByteDance
Type: Lip Sync
Context Window: 50,000 tokens
Training Data: August 2025
Price: Free per generation
Provider: WaveSpeed
Modality: Audio + Video

Audio-conditioned lip sync via latent diffusion

LatentSync is an end-to-end lip-synchronization model developed by ByteDance that takes a source talking-head video and a target audio track as inputs and produces a new video with mouth movements precisely aligned to the provided speech. It is built on audio-conditioned latent diffusion, meaning it operates directly in the latent space rather than relying on 3D meshes or 2D facial landmarks. The model supports multiple languages and accents, works with diverse speakers and recording conditions, and handles both real people and stylized characters such as anime.
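In practice, a generation job needs only those two inputs: a source video and a target audio track. The sketch below shows one way such a job might be submitted over HTTP. The endpoint, field names, and response shape are assumptions for illustration only, not the documented MindStudio or WaveSpeed API.

```python
import requests

# Hypothetical endpoint and payload shape, for illustration only.
# The actual MindStudio / WaveSpeed integration may differ.
API_URL = "https://api.example.com/v1/latentsync/generate"

payload = {
    # Source talking-head video whose mouth movements will be re-timed.
    "video_url": "https://example.com/source-talking-head.mp4",
    # Target audio track the subject should appear to speak.
    "audio_url": "https://example.com/target-speech.wav",
}

response = requests.post(API_URL, json=payload, timeout=600)
response.raise_for_status()

# A typical response would reference the generated, lip-synced video;
# the "output_video_url" key is an assumption.
result = response.json()
print(result.get("output_video_url"))
```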

A key technical feature of LatentSync is Temporal REPresentation Alignment (TREPA), a technique designed to reduce flicker, jitter, and frame-to-frame artifacts, keeping head pose and lip motion stable across longer sequences. The model is recommended for use with high-resolution source videos — 720p, 1080p, or 4K — paired with clean audio recordings. It is well suited for workflows involving lip-syncing, audio dubbing, digital human creation, and video-audio alignment.

What LatentSync supports

Lip Synchronization

Aligns mouth movements in a source video to a target audio track end-to-end, without relying on 3D meshes or 2D facial landmarks.

Audio Input Processing

Accepts an audio URL as a conditioning input, supporting multiple languages and accents across diverse speakers and recording conditions.

Video Input Processing

Takes a source talking-head video URL as input and preserves the subject's identity, pose, background, and scene structure in the output.

Temporal Consistency (TREPA)

Uses Temporal REPresentation Alignment to reduce flicker, jitter, and frame-to-frame artifacts, keeping motion stable over long sequences.

High-Resolution Output

Supports source videos at 720p, 1080p, or 4K resolution, delivering sharp facial detail for both real people and stylized characters.
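Because the documentation recommends 720p or higher sources, it can be worth verifying a clip before submitting it. A minimal pre-flight sketch, assuming ffmpeg/ffprobe is installed and using a placeholder file name:

```python
import json
import subprocess

def source_resolution(path: str) -> tuple[int, int]:
    """Read a video's width and height with ffprobe (requires ffmpeg)."""
    probe = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "stream=width,height",
            "-of", "json", path,
        ],
        capture_output=True, text=True, check=True,
    )
    stream = json.loads(probe.stdout)["streams"][0]
    return stream["width"], stream["height"]

width, height = source_resolution("source-talking-head.mp4")
# The LatentSync docs recommend 720p, 1080p, or 4K sources.
if min(width, height) < 720:
    print(f"Warning: {width}x{height} is below the recommended 720p minimum.")
```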


Common questions about LatentSync

What inputs does LatentSync require?

LatentSync requires two inputs: a video URL pointing to a source talking-head video and an audio URL pointing to the target audio track you want the subject to appear to speak.

What video resolution is recommended for best results?

The model documentation recommends using source videos at 720p, 1080p, or 4K resolution, paired with clean, dry audio recordings for optimal output quality.

Does LatentSync support languages other than English?

Yes. LatentSync is multilingual: it supports multiple languages and accents and adapts to diverse speakers and recording conditions.

What is the context window for this model?

The model has a context window of 50,000 tokens, as listed in its metadata.

Does LatentSync use facial landmark detection or 3D meshes internally?

No. LatentSync operates via audio-conditioned latent diffusion and does not rely on 3D meshes or 2D facial landmarks to generate lip-synced output.

What is the training data cutoff for LatentSync?

The model metadata lists the training data cutoff as August 2025.

Parameters & options

Audio (Audio URL)

The target audio track to be synchronized to the video.

Video (Video URL)

The source talking-head video to be lip-synced.
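Since both parameters are passed as URLs, the referenced files must be publicly reachable. A small pre-flight sketch, where the URLs below are placeholders standing in for your hosted files:

```python
import requests

# Check that both documented inputs -- the Audio URL and the Video URL --
# are reachable before submitting a job. URLs are placeholders.
inputs = {
    "audio": "https://example.com/target-speech.wav",
    "video": "https://example.com/source-talking-head.mp4",
}

for name, url in inputs.items():
    head = requests.head(url, allow_redirects=True, timeout=30)
    print(f"{name}: {url} -> HTTP {head.status_code}")
```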

Start building with LatentSync

No API keys required. Create AI-powered workflows with LatentSync in minutes — free.