How to Use LTX 2.3 Video-to-Video Controls (Pose, Depth, Edge) on LTX Studio Right Now
LTX 2.3 video-to-video is live on LTX Studio before open-source release. Here's how to use pose, depth, and edge controls — and which mode works best for what.
LTX 2.3 Video-to-Video Is Live Right Now — Before the Open-Source Release
You can be running pose, depth, and edge control on your footage inside LTX Studio in about ten minutes. The open-source release is coming — LTX has already shipped depth-to-video and canny-to-video for LTX 2 — but right now, the only place these controls exist is on the LTX Studio platform. If you want to experiment today rather than wait, this is how to do it.
The three modes available are pose control, depth control, and edge control. Each one extracts a different signal from your input video and uses that signal to guide generation. HDR support also landed in this release, which matters less for casual use but is significant for anyone doing professional post-production work. The practical finding from testing: depth mode outperformed pose for camera movement shots, which is the opposite of what you might assume going in.
Here’s what the rest of this post covers: what each mode actually does, when to use which one, the failure modes you’ll hit, and the workarounds that actually work.
What You’re Actually Getting With These Controls
Before touching the interface, it helps to have a mental model of what these three modes are doing under the hood.
Pose control extracts skeleton keypoints from your input video — joint positions, body orientation, limb angles — and uses those as a conditioning signal. The generator then has to produce a character whose body moves in that same skeleton pattern. The input video becomes a motion reference, not a visual reference.
Depth control extracts a per-frame depth map — essentially a grayscale representation of how far each pixel is from the camera. This is particularly useful when your camera is moving, because depth maps capture the parallax and spatial relationships that change as the viewpoint shifts. If you have a tracking shot, a dolly, or any kind of camera movement, depth is the mode that preserves that.
Edge control extracts the structural edges of your input — outlines, boundaries, the Canny-style contour map. It then tries to regenerate the content inside those edges. In theory this gives you fine-grained structural control. In practice it’s the most fragile of the three modes, and it’s the one most likely to produce body deformation artifacts.
The practical hierarchy from testing: depth works best for camera movement and environmental shots, pose works best for human subjects with clear body visibility, and edge is the experimental option you reach for when the other two don’t fit.
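If you want to see roughly what the model conditions on, the sketch below approximates the depth and edge signals for a single frame, using OpenCV's Canny detector and the open MiDaS depth model as stand-ins. LTX's internal extractors aren't exposed, so treat this as an illustration of the signal types, not the actual pipeline; pose would be analogous, with a keypoint estimator producing a skeleton per frame instead.

```python
# Approximate the depth and edge conditioning signals for one frame.
# Assumes opencv-python and torch are installed; MiDaS and Canny are
# stand-ins for LTX's internal extractors, which aren't exposed.
import cv2
import torch

# Open MiDaS depth model via torch.hub (small variant for speed).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform

cap = cv2.VideoCapture("input.mp4")   # placeholder filename
ok, frame_bgr = cap.read()            # first frame only, for illustration
cap.release()

# Edge signal: Canny contour map (thresholds are arbitrary starting points).
gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)

# Depth signal: per-pixel relative distance from the camera.
rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(transform(rgb))
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=rgb.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().numpy()

cv2.imwrite("edges.png", edges)
depth_8bit = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("depth.png", depth_8bit)
```

Looking at the two outputs side by side makes the mode choice intuitive: the depth map preserves spatial layout and parallax, while the edge map preserves outlines and nothing else.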
What You Need Before You Start
LTX Studio account. This is currently the only place LTX 2.3 video-to-video controls are available. You’ll need credits — the free tier exists but burns quickly when you’re testing multiple modes on the same clip.
Your input video. A few things matter here. First: your clip needs to be longer than roughly 2 seconds. This is a hard limitation — LTX 2.3 video-to-video simply fails on very short clips, and it’s not documented anywhere officially. Second: for pose control specifically, you want the subject’s face visible early in the clip. If you have an establishing shot that starts on a wide or a back-of-head angle, pose control will struggle with identity consistency throughout.
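Because the roughly 2-second floor fails outright rather than producing degraded output, it's worth checking duration before you spend credits. A minimal sketch, assuming ffprobe is installed; the filename is a placeholder:

```python
import subprocess

def clip_duration_seconds(path: str) -> float:
    """Read a clip's duration via ffprobe (must be on your PATH)."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

if clip_duration_seconds("input.mp4") <= 2.0:
    print("Too short for LTX 2.3 video-to-video; pad or half-time it first.")
```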
A reference image or style prompt. The video-to-video controls handle motion; you still need to tell the model what the output should look like. This can be a static image (for character-driven work), a style description in the prompt, or a restylized frame you’ve run through something else first.
Optional: video editing software. You’ll want this for the workarounds covered in the troubleshooting section — specifically for reversing clips and trimming tails.
Running Each Mode: Step by Step
Step 1: Prepare your input clip
Trim your clip to the segment you want to process. Keep it above 2 seconds — if you’re right at the edge, add a couple of seconds of padding at the end. You’ll cut the tail off later.
If your clip starts without a clear face or character reference (establishing shot, wide angle, subject walking away from camera), consider reversing the clip before upload. The logic: if the clip ends on a close-up but starts wide, reversing it means the model sees the face first and builds identity from that. You then reverse the output to restore the original direction. This sounds awkward but it works, and it’s the practical solution for footage where you don’t control the shot order.
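If you go the reversal route, ffmpeg handles it in one pass. A small sketch, assuming ffmpeg is installed and the filenames are placeholders; running the same function on the generated output later restores the original direction:

```python
import subprocess

def reverse_clip(src: str, dst: str) -> None:
    """Reverse video and audio. The reverse filters buffer the whole
    stream in memory, which is fine for short clips like these."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", "reverse", "-af", "areverse", dst],
        check=True,
    )

reverse_clip("establishing_shot.mp4", "face_first.mp4")   # before upload
# ...generate in LTX Studio, download the result, then:
# reverse_clip("ltx_output.mp4", "final.mp4")             # restore direction
```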
Now you have: a clip that’s over 2 seconds, starts with your strongest reference frame, and is ready to upload.
Step 2: Upload to LTX Studio and select video-to-video
Navigate to the video-to-video section in LTX Studio. Upload your prepared clip. You’ll see the three control mode options: pose, depth, edge.
Select your mode based on the shot type:
- Camera movement, tracking shots, environmental footage → depth
- Human subject, clear body visibility, character animation → pose
- Structural/architectural content, experimental → edge
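If you're scripting several shots and want that selection rule in code, here's a trivial sketch that just encodes the hierarchy above; the function and its arguments are illustrative, not part of any LTX Studio API:

```python
def pick_control_mode(camera_moves: bool, human_subject: bool, body_visible: bool) -> str:
    """Mirror the shot-type hierarchy above: depth whenever the camera moves,
    pose for clearly visible human subjects, edge as the experimental option."""
    if camera_moves:
        return "depth"
    if human_subject and body_visible:
        return "pose"
    return "edge"

print(pick_control_mode(camera_moves=True, human_subject=True, body_visible=True))  # "depth"
```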
Now you have: your clip loaded with a control mode selected.
Step 3: Set your reference and prompt
This is where you define what the output looks like. You have a few options:
If you’re doing character restylization, provide a reference image. A single well-chosen frame works. If you want a specific visual style, describe it in the prompt — be specific about the aesthetic, not just the subject. “Early 1980s cel-shaded anime, flat color fills, visible line weight variation” will outperform “anime style.”
If you’re doing a stylization transfer (taking live-action footage and converting it to a different visual register), the prompt carries more weight. LTX 2.3 can handle this kind of transfer, though the output tends to land in a hybrid 3D-animation look rather than pure cel shading, which may or may not be what you want.
Now you have: a configured job ready to run.
Step 4: Run and evaluate by mode
Submit the job and evaluate the output with mode-specific expectations.
For depth mode: check that camera movement is preserved. The depth signal should carry the spatial relationships through the generation. If you had a tracking shot, the parallax should feel correct. This is where depth mode earns its keep — it’s the only mode that reliably handles camera motion.
For pose mode: check identity consistency across the clip, especially in profile shots and fast movement. Pose mode will nail the motion but can drift on facial identity, particularly if you’re not providing a strong character reference. Tattoos, accessories, and fine details are the first things to go.
For edge mode: check structural integrity. Edge mode is the most likely to produce body deformation artifacts — limbs bending in unexpected directions, proportions shifting. If you get clean output from edge mode, it’s a good sign. If you get deformation, that’s expected behavior, not a bug.
Now you have: output video in your chosen mode, evaluated against mode-appropriate criteria.
Step 5: Handle the tail
If you added padding to get above the 2-second minimum, trim it off now. If you reversed the clip for the face-first workaround, reverse the output back to the original direction.
This is also the point where you decide whether to composite the output or use it as-is. For multi-character scenes where different shots need different modes, you may want to process each shot separately and cut them together.
Now you have: a finished clip ready for use or further processing.
The Failure Modes You’ll Actually Hit
The 2-second cliff
This is the most important thing to know going in. LTX 2.3 video-to-video does not work on clips shorter than approximately 2 seconds. It doesn’t produce degraded output — it fails. The workaround is to half-time the clip (slow it to 50% speed) before processing, which doubles the duration. You’ll lose temporal resolution and may get lip-sync issues as a result, but you’ll get output. After processing, cut the tail to remove the padding frames.
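Here's what the half-time workaround looks like as a script, assuming ffmpeg is installed; filenames and the trim length are placeholders:

```python
import subprocess

def half_time(src: str, dst: str) -> None:
    """Slow the clip to 50% speed, doubling its duration past the ~2s floor."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-vf", "setpts=2.0*PTS",   # double video timestamps = half speed
         "-af", "atempo=0.5",       # keep audio in sync at half tempo
         dst],
        check=True,
    )

def trim_tail(src: str, dst: str, keep_seconds: float) -> None:
    """Keep only the first keep_seconds of the generated output.
    Re-encodes for a frame-accurate cut; add '-c copy' if keyframe cuts are fine."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-t", str(keep_seconds), dst],
        check=True,
    )

half_time("too_short.mp4", "padded.mp4")       # before upload
# ...generate, download the result, then:
trim_tail("ltx_output.mp4", "final.mp4", keep_seconds=3.0)
```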
Pose mode and establishing shots
If your clip starts without a clear face — subject walking away, wide establishing shot, back-of-head angle — pose mode will struggle to maintain character identity. The model doesn’t have a strong reference to anchor to. The reversal workaround (run the clip backwards so the face appears first, process, then reverse the output) addresses this directly. It’s counterintuitive but effective.
Edge mode deformation
Edge mode is genuinely experimental. Body deformation artifacts — limbs at wrong angles, proportions shifting mid-clip — are common. This isn’t a settings problem you can tune away; it’s the mode being fragile. Use edge mode when you specifically want structural control and you’re prepared to iterate. Don’t use it as a fallback when pose and depth don’t work.
Identity drift in pose mode
Even with a strong reference image, pose mode can drift on facial identity across a clip, especially on longer clips or fast motion. This is a known limitation of the vanilla model. The ComfyUI workflow combining IC-LoRA + ID-LoRA + prompt relay addresses this directly — the ID-LoRA component locks character identity more aggressively. That workflow runs LTX 2.3 locally, so it’s not available on LTX Studio, but it’s the open-source path for anyone who needs tighter identity consistency than the platform provides.
Credit burn rate
Testing multiple modes on the same clip burns credits fast. If you’re on a limited credit budget, decide on your mode before running — don’t use credits to compare all three on the same footage unless you have a specific reason to.
Where to Take This Further
The open-source release of these controls is coming. LTX’s track record — depth-to-video and canny-to-video both shipped for LTX 2 — suggests this isn’t vaporware. When it lands in ComfyUI, the combination of pose/depth/edge controls with the existing IC-LoRA + ID-LoRA + prompt relay workflow will be significant. Right now those two things exist in separate environments (LTX Studio for the controls, ComfyUI for the LoRA stack). When they converge, you’ll have a locally-runnable pipeline with both motion control and identity consistency.
For the HDR support specifically: if you’re doing professional work where color grading and dynamic range matter, HDR is worth understanding even though it adds a step. The LTX Studio platform exposes this, and it’s the kind of feature that matters more as the outputs get used in actual production pipelines rather than just demos.
If you’re building workflows that chain video generation with other AI steps — say, generating a reference image, running it through video-to-video, then doing automated quality checks — MindStudio handles this kind of multi-model orchestration visually, with 200+ models, 1,000+ integrations, and a workflow builder that doesn’t require you to write the glue code yourself. It’s worth looking at when the problem shifts from “run one model” to “chain several models with conditional logic between them.”
The depth mode finding deserves emphasis because it’s not obvious: for any shot with camera movement, depth is the right mode even though pose feels like the intuitive choice for human subjects. The depth map carries the spatial information that makes camera motion feel correct. Pose carries skeleton information, which is great for body movement but doesn’t encode the camera’s relationship to the scene. If you’re shooting with any kind of camera motion — even subtle handheld movement — start with depth.
For anyone interested in training their own video models on custom footage, there’s an open-source dataset creation tool that points at a folder, slices videos into training datasets, handles cropping and tagging automatically, and has an 8-minute YouTube tutorial (Chinese with subtitles, English UI toggle available). This is the upstream step — building the data that would eventually let you fine-tune a model on your specific characters or visual style. It’s not beginner territory, but if you’ve been curious about how video training datasets actually get built, it’s a concrete starting point.
The broader context: LTX 2.3 is not competing with Wan 2.1 or CogVideoX on raw quality. What LTX offers is a model you can run locally, make API calls to without the cost ceiling, and now control with pose/depth/edge signals. For AI video generation workflows where you need motion control rather than maximum fidelity, that’s a real trade-off worth making. The same logic applies when you’re deciding between browser-based tools and local pipelines — understanding how browser automation fits into Claude Code workflows gives you a useful mental model for thinking about where to run things and where to keep control local.
One thing worth building toward: the combination of first-frame/last-frame control (already available in LTX 2.3) with depth or pose control on the motion signal is a powerful compositional approach. You constrain the start and end states, and you constrain the motion path. That’s a lot of the creative problem solved before the model has to make any decisions. The LTX Desktop open-source video editor is being built around this same engine, which means these controls will eventually be accessible in a local nonlinear editing environment — not just through a web platform.
The controls are live. The credits are finite. The open-source release is coming but isn’t here yet. If you want to understand how these modes behave before everyone has access to them locally, now is the time to run the tests.
For those thinking about how to build production tooling on top of video generation APIs — the kind of thing where you’re chaining model calls, storing outputs, and building user-facing interfaces — Remy takes a different approach to that problem: you write the application as an annotated spec in markdown, and it compiles into a full TypeScript stack with backend, database, auth, and deployment. The spec is the source of truth; the code is derived output. It’s a different abstraction layer than prompt engineering, but relevant when the question shifts from “how do I generate a video” to “how do I build a product around video generation.”
The pose/depth/edge distinction is worth internalizing because it maps directly to what information you’re actually conditioning on. Pose = skeleton. Depth = spatial geometry. Edge = structural outline. Once you have that mental model, the mode selection becomes obvious for most shots — and the cases where it’s not obvious (fast motion, multi-character scenes, shots with both camera movement and character action) are exactly the cases where you need to test rather than assume.