Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

2026-04-06 · Computer Vision and Pattern Recognition

AI summary

The authors present Vanast, a new method that generates videos of a person wearing different clothes and moving in new ways, given only one image of the person, images of the garments, and a video showing the desired poses. Older methods perform clothing transfer and animation as separate steps, which can distort the clothes or lose the person's identity; Vanast instead combines everything into one step for better results. The authors build purpose-made training data and a model design that keep the person's appearance consistent and the garments accurate while still allowing flexible garment changes. The approach works well across many types of clothes and movements.

virtual try-on, pose-guided animation, identity preservation, garment transfer, video diffusion transformers, triplet supervision, zero-shot learning, generative models, image synthesis, pose guidance
Authors
Hyunsoo Cha, Wonjung Woo, Byungjun Kim, Hanbyul Joo
Abstract
We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline generates identity-preserving human images in outfits that differ from the garment catalog images, captures full upper- and lower-garment triplets to overcome the limitation of single-garment, posed-video pairs, and assembles diverse in-the-wild triplets that require no garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.
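To make the data setting concrete, the following is a minimal sketch of the triplet supervision record and the single-stage conditioning the abstract describes. All names here (`TrainingTriplet`, `unified_forward`, the file names) are hypothetical illustrations, not identifiers from the paper; the point is only that person image, garment images, and pose video enter one model together, rather than flowing through a try-on stage and then an animation stage.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical illustration of one triplet supervision sample: a person
# image, garment catalog images (upper and/or lower), a pose guidance
# video, and the target garment-transferred animation it supervises.
@dataclass
class TrainingTriplet:
    person_image: str          # single reference image of the person
    garment_images: List[str]  # upper/lower garment catalog images
    pose_video: str            # driving video providing the motion
    target_video: str          # ground-truth garment-transferred animation

def unified_forward(triplet: TrainingTriplet) -> str:
    """Single-stage synthesis: all conditions are consumed jointly,
    instead of cascading try-on output into an animation model."""
    conditions = [triplet.person_image,
                  *triplet.garment_images,
                  triplet.pose_video]
    return f"video conditioned on {len(conditions)} inputs"

example = TrainingTriplet(
    person_image="person.jpg",
    garment_images=["top.jpg", "skirt.jpg"],
    pose_video="dance_pose.mp4",
    target_video="dance_tryon.mp4",
)
print(unified_forward(example))  # -> video conditioned on 4 inputs
```

In a two-stage pipeline the animation model would only see the try-on output image, so errors compound; in the unified setting above, every condition remains available to the generator at once.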