MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On
2026-06-09 • Computer Vision and Pattern Recognition
AI summary
The authors developed a virtual try-on method called MOFA-VTON that lets users adjust how clothes fit on a person's body using simple sketches. Unlike older methods that just swap clothes in a fixed way, this approach allows more precise control over where and how the clothes appear, making the look more realistic and varied. They use special mask techniques and attention-based blocks to separately manage the upper and lower clothing areas for better flexibility. Tests show that their method works better than previous ones and offers more options for virtual fashion try-on.
virtual try-on, mask construction, cross-attention mechanism, layout adjustment, VITON-HD dataset, DressCode dataset, fine-grained adaptation, sketch-based guidance, image generation
Authors
Xiaoyu Han, Chenyang Wang, Jing Wang, Shunyuan Zheng, Quanling Meng, Shengping Zhang
Abstract
Virtual try-on aims to fit an in-shop clothing image onto a specific human body. An ideal virtual try-on method should provide diverse and flexible dressing options that reflect the varied wearing styles encountered in real life and can be tailored to individual preferences. However, current methods predominantly replace the original clothing directly with the target clothing, always following the same dressing pattern. This limited control over clothing adaptation can lead to fixed and monotonous try-on outputs. To explore More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On, we propose MOFA-VTON, a novel virtual try-on method that lets users adjust how clothing is adapted in the try-on result by drawing simple sketches. Specifically, we first design a mask construction strategy that transforms user-drawn curve sketches into a dual-region mask, which replaces the traditional clothing-agnostic mask and provides fine-grained layout guidance for the subsequent generation process. Further, we propose layout adjustment blocks that use the cross-attention mechanism to independently learn layout correspondences for the upper and lower regions of the human body, refining the spatial arrangement of the two regions. Together, these components enable flexible and fine-grained adaptations of the target clothing, overcoming the constraints of a fixed layout. Extensive experiments on the VITON-HD and DressCode datasets demonstrate that MOFA-VTON outperforms previous state-of-the-art methods and provides more fashion possibilities for virtual try-on.
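The abstract does not specify how the dual-region mask or the layout adjustment blocks are implemented, so the following is only a minimal NumPy sketch of the two ideas under stated assumptions, not the authors' method. It assumes the user sketch is a single curve given as (x, y) points with distinct x values, that the curve separates an upper region from a lower region inside a binary clothing-agnostic mask, and that each region is handled by a plain single-head cross-attention restricted to that region's positions. The function names `dual_region_mask` and `masked_cross_attention` are hypothetical.

```python
import numpy as np

def dual_region_mask(agnostic_mask, curve_xy):
    """Split a binary mask into upper/lower regions along a sketched curve.

    agnostic_mask: (H, W) bool array, True where clothing is to be generated.
    curve_xy: iterable of (x, y) points on the user-drawn boundary
              (assumption: distinct x values, image coordinates).
    Returns (upper, lower): bool masks that partition agnostic_mask.
    """
    h, w = agnostic_mask.shape
    pts = np.asarray(sorted(curve_xy), dtype=float)   # sort points by x
    xs, ys = pts[:, 0], pts[:, 1]
    # Boundary row for every column; values outside the sketched x-range
    # are clamped to the curve's end points by np.interp.
    boundary = np.interp(np.arange(w), xs, ys)        # shape (W,)
    rows = np.arange(h)[:, None]                      # shape (H, 1)
    above = rows < boundary[None, :]                  # shape (H, W)
    return agnostic_mask & above, agnostic_mask & ~above

def masked_cross_attention(q, k, v, region):
    """Single-head cross-attention whose output is kept only inside `region`.

    q: (N, d) query features (e.g. flattened person features).
    k, v: (M, d) key/value features (e.g. garment features).
    region: (N,) bool, True for query positions belonging to this region.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                     # (N, M)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ v                                 # (N, d)
    return np.where(region[:, None], out, 0.0)        # zero outside the region

# Usage: a flat horizontal sketch at row 2 splits a 4x6 mask in half, and
# each half attends to its own (here random) garment features.
mask = np.ones((4, 6), dtype=bool)
upper, lower = dual_region_mask(mask, [(0, 2), (5, 2)])
rng = np.random.default_rng(0)
q = rng.normal(size=(24, 8))
out_upper = masked_cross_attention(q, rng.normal(size=(5, 8)),
                                   rng.normal(size=(5, 8)), upper.ravel())
out_lower = masked_cross_attention(q, rng.normal(size=(5, 8)),
                                   rng.normal(size=(5, 8)), lower.ravel())
combined = out_upper + out_lower                      # regions do not overlap
```

Because the two masks partition the clothing area, summing the two region outputs gives every position exactly one region's features. Moving the sketched curve up or down changes which positions fall in each region, which is the kind of layout control the abstract describes.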