DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction

2026-03-03

Computer Vision and Pattern Recognition

AI summary

The authors present DuoMo, a method that recovers how people move in real-world space from ordinary videos, even when those videos are noisy or incomplete. It works in two steps: first, it estimates movement relative to the camera, and then it lifts and refines that estimate so it is consistent in world coordinates. This two-part approach lets the method handle many different scenes and motion paths while producing detailed, globally consistent results. The method is also notable for operating directly on detailed body shapes (mesh vertices) rather than relying on simpler parametric motion models. Experiments show DuoMo substantially reduces world-space reconstruction error compared to prior methods.

human motion capture, world-space coordinates, camera-space coordinates, diffusion models, motion reconstruction, mesh vertices, parametric models, foot skating, EMDB dataset, RICH dataset
Authors
Yufu Wang, Evonne Ng, Soyong Shin, Rawal Khirodkar, Yuan Dong, Zhaoen Su, Jinhyung Park, Kris Kitani, Alexander Richard, Fabian Prada, Michael Zollhofer
Abstract
We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly and bypassing parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space reconstruction error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space error. Project page: https://yufu-wang.github.io/duomo/
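The two-stage factorization described above can be sketched in code. The snippet below is a minimal toy illustration, not DuoMo's actual implementation: all function names, shapes, and the placeholder "denoising" logic (temporal smoothing and a single rigid camera-to-world transform) are assumptions standing in for the paper's two diffusion models.

```python
import numpy as np

# Toy sketch of the two-stage pipeline: a camera-space stage that estimates
# per-frame vertex motion from noisy observations, followed by a world-space
# stage that lifts the result into world coordinates. Names and shapes are
# illustrative only.

T, V = 8, 6  # frames, mesh vertices (tiny toy sizes)

def camera_space_stage(noisy_obs):
    """Stage 1 (stand-in): estimate vertex motion in camera coordinates.

    Here a temporal box filter plays the role of the camera-space
    diffusion model's denoising.
    """
    kernel = np.ones(3) / 3.0
    out = np.empty_like(noisy_obs)
    for v in range(noisy_obs.shape[1]):
        for d in range(3):
            out[:, v, d] = np.convolve(noisy_obs[:, v, d], kernel, mode="same")
    return out

def world_space_stage(cam_motion, cam_to_world):
    """Stage 2 (stand-in): lift camera-space motion into world coordinates.

    A single rigid transform (R, t) plays the role of the world-space
    diffusion model's lifting and global refinement.
    """
    R, t = cam_to_world
    return cam_motion @ R.T + t

# Noisy camera-space "observations": T frames of V mesh vertices in 3D.
rng = np.random.default_rng(0)
obs = rng.normal(size=(T, V, 3))

cam_motion = camera_space_stage(obs)  # (T, V, 3), camera coordinates
world_motion = world_space_stage(cam_motion, (np.eye(3), np.zeros(3)))
print(world_motion.shape)  # (8, 6, 3)
```

Note that the sketch operates on raw vertex trajectories of shape (frames, vertices, 3), mirroring the paper's choice to generate mesh-vertex motion directly rather than parameters of a body model.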