UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

2026-03-23

Computer Vision and Pattern Recognition; Artificial Intelligence
AI summary

The authors introduce UniMotion, a new model that can understand and generate human motion, language, and images together in a single system. Unlike past models, which handle only some of these modalities and rely on token-based methods that introduce errors, UniMotion treats motion as continuous data, just like images, improving accuracy and smoothness. The authors design dedicated components that let motion and images share the same representation space, along with techniques that teach the model visual motion concepts even when no images are available at test time. The approach performs well across many tasks involving any combination of motion, language, and images.

human motion, natural language processing, RGB images, variational autoencoder (VAE), cross-modal learning, dual-path embedding, KL divergence, self-supervised learning, latent space, motion generation
Authors
Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu
Abstract
We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
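The Dual-Posterior KL Alignment described above presumably builds on the standard closed-form KL divergence between the diagonal Gaussian posteriors of the two encoders. The sketch below illustrates that building block only; the function name, the distillation direction (motion-only posterior pulled toward the vision-fused one), and all shapes are assumptions for illustration, not details given in the abstract.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between diagonal Gaussians, summed over dims.

    Illustrative stand-in for a DPA-style alignment term: q could be the
    motion-only posterior and p the vision-fused posterior (assumed roles).
    """
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Identical posteriors: divergence is zero, so the alignment loss vanishes.
mu, logvar = np.zeros(4), np.zeros(4)
print(gaussian_kl(mu, logvar, mu, logvar))   # → 0.0

# Mean-shifted posterior: KL(N(0,I) || N(1,I)) = 0.5 per dim, 2.0 over 4 dims.
print(gaussian_kl(mu, logvar, np.ones(4), logvar))   # → 2.0
```

Minimizing such a term over the motion encoder's parameters would pull its posterior toward the richer vision-fused one, which is consistent with the distillation role the abstract assigns to DPA.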