Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

2026-03-10

Computer Vision and Pattern Recognition, Information Retrieval
AI summary

The authors focus on improving how computers match written descriptions with 3D human motions. Instead of using a simple summary for the whole motion and text, they break down motions into detailed joint movements and represent these as images. They use a special method to compare text and motion piece by piece, making the match more accurate and easier to understand. Tests show their approach works better than previous ones and helps explain which parts of the text match specific movements.

text-motion retrieval, 3D human motion, joint-angle representation, Vision Transformer, MaxSim, Masked Language Modeling, dual-encoder framework, latent space, token-wise interaction
Authors
Yao Zhang, Zhuchenyang Liu, Yanlan He, Thomas Ploetz, Yu Xiao
Abstract
Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
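The MaxSim late interaction mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the array shapes, dimensions, and function name below are hypothetical, and the scoring follows the generic ColBERT-style formulation, where each text token is matched against its most similar motion patch and the per-token maxima are summed.

```python
import numpy as np

def maxsim_score(text_tokens, motion_patches):
    """MaxSim late interaction (ColBERT-style, hypothetical sketch):
    each text token is matched to its most similar motion patch,
    and the per-token maxima are summed into one retrieval score.

    text_tokens:    (T, d) L2-normalized text token embeddings
    motion_patches: (P, d) L2-normalized motion patch embeddings
    """
    sim = text_tokens @ motion_patches.T       # (T, P) cosine similarities
    return float(sim.max(axis=1).sum())        # sum of per-token maxima

# Toy example with random normalized embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 8))
text /= np.linalg.norm(text, axis=1, keepdims=True)
motion = rng.normal(size=(12, 8))
motion /= np.linalg.norm(motion, axis=1, keepdims=True)

score = maxsim_score(text, motion)
```

Because the score decomposes over text tokens, the argmax over patches for each token indicates which motion patch that token aligned with, which is the kind of fine-grained, interpretable correspondence the abstract describes.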