NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models
2026-06-03 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Computation and Language
AI summary
The authors created NextMotionQA, a new test set to better evaluate how well AI models understand human movements in videos. Unlike previous tests, their benchmark includes multiple types of questions with different difficulty levels and detailed tasks to find exactly where models struggle. They tested twelve vision-language models and found some weaknesses that older tests missed. The authors also checked if these models can judge motion descriptions accurately and found they do well on simple checks but not on detailed, part-level evaluations.
human motion understanding, embodied AI, vision-language models, benchmark dataset, multiple-choice question answering, video captioning, error correction, semantic granularity, Cohen's kappa, model evaluation
Authors
Yong Cao, Chuqiao Li, Xianghui Xie, Gerard Pons-Moll, Andreas Geiger
Abstract
Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models (VLMs) for semi-automated, expert-verified dataset construction. NextMotionQA features three complementary tasks: multiple-choice question answering, video captioning, and fine-grained error correction. Each task is systematically structured across three core semantic axes and stratified into three task complexity levels. Our extensive evaluation of twelve representative VLMs uncovers critical capability gaps and weaknesses that remain invisible under conventional, single-task evaluations. In a complementary direction, recent work has begun using VLMs as judges for text-to-motion evaluation; we ask whether they show the same degradation under harder tasks. We find that VLMs align strongly with expert ratings on coarse criteria (Cohen's κ=0.70) but break down on fine-grained, part-level judgment (κ=0.10), validating the paradigm in its strong regime while clarifying its limits.
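For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of unweighted Cohen's κ between two raters, for example an expert and a VLM judge. The abstract does not say which κ variant or label set the authors use, so the unweighted form is an assumption, and the function name `cohens_kappa` and the toy labels are hypothetical illustrations, not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items on which the raters give the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that the raters agree if each labelled items
    # independently according to their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Degenerate case: both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical toy labels (1 = motion matches the description, 0 = it does not).
    expert = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    vlm_judge = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
    print(f"kappa = {cohens_kappa(expert, vlm_judge):.2f}")  # prints "kappa = 0.60"
```

In this toy case the raters agree on 8 of 10 items (observed agreement 0.8) while chance agreement is 0.5, so κ = (0.8 − 0.5) / (1 − 0.5) = 0.6. Because κ discounts chance agreement, a value near 0.10 indicates agreement only slightly better than chance even when raw agreement looks moderate.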