EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking
2026-03-06 • Computer Vision and Pattern Recognition
AI summary
The authors focus on understanding egocentric videos, which are complex because the camera and objects move a lot over time. They tackle specific tasks like counting how many times someone interacts with objects or figuring out where objects stay still, each requiring different types of thinking. Their solution, EgoReasoner, uses a two-step method that adapts how the model thinks and learns for each task, improving its reasoning and accuracy. By training on a relatively small dataset, their model performs significantly better than a larger existing model on a challenging video understanding benchmark.
Egocentric video · 4D reasoning · Chain-of-Thought (CoT) · Task-adaptive thinking · Reinforcement learning · Spatial anchoring · Temporal tracking · HD-EPIC benchmark · Reinforcement fine-tuning · GRPO
Authors
Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel
Abstract
Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks (fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization) that require fundamentally different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. We observe that these structural differences make task-agnostic approaches insufficient: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning actively destabilizes performance on spatial tasks. To address this, we propose EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task's cognitive structure. In the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces that teach the model to reason adaptively across task types via supervised fine-tuning. In the second stage, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, selectively strengthening each reasoning pathway via reinforcement fine-tuning with GRPO. Our 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
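To make the second stage concrete, the sketch below shows the group-relative advantage computation at the core of GRPO, paired with a hypothetical task-aware reward that mixes answer correctness with partial credit for entity grounding and temporal alignment, as the abstract describes. The `task_reward` weights and function names are illustrative assumptions, not values from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """GRPO-style advantages: standardize each sampled response's reward
    against the mean/std of its own sampling group (no learned critic)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against uniform groups
    return [(r - mean) / std for r in rewards]

def task_reward(answer_correct, entities_grounded, temporally_aligned):
    """Hypothetical task-aware reward: correctness plus partial credit
    for verified entity grounding and temporal alignment (weights are
    illustrative, not the paper's)."""
    return (1.0 * answer_correct
            + 0.3 * entities_grounded
            + 0.2 * temporally_aligned)

# A group of sampled reasoning traces for one prompt:
group = [
    task_reward(True, True, True),    # fully verified trace
    task_reward(True, False, True),   # right answer, ungrounded entities
    task_reward(False, True, False),  # wrong answer, some grounding
    task_reward(False, False, False), # unverified trace
]
print([round(a, 3) for a in grpo_advantages(group)])
```

Because advantages are centered within each group, traces that satisfy more of the verifiers are pushed up relative to their siblings, which is how such a reward could selectively strengthen a task's reasoning pathway.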