Face versus Body Tracking for Human-Robot Interaction: An Egocentric Dataset
2026-06-02 • Robotics
Robotics • Computer Vision and Pattern Recognition • Human-Computer Interaction
AI summary
The authors point out that robots need to keep track of people constantly to interact well, but current computer vision tools are designed for things like self-driving cars, not close-up social situations. They created a special dataset from the robot’s viewpoint to study how people move and block each other in real conversations. Their tests showed that remembering where people were helps when someone is hidden for a while, but complex movements still confuse the robot. Adding appearance checks helps with tracking bodies but makes face tracking errors worse because faces can look very different from side views. Overall, their improved system cuts identity mix-ups in half, helping robots maintain better conversations.
human-robot interaction • egocentric dataset • identity switches • appearance re-identification • spatial memory • face tracking • body tracking • occlusion • computer vision • social dynamics
Authors
Jessica Wenninger, Gabriel Skantze
Abstract
To enable meaningful human-robot interaction (HRI), a robot must continuously assess engagement by consistently tracking users over time. State-of-the-art computer vision models, however, are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as humans bouncing, obstructing each other, or leaving the frame. Frequent identity switches (IDSW) cause the robot to lose its footing mid-conversation. To address this, we introduce a novel, custom-annotated egocentric dataset collected via the Furhat robot to capture complex social dynamics. We present a systematic evaluation isolating detection errors from tracking logic, comparing face versus body tracking, and assessing the impact of extended spatial memory and appearance re-identification (ReID). Results indicate that increasing spatial memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but exhibits opposing effects: it substantially improves body tracking stability, yet causes facial IDSW to spike due to profile angle sensitivity. Ultimately, our optimized pipeline reduces IDSW by 49%, mitigating interaction breakdowns. Because standard benchmarks lack dense, close-quarter occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.
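To make the IDSW metric concrete, the following is a minimal sketch of how identity switches are commonly counted in multi-object tracking evaluation. It is not the authors' evaluation code. It assumes the matching between annotated people and tracker outputs has already been done per frame, and the function name `count_identity_switches`, the input layout, and the example IDs are all hypothetical.

```python
from typing import Dict, List


def count_identity_switches(frames: List[Dict[str, int]]) -> int:
    """Count identity switches over a sequence of frames.

    Assumed input (hypothetical layout): frames[t] maps each annotated
    person's ground-truth ID to the tracker ID matched to them in frame t.
    A person who is occluded or unmatched is simply absent from that frame.
    """
    last_track: Dict[str, int] = {}  # last tracker ID seen for each person
    switches = 0
    for matches in frames:
        for gt_id, track_id in matches.items():
            previous = last_track.get(gt_id)
            # A switch occurs when a person reappears under a different
            # tracker ID than the one they were last matched to.
            if previous is not None and previous != track_id:
                switches += 1
            last_track[gt_id] = track_id
    return switches


# Made-up example: "bob" is occluded in frame 1 and returns with a new ID,
# then "alice" is reassigned to a different ID in frame 3.
example_frames = [
    {"alice": 1, "bob": 2},
    {"alice": 1},
    {"alice": 1, "bob": 3},
    {"alice": 2, "bob": 3},
]
print(count_identity_switches(example_frames))
```

Under this simplified definition, an occlusion gap does not count as a switch by itself; only a changed ID on reappearance does, which is the failure mode that spatial memory and ReID are meant to reduce.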