EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera
2026-05-12 • Computer Vision and Pattern Recognition • Graphics
AI summary
The authors developed EgoForce, a method that uses a single head-worn camera to accurately estimate the 3D pose and shape of a user's hand from their viewpoint. The method works with various camera types, including fisheye and wide-angle lenses, using one unified model. It combines a differentiable forearm representation with a transformer network to improve hand pose estimation and overcome the depth-scale ambiguity common in single-camera setups. Tests showed EgoForce performs better than previous methods and works well across different camera types, making it useful for augmented reality and similar applications.
Keywords
3D hand pose estimation, egocentric vision, monocular RGB camera, depth-scale ambiguity, transformer network, fisheye camera, wide field of view, pose reconstruction, ray space solver, forearm representation
Authors
Christen Millerdurai, Shaoxiang Wang, Yaxu Xie, Vladislav Golyanik, Didier Stricker, Alain Pagani
Abstract
Reconstructing the absolute 3D pose and shape of the hands from the user's viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose, shape, and position in the user's (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose; a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity; and a ray-space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods while maintaining consistent performance across camera configurations. For more details, visit the project page at https://dfki-av.github.io/EgoForce.
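The abstract names a ray-space closed-form solver for absolute pose recovery but does not spell out its formulation. As a rough illustration of the general idea (not the authors' implementation), the sketch below recovers an absolute camera-space translation for a root-relative 3D hand prediction by aligning the predicted joints to per-pixel viewing rays in closed form; working with rays rather than pixel coordinates is what makes the step camera-model agnostic, matching the unified fisheye/perspective/wide-FOV setting described above. The function name and interface are hypothetical.

```python
# Minimal sketch of a ray-space closed-form translation solver (an assumption
# about the generic technique, not the EgoForce implementation): given
# root-relative 3D joints and the unit viewing rays of their 2D detections,
# solve for the translation t minimizing  sum_i ||(I - d_i d_i^T)(x_i + t)||^2,
# i.e. the distance of each translated joint to its viewing ray. Because the
# rays already encode the camera model, the same solver applies to fisheye,
# perspective, and distorted wide-FOV lenses alike.
import numpy as np

def solve_translation_ray_space(joints_rel: np.ndarray, rays: np.ndarray) -> np.ndarray:
    """joints_rel: (N, 3) root-relative joints in the camera's orientation.
    rays:       (N, 3) unit viewing rays (unprojected 2D keypoints).
    Returns the least-squares camera-space translation t, shape (3,)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for x, d in zip(joints_rel, rays):
        P = np.eye(3) - np.outer(d, d)  # projector onto the plane orthogonal to ray d
        A += P                          # normal equations: (sum_i P_i) t = -sum_i P_i x_i
        b -= P @ x
    return np.linalg.solve(A, b)

# Synthetic check: with noise-free rays the translation is recovered exactly.
rng = np.random.default_rng(0)
joints = rng.normal(scale=0.05, size=(21, 3))   # ~5 cm hand, root-relative
t_true = np.array([0.10, -0.05, 0.40])          # hand 40 cm in front of the camera
rays = (joints + t_true) / np.linalg.norm(joints + t_true, axis=1, keepdims=True)
t_est = solve_translation_ray_space(joints, rays)
assert np.allclose(t_est, t_true, atol=1e-8)
```

In practice one would presumably weight joints by detection confidence and handle outliers robustly; the closed form above is only the noise-free core of such a solver.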