Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery

2026-03-16

Computer Vision and Pattern Recognition
AI summary

The authors improved a method called SAM 3D Body, which creates 3D models of humans from single images but is usually slow. Their new approach, Fast SAM 3D Body, makes this process much quicker without any extra training, by changing how the model processes image crops and simplifying the conversion from mesh shapes to joint parameters. This yields roughly a 10x speedup while matching or exceeding the original accuracy. They also showed it can drive real-time robot control from an ordinary video camera, unlike other methods that require special sensors worn on the body.

Monocular 3D human mesh recovery, SAM 3D Body, Transformer decoding, Architecture-aware pruning, SMPL model, Feedforward mapping, Joint-level kinematics, Teleoperation, RGB-based humanoid control, Real-time inference
Authors
Timing Yang, Sicheng He, Hongyi Jing, Jiawei Yang, Zhijian Liu, Chuhang Zou, Yue Wang
Abstract
SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that, unlike methods reliant on wearable IMUs, enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.
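To make the mesh-to-SMPL idea concrete: iterative fitting runs an optimizer for many steps per frame, whereas a feedforward mapping reduces the conversion to a single forward pass. The sketch below is purely illustrative and not the paper's actual method; the mapping matrix `W`, the function `mesh_to_smpl`, and the use of a plain linear map are all assumptions. It only shows why a direct mapping can be orders of magnitude faster than per-frame optimization, using standard SMPL dimensions (6890 vertices, 72 pose + 10 shape parameters).

```python
import numpy as np

# Standard SMPL dimensions: 6890 mesh vertices; 72 pose + 10 shape parameters.
N_VERTS, N_PARAMS = 6890, 82

# Hypothetical learned mapping. In practice such a map would be fit offline
# (e.g. by regression over paired mesh/parameter data); random values here
# are a stand-in so the sketch is runnable.
rng = np.random.default_rng(0)
W = rng.standard_normal((N_PARAMS, 3 * N_VERTS)) * 0.01

def mesh_to_smpl(vertices: np.ndarray) -> np.ndarray:
    """Direct feedforward mapping: flattened vertices -> SMPL parameters.

    One matrix-vector product replaces hundreds of optimizer iterations,
    which is the source of the large conversion speedup the abstract cites.
    """
    return W @ vertices.reshape(-1)

params = mesh_to_smpl(rng.standard_normal((N_VERTS, 3)))
pose, shape = params[:72], params[72:]
```

The design point is that the per-frame cost becomes a fixed, branch-free computation, so the conversion can keep up with a real-time RGB stream instead of dominating the latency budget.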