ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation

2026-03-09 · Computer Vision and Pattern Recognition

AI summary

The authors aim to improve single-stage multi-person pose estimation, which detects every person in an image and their body keypoints in one pass. They find that existing methods inherit bounding-box supervision from object detection, which biases training and reduces pose accuracy. To fix this, they propose an approach that predicts keypoints directly without bounding boxes, aligning training with the pose evaluation metric and allowing NMS-free inference. Their model, ER-Pose, improves AP over the YOLO-Pose baseline on MS COCO and CrowdPose while using fewer parameters and running more efficiently.

multi-person pose estimation, single-stage detection, keypoint prediction, bounding box, YOLO, sample assignment, non-maximum suppression (NMS), OKS loss, MS COCO, CrowdPose
Authors
Nanjun Li, Pinqi Cheng, Zean Liu, Minghe Tian, Xuanyin Wang
Abstract
Single-stage multi-person pose estimation aims to jointly perform human localization and keypoint prediction within a unified framework, offering advantages in inference efficiency and architectural simplicity. Consequently, multi-scale real-time detection architectures, such as YOLO-like models, are widely adopted for real-time pose estimation. However, these approaches typically inherit a box-driven modeling paradigm from object detection, in which pose estimation is implicitly constrained by bounding-box supervision during training. This formulation introduces biases in sample assignment and feature representation, resulting in task misalignment and ultimately limiting pose estimation accuracy. In this work, we revisit box-driven single-stage pose estimation from a keypoint-driven perspective and identify semantic conflicts among parallel objectives as a key source of performance degradation. To address this issue, we propose a keypoint-driven learning paradigm that elevates pose estimation to a primary prediction objective. Specifically, we remove bounding-box prediction and redesign the prediction head to better accommodate the high-dimensional structured representations required for pose estimation. We further introduce a keypoint-driven dynamic sample assignment strategy to align training objectives with pose evaluation metrics, enabling dense supervision during training and efficient NMS-free inference. In addition, we propose a smooth OKS-based loss function to stabilize optimization in regression-based pose estimation. Based on these designs, we develop a single-stage multi-person pose estimation framework, termed ER-Pose. Compared with the YOLO-Pose baseline, ER-Pose-n improves AP on MS COCO/CrowdPose by 3.2/6.7 without pre-training and by 7.4/4.9 with pre-training. These improvements are achieved with fewer parameters and higher inference efficiency.
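
The abstract refers to OKS-based evaluation and an OKS-based loss without stating their form. For orientation, here is a minimal Python sketch of the standard COCO-style Object Keypoint Similarity (OKS) and the plain `1 - OKS` loss it induces. This is not the smooth variant proposed by the authors, whose definition is not given in this text. The function names `oks` and `oks_loss` are illustrative assumptions, and the per-keypoint constants are those of the COCO 17-keypoint evaluation.

```python
import math

# Per-keypoint falloff constants for the COCO 17-keypoint layout
# (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles).
COCO_SIGMAS = [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
               0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089]


def oks(pred, gt, visibility, area, sigmas=COCO_SIGMAS, eps=1e-9):
    """Standard OKS between one predicted pose and one ground-truth pose.

    pred, gt:    sequences of (x, y) keypoint coordinates in pixels
    visibility:  per-keypoint flags; only keypoints with flag > 0 are scored
    area:        ground-truth object area in square pixels (scale term s^2)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k2 = (2.0 * sigma) ** 2  # COCO uses k_i = 2 * sigma_i
        total += math.exp(-d2 / (2.0 * (area + eps) * k2))
        count += 1
    return total / count if count else 0.0


def oks_loss(pred, gt, visibility, area):
    """Plain (non-smoothed) OKS loss: 0 for a perfect pose, approaching 1 as error grows."""
    return 1.0 - oks(pred, gt, visibility, area)
```

Because each term decays exponentially with squared distance, the gradient of this plain loss becomes very small for badly misplaced keypoints, which is one plausible reason a smoothed formulation could help stabilize regression training.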