6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks
2026-05-08 • Computer Vision and Pattern Recognition
Subjects: Computer Vision and Pattern Recognition; Robotics
AI summary
The authors present a system for estimating the exact 3D position and rotation (6D pose) of objects in images. Their method first detects objects with YOLOv10m, then locates keypoints on each object using a ResNet18-based network that regresses 2D heatmaps. These keypoints are used to compute the object's 3D pose with the well-known PnP RANSAC algorithm. The authors further improve their results by fusing color images with depth information and by experimenting with training tweaks such as alternative activation functions and learning rate schedules. Their approach performs well on the LINEMOD dataset, with RGB-D fusion giving the best accuracy.
6D pose estimation, keypoint heatmap regression, YOLOv10m, ResNet18, PnP RANSAC, RGB-D fusion, LINEMOD dataset, activation functions, learning rate scheduling
Authors
Ismail Aljosevic, Amir Masoud Almasi, Ana Parovic, Ashkan Shafiei
Abstract
In this paper, we propose a modular framework for 6D pose estimation based on keypoint heatmap regression. Our approach combines YOLOv10m for object detection with a ResNet18-based network that predicts 2D heatmaps from RGB images. Keypoints extracted from these heatmaps are used to estimate the 6D object pose via the PnP RANSAC algorithm. We compare different keypoint selection strategies to assess their impact on pose accuracy. Additionally, we extend the baseline by incorporating depth data using a cross-fusion architecture, which enables interaction between RGB and depth features at multiple stages. We further explore general training improvements, such as experimenting with activation functions and learning rate scheduling strategies to improve model performance. Our best RGB-only model achieved a mean ADD-based accuracy of 84.50%, while the RGB-D fusion model reached 92.41% on the LINEMOD dataset. The code is available at https://github.com/ameermasood/HeatNet.
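The pipeline's central step, turning predicted 2D heatmaps into keypoint coordinates that can be fed to PnP RANSAC, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_keypoints` and the per-channel argmax decoding are assumptions, and the OpenCV call shown in the comments requires the object's 3D model keypoints and camera intrinsics, which depend on the dataset.

```python
import numpy as np

def extract_keypoints(heatmaps):
    """Decode 2D keypoints from predicted heatmaps (illustrative sketch).

    heatmaps: array of shape (K, H, W), one channel per keypoint.
    Returns a (K, 2) float array of (x, y) pixel coordinates and a
    (K,) array of peak confidences.
    """
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    peak_idx = flat.argmax(axis=1)              # index of the hottest pixel per channel
    ys, xs = np.unravel_index(peak_idx, (H, W)) # convert flat index back to (row, col)
    conf = flat.max(axis=1)                     # peak value as a confidence score
    return np.stack([xs, ys], axis=1).astype(np.float32), conf

# The decoded 2D points, paired with the corresponding 3D keypoints on the
# object model and the camera intrinsics K_cam, would then yield the 6D pose via:
#   ok, rvec, tvec, inliers = cv2.solvePnPRansac(points_3d, points_2d, K_cam, dist_coeffs)
```

In practice, sub-pixel refinement around each peak (e.g. a weighted average over a local window) is a common extension to the plain argmax shown here.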