G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing

2026-05-26 · Computer Vision and Pattern Recognition

AI summary

The authors argue that predicting 3D geometry in the camera's own coordinate frame is not always the best choice. Instead, they predict pointmaps in an upright frame aligned with gravity, so the 'up' direction is the same regardless of camera angle. Their model, the Gravity Grounded Geometry Transformer (G3T), is fine-tuned from existing models to predict upright pointmaps and each camera's orientation relative to gravity. They also introduce G3T-Long, a submap-based pipeline that builds 3D reconstructions incrementally and exploits the reduced rotational freedom of upright frames to improve reconstruction accuracy.

3D reconstruction, pointmaps, camera-centric coordinate frame, gravity-aligned coordinate frame, rotational degrees of freedom, Gravity Grounded Geometry Transformer (G3T), camera-to-gravity pose, incremental 3D reconstruction, structural cues, submap-based reconstruction
Authors
Bharath Raj Nagoor Kani, Noah Snavely
Abstract
Modern feed-forward 3D reconstruction methods like VGGT predict pixel-aligned pointmaps in camera-centric coordinate frames. However, this choice of coordinate frame is not always optimal. We propose instead to predict pointmaps in upright, gravity-aligned frames that exploit strong structural cues present in many real-world scenes. Unlike camera-centric frames, gravity-aligned frames share a common vertical axis across viewpoints, reducing the rotational degrees of freedom needed to relate pointmaps to one another. To this end, we introduce the Gravity Grounded Geometry Transformer (G3T), fine-tuned from existing models on gravity-aligned 3D data. G3T produces highly accurate gravity-aware predictions, including upright pointmaps and camera-to-gravity poses. We further introduce G3T-Long, a submap-based incremental 3D reconstruction pipeline that leverages the reduced rotational degrees of freedom afforded by upright frames to achieve significantly improved reconstruction accuracy.
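
To make the "reduced rotational degrees of freedom" claim concrete, here is a minimal sketch of why a shared vertical axis helps. It is an illustration only, not the authors' G3T-Long pipeline. It assumes two point sets already expressed in upright frames with +z as "up", known point correspondences, and equal scale, so the only unknowns are one yaw angle and a translation (4 degrees of freedom rather than 6). The function names `yaw_rotation` and `align_upright` are hypothetical.

```python
import numpy as np


def yaw_rotation(theta: float) -> np.ndarray:
    """Rotation by `theta` radians about the +z (up) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])


def align_upright(src: np.ndarray, dst: np.ndarray):
    """Least-squares yaw + translation with dst ~= R_z(theta) @ src + t.

    Assumptions (not taken from the paper): `src` and `dst` are (N, 3)
    arrays of corresponding points, both in gravity-aligned frames whose
    z axis points up, and both at the same scale.
    """
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)

    # Because the vertical axis is shared, only the horizontal (x, y)
    # components carry information about the unknown rotation.
    p = (src - src_mean)[:, :2]
    q = (dst - dst_mean)[:, :2]

    # 2D Procrustes: the objective sum_i q_i . R(theta) p_i equals
    # cos(theta) * dot + sin(theta) * cross, maximized at atan2(cross, dot).
    dot = np.sum(p[:, 0] * q[:, 0] + p[:, 1] * q[:, 1])
    cross = np.sum(p[:, 0] * q[:, 1] - p[:, 1] * q[:, 0])
    theta = float(np.arctan2(cross, dot))

    rotation = yaw_rotation(theta)
    translation = dst_mean - rotation @ src_mean
    return theta, rotation, translation


if __name__ == "__main__":
    # Synthetic check: apply a known yaw and translation, then recover them.
    rng = np.random.default_rng(0)
    src = rng.normal(size=(100, 3))
    dst = src @ yaw_rotation(0.7).T + np.array([1.0, -2.0, 0.5])
    theta, rotation, translation = align_upright(src, dst)
    print("recovered yaw (rad):", theta)
    print("recovered translation:", translation)
```

With camera-centric frames the same alignment would need a full 3D rotation (typically solved with an SVD-based method), whereas here a single `arctan2` gives the rotation in closed form. A real pipeline would also have to handle scale differences and outliers, which this sketch deliberately ignores.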