CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation

2026-06-03

Computer Vision and Pattern Recognition, Robotics
AI summary

The authors address cross-view geo-localization: determining where a ground-level photo was taken by matching it against aerial imagery. They note that past methods either search large areas with limited precision or localize precisely but only within a small search area. Their new model, CIPER, combines both into a single system that searches city-scale areas and estimates a precise position and orientation from a shared feature encoder. CIPER uses transformers, a type of neural network, to bridge the appearance gap between ground and aerial images and to estimate location and orientation together. Their experiments on VIGOR, KITTI, and Ford Multi-AV show competitive performance, especially when the ground image has a limited field of view or an unknown orientation.

Cross-view geo-localization, Image retrieval, Pose estimation, Transformer encoder, 3-DoF regression, Bidirectional cross-attention, Global features, Spatial localization, VIGOR dataset, Domain gap
Authors
Yurim Jeon, Dongseong Seo, Seung-Woo Seo
Abstract
Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.
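To make the two outputs named in the abstract concrete, the following is a minimal sketch of what a retrieval result and a 3-DoF pose error might look like. It is not CIPER's network or the authors' evaluation code. It assumes that global descriptors are plain float vectors compared by cosine similarity, and that the 3-DoF pose is (x, y, yaw in degrees), a common convention that the abstract does not spell out. All function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, database, k=1):
    """Return indices of the k aerial descriptors most similar to the query.

    Assumption: the query is a ground-image descriptor and each database
    entry is an aerial-image descriptor in the same embedding space.
    """
    scored = [(cosine_similarity(query, desc), idx)
              for idx, desc in enumerate(database)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


def wrap_angle_deg(angle):
    """Wrap an angle in degrees into the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def pose_error(pred, gt):
    """Translation and orientation error between two (x, y, yaw_deg) poses.

    Assumption: x and y share one unit (e.g. meters); yaw is in degrees.
    """
    translation = math.hypot(pred[0] - gt[0], pred[1] - gt[1])
    rotation = abs(wrap_angle_deg(pred[2] - gt[2]))
    return translation, rotation


# Toy example with made-up descriptors and poses.
ground_descriptor = [1.0, 0.0, 0.0]
aerial_descriptors = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = retrieve_top_k(ground_descriptor, aerial_descriptors, k=2)
trans_err, rot_err = pose_error((3.0, 4.0, 350.0), (0.0, 0.0, 10.0))
```

In this toy setup the second aerial descriptor ranks first, and the yaw difference between 350 and 10 degrees wraps to 20 degrees rather than 340. Wrapping of this kind matters for any evaluation under arbitrary orientation.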