Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching

2026-04-10Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition
AI summary

The authors address the challenge of making stereo matching models trained on synthetic data work well on real-world images, especially in tricky areas like shiny or textureless surfaces. They introduce GREATEN, a method that uses surface normals—geometric information about object shapes that doesn’t change between domains—to improve robustness. Their system combines these normals with image features, uses special training tricks for shiny and transparent areas, and applies efficient attention mechanisms to handle difficult regions efficiently. Trained only on synthetic data, their approach significantly reduces errors on several real-world benchmarks and runs faster while supporting high-resolution images.

stereo matchingsynthetic-to-real generalizationsurface normalsnon-Lambertian surfacescontextual-geometric fusionspecular-transparent augmentationattention mechanismsdisparity estimationcross-domain shiftocclusion
Authors
Jiahao Li, Xinhong Chen, Zhengmin Jiang, Cheng Huang, Yung-Hui Li, Jianping Wang
Abstract
Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead, including Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.