Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
2026-04-30 • Computer Vision and Pattern Recognition
AI summary
The authors developed GenWildSplat, a method that reconstructs 3D scenes from a few unconstrained photos, even when the photos differ in lighting or contain transient objects blocking the view. Unlike older methods that need per-scene training, GenWildSplat works in a single feed-forward pass, predicting depth, camera poses, and 3D structure jointly. It also adapts the scene's appearance to different lighting conditions and uses semantic segmentation to ignore moving or transient objects. The authors showed that their method performs well on standard benchmarks, producing high-quality 3D renderings in real time.
3D reconstruction, sparse views, depth prediction, camera pose estimation, 3D Gaussians, appearance adaptation, semantic segmentation, curriculum learning, feed-forward network, real-time inference
Authors
Vinayak Gupta, Chih-Hao Lin, Shenlong Wang, Anand Bhattad, Jia-Bin Huang
Abstract
Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization using appearance embeddings or dynamic masks, which requires extensive per-scene training and fails under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across diverse illumination and occlusion patterns. Evaluations on the PhotoTourism and MegaScenes benchmarks demonstrate state-of-the-art feed-forward rendering quality, achieving real-time inference without test-time optimization.
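The abstract describes a single feed-forward pass: unposed images go in; depth, camera poses, and 3D Gaussians in a canonical space come out, with an appearance adapter for target lighting and a semantic mask for transient objects. The sketch below illustrates that interface only; every function, field name, and stand-in computation is hypothetical (the paper's actual networks are not public in this listing), with learned components replaced by trivial placeholders.

```python
# Hypothetical sketch of a GenWildSplat-style feed-forward pipeline.
# All names are illustrative, not the authors' API; learned networks are
# replaced by trivial stand-ins so the data flow and shapes are visible.
import numpy as np

def feed_forward_reconstruct(images, appearance_gain):
    """One pass: unposed images (n, h, w, 3) -> depth, poses, 3D Gaussians."""
    n, h, w, _ = images.shape

    # 1) Geometric backbone predicts per-pixel depth and per-image camera
    #    poses in a shared canonical space (stand-in: unit depth, identity poses).
    depths = np.ones((n, h, w))
    poses = np.stack([np.eye(4)] * n)

    # 2) Lift the reference view's pixels to 3D Gaussian centers using the
    #    predicted depth (stand-in: pixel grid scaled by depth).
    ys, xs = np.mgrid[0:h, 0:w]
    centers = np.stack([xs, ys, depths[0]], axis=-1).reshape(-1, 3).astype(float)

    # 3) Appearance adapter modulates Gaussian colors toward the target
    #    lighting condition (stand-in: a global gain on the reference colors).
    colors = images[0].reshape(-1, 3) * appearance_gain

    # 4) Semantic segmentation masks out transient objects so they do not
    #    corrupt the reconstruction (stand-in: nothing flagged as transient).
    transient = np.zeros(h * w, dtype=bool)
    keep = ~transient

    return {
        "depth": depths,
        "poses": poses,
        "gaussians": {"centers": centers[keep], "colors": colors[keep]},
    }
```

Because every prediction happens in one forward pass, rendering a new scene needs no test-time optimization; swapping the `appearance_gain` stand-in for a learned adapter would let the same Gaussians be re-shaded for different target lighting.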