VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
2026-02-26 • Computer Vision and Pattern Recognition
AI summary
The authors developed a new 3D reconstruction method that works much faster and uses less memory, especially when dealing with lots of images. They found that the main problem with older methods is how they handle the scene’s geometry using varying-sized data, so they converted this to a fixed-size neural network model trained during testing. Their approach, called VGG-T3, processes images in a way that scales linearly with the number of views, allowing them to reconstruct 3D scenes from 1,000 images in under a minute. It also produces more accurate 3D point maps compared to other fast methods and can recognize locations in a scene using new images not seen during training.
3D reconstruction, Multi-Layer Perceptron, test-time training, Key-Value representation, softmax attention, scene geometry, visual localization, linear scaling, offline feed-forward methods, global scene aggregation
Authors
Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep
Abstract
We present a scalable 3D reconstruction model that addresses a critical limitation of offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, its point map reconstruction error undercuts that of other linear-time methods by large margins. Finally, we demonstrate the visual localization capabilities of our model by querying the scene representation with unseen images.
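To make the core idea concrete, here is a minimal NumPy sketch (not the authors' implementation; all function names, dimensions, and the single-hidden-layer MLP are illustrative assumptions) of distilling a growing softmax-attention KV cache into a fixed-size MLP via test-time gradient steps. The KV cache grows with the number of view tokens, while the MLP's parameter count stays constant:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # token dimension (illustrative)

def softmax_attention(q, K, V):
    # Standard softmax attention over a growing KV cache:
    # memory grows with the number of tokens, and attending every
    # token to every other token costs quadratic compute overall.
    s = q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

# Hypothetical fixed-size "scene MLP": one hidden layer whose
# parameter count does not depend on the number of views.
W1 = rng.normal(0.0, 0.1, (d, 64))
W2 = rng.normal(0.0, 0.1, (64, d))

def ttt_step(q, target, lr=0.05):
    """One test-time-training step: nudge the MLP (MSE loss) so that
    it reproduces the attention output `target` for query `q`."""
    global W1, W2
    h = np.maximum(q @ W1, 0.0)       # ReLU hidden activations
    pred = h @ W2
    err = pred - target               # dL/dpred (up to a factor of 2)
    gW2 = np.outer(h, err)            # backprop through second layer
    gh = err @ W2.T
    gh[h <= 0] = 0.0                  # ReLU gradient mask
    gW1 = np.outer(q, gh)             # backprop through first layer
    W1 -= lr * gW1
    W2 -= lr * gW2

# Stream of "view tokens": the KV cache grows, the MLP does not.
K = np.zeros((0, d)); V = np.zeros((0, d))
for _ in range(200):
    k, v = rng.normal(size=d), rng.normal(size=d)
    K = np.vstack([K, k]); V = np.vstack([V, v])
    target = softmax_attention(k, K, V)  # teacher: full attention
    for _ in range(5):
        ttt_step(k, target)              # student: fixed-size MLP

print("KV cache entries:", len(K))                 # grows with input
print("MLP parameters:  ", W1.size + W2.size)      # stays fixed
```

Answering a query then costs a single MLP forward pass instead of attending over every stored view token, which is the source of the linear scaling w.r.t. the number of input views claimed in the abstract.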