tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

2026-02-23
Computer Vision and Pattern Recognition

AI summary

The authors introduce tttLRM, a 3D reconstruction model that uses a Test-Time Training (TTT) layer to handle long input sequences efficiently, with computational cost that grows only linearly in sequence length. The model compresses multiple images into a compact implicit representation that can be decoded into explicit 3D formats, such as Gaussian Splats, for practical use. Because the TTT layer updates online, the model can progressively refine its 3D predictions as more images stream in. The authors also show that pretraining on novel view synthesis helps the model converge faster and produce better 3D reconstructions. Experiments confirm that tttLRM outperforms current methods on both single objects and entire scenes.

3D reconstruction, Test-Time Training, autoregressive model, latent space, Gaussian Splats, novel view synthesis, progressive refinement, linear computational complexity, pretraining
Authors
Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, Yiwei Hu
Abstract
We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model's capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.
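The core mechanism the abstract describes, compressing streaming observations into the "fast weights" of a TTT layer via online updates, can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the simplest linear fast-weight form (a single weight matrix trained at test time with a self-supervised reconstruction loss, one gradient step per token), which is enough to show why the cost is linear in sequence length and why the representation refines progressively as observations arrive.

```python
import numpy as np

def ttt_layer(tokens, dim, lr=0.1):
    """Minimal linear TTT-layer sketch (assumed form, not the paper's).

    The fast weights W act as a compressed memory of all tokens seen
    so far; each incoming token triggers one inner gradient step on a
    self-supervised reconstruction loss 0.5 * ||W x - x||^2.
    """
    W = np.zeros((dim, dim))            # fast weights: the implicit "state"
    outputs = []
    for x in tokens:                    # stream observations one at a time
        pred = W @ x                    # read with current memory
        grad = np.outer(pred - x, x)    # gradient of the inner loss w.r.t. W
        W -= lr * grad                  # one step: memory absorbs this token
        outputs.append(W @ x)           # read out with the updated weights
    return np.array(outputs), W

# Feeding the same observation repeatedly, the fast weights converge to
# reconstructing it -- a toy analogue of progressive refinement from a stream.
outs, W = ttt_layer([np.ones(4)] * 50, dim=4)
```

Each token costs a fixed number of matrix operations regardless of how many tokens came before, so total work is O(n) in sequence length, in contrast to the O(n^2) attention of a standard transformer; in tttLRM this state would then be decoded into explicit formats such as Gaussian Splats.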