GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation
2026-03-27 • Computer Vision and Pattern Recognition
AI summary
The authors propose a new way to create 3D scenes called GaussianGPT, which uses a transformer model to build scenes piece by piece. Instead of refining the whole scene at once like other methods, their approach generates small parts step-by-step, making it easier to complete or extend scenes and control the output. They compress 3D shapes into codes and use a special transformer to predict the next part, capturing both structure and appearance. This method works well with existing 3D rendering tools and offers a different option for generating 3D content.
Keywords
3D generative modeling, autoregressive modeling, transformers, Gaussian primitives, vector quantization, 3D convolutional autoencoder, rotary positional embedding, neural rendering, diffusion models, causal transformer
Authors
Nicolas von Lützow, Barbara Rössle, Katharina Schmid, Matthias Nießner
Abstract
Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled using a causal transformer with 3D rotary positional embedding, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods that refine scenes holistically, our formulation constructs scenes step-by-step, naturally supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This formulation leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.
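The two core ingredients named in the abstract — vector quantization of a latent grid into discrete tokens, and a rotary positional embedding extended to 3D coordinates — can be sketched in a few lines. This is an illustrative sketch only: the function names, feature dimensions, and codebook below are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Nearest-neighbour codebook lookup, as in a standard VQ-VAE.
    latents: (N, D) continuous vectors; codebook: (K, D) entries.
    Returns discrete token indices (N,) and the quantized vectors (N, D)."""
    # squared L2 distance from every latent to every codebook entry
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

def rope_3d(x, coords, base=10000.0):
    """Rotary positional embedding generalized to 3D grid coordinates
    (one plausible construction; the paper's exact variant may differ).
    x: (N, D) token features with D divisible by 6 (2 dims per pair, 3 axes);
    coords: (N, 3) integer grid positions of the serialized tokens.
    Each axis rotates its own third of the feature dimensions by angles
    proportional to that axis's coordinate."""
    N, D = x.shape
    per_axis = D // 3              # feature dims allotted to each spatial axis
    half = per_axis // 2           # rotations act on pairs of dimensions
    out = np.empty_like(x)
    for a in range(3):
        seg = x[:, a * per_axis:(a + 1) * per_axis]
        freqs = base ** (-np.arange(half) / half)      # (half,) frequencies
        theta = coords[:, a:a + 1] * freqs[None, :]    # (N, half) angles
        cos, sin = np.cos(theta), np.sin(theta)
        x1, x2 = seg[:, :half], seg[:, half:]
        out[:, a * per_axis:(a + 1) * per_axis] = np.concatenate(
            [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return out
```

After quantization, each occupied grid cell contributes one token index; serializing the cells in a fixed order yields the sequence a causal transformer can model with next-token prediction, while `rope_3d` keeps each token aware of its original spatial position.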