FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

2026-06-23 • Computer Vision and Pattern Recognition

AI summary

The authors explore generating 3D scenes from a single image using video diffusion models, whose latent space already encodes multi-view geometry. They observe that existing feedforward decoders typically output volumetric 3D Gaussians, which lack a well-defined surface and are therefore hard to use in games or simulations. Their new method, called FLAT, decodes flat triangle splats (surface primitives) directly from the model’s latent space; these represent real surfaces better but are harder to predict because they are sensitive to primitive orientation. FLAT handles this with a ray-centered rotation parameterization for the triangles and a product window function that improves gradient flow during training. The approach produces more accurate 3D geometry, and a lightweight refinement step makes the result usable in real-time graphics engines.

video diffusion models, 3D scene generation, volumetric 3D Gaussians, surface primitives, triangle splatting, ray-centered rotation parameterization, differentiable rendering, latent space, geometry accuracy, real-time rendering

Authors

Orest Kupyn, Goutam Bhat, Philipp Henzler, Fabian Manhardt, Christian Rupprecht, Federico Tombari

Abstract

Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, limiting their use in simulation or standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass. To this end, we introduce FLAT and, for the first time, show that triangle splats can be decoded directly from video diffusion latents. Compared with decoding 3D Gaussians, predicting flat primitives is notoriously more challenging due to high sensitivity to primitive orientations, which often leads to poor gradient flow. FLAT addresses this with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy while maintaining competitive visual quality compared to state-of-the-art feedforward baselines. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at https://flat-splat.github.io
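
The abstract does not define the product window function, so the following is only an illustrative sketch of why a product over edges might help gradient flow, not the authors' formulation. It assumes a 2D screen-space triangle with counter-clockwise vertices and a soft per-edge indicator (a sigmoid of the signed edge distance); the names `edge_signed_distances`, `min_window`, `product_window` and the sharpness parameter `tau` are all hypothetical. A window driven by the single closest edge responds to that edge only, whereas a product of per-edge terms depends on all three edges at once.

```python
import math


def edge_signed_distances(p, tri):
    """Signed distance from 2D point p to each edge line of a CCW triangle.

    Positive values mean p lies on the interior side of that edge.
    """
    dists = []
    for i in range(3):
        (ax, ay), (bx, by) = tri[i], tri[(i + 1) % 3]
        ex, ey = bx - ax, by - ay
        length = math.hypot(ex, ey)
        # 2D cross product of the edge with (p - a); positive when p is left of the edge.
        cross = ex * (p[1] - ay) - ey * (p[0] - ax)
        dists.append(cross / length)
    return dists


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def min_window(p, tri, tau=0.05):
    """Hypothetical baseline: the window depends only on the closest edge."""
    return sigmoid(min(edge_signed_distances(p, tri)) / tau)


def product_window(p, tri, tau=0.05):
    """Hypothetical product form: every edge contributes a soft indicator."""
    w = 1.0
    for d in edge_signed_distances(p, tri):
        w *= sigmoid(d / tau)
    return w


# Counter-clockwise unit right triangle (an assumed toy example).
TRI = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

if __name__ == "__main__":
    for point in [(0.25, 0.25), (0.3, 0.1), (0.2, 0.1), (1.0, 1.0)]:
        print(point, min_window(point, TRI), product_window(point, TRI))
```

In this toy setup, sliding a point parallel to its nearest edge leaves `min_window` unchanged while `product_window` still varies, which is the kind of extra signal a product form could give an optimizer. Whether FLAT's window function behaves this way is an assumption here.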
