Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting
2026-04-10 • Computer Vision and Pattern Recognition
AI summary
The authors improve 3D scene understanding by teaching a model to recognize objects consistently across different scenes, without extra mask pre- or post-processing or explicit multi-view alignment. They build on a pre-trained module called GOCL to learn a shared 'codebook' of object features that works across many scenes. This approach avoids retraining or fine-tuning for each new scene and helps identify objects directly in the 3D representation. According to the authors, the method yields more structured object representations that are useful for tasks such as robotic interaction and scene understanding.
3D scene understanding, visual foundation models, radiance fields, 3D Gaussian Splatting, slot attention, object-centric learning, unsupervised learning, object segmentation, multi-view alignment, codebook
Authors
Tsuheng Hsu, Guiyu Liu, Juho Kannala, Janne Heikkilä
Abstract
Recent work on 3D scene understanding leverages 2D masks from visual foundation models (VFMs) to supervise radiance fields, enabling instance-level 3D segmentation. However, the supervision signals from foundation models are not fundamentally object-centric and often require additional mask pre-/post-processing or specialized training and loss design to resolve mask identity conflicts across views. Moreover, the learned identities are scene-dependent, which limits generalizability across scenes. We therefore propose a dataset-level, object-centric supervision scheme to learn object representations in 3D Gaussian Splatting (3DGS). Building on a pre-trained slot attention-based Global Object Centric Learning (GOCL) module, we learn a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. By coupling the codebook with the module's unsupervised object masks, we can directly supervise the identity features of 3D Gaussians without additional mask pre-/post-processing or explicit multi-view alignment. The learned scene-agnostic codebook enables object supervision and identification without per-scene fine-tuning or retraining. Our method thus introduces unsupervised object-centric learning (OCL) into 3DGS, yielding more structured representations and better cross-scene generalization for downstream tasks such as robotic interaction and scene understanding.
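As a rough illustration of the supervision idea described in the abstract, the NumPy sketch below maps per-view object masks to codebook entries and scores rendered identity features against them. Everything in it is an assumption made for illustration: the function names, the slot-to-codebook index array `slot_to_code`, the cosine loss, and the nearest-entry lookup are hypothetical and are not taken from the paper, whose abstract does not specify these details.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def codebook_targets(masks, slot_to_code, codebook):
    """Build per-pixel target features from object masks and a shared codebook.

    masks:        (K, H, W) per-slot object masks for one view (assumed input)
    slot_to_code: (K,) integer codebook index for each slot (assumed input)
    codebook:     (C, D) scene-agnostic object embeddings (assumed input)
    """
    slot_per_pixel = masks.argmax(axis=0)          # (H, W) winning slot per pixel
    code_per_pixel = slot_to_code[slot_per_pixel]  # (H, W) codebook index per pixel
    return codebook[code_per_pixel], code_per_pixel


def identity_loss(rendered, targets):
    """Mean (1 - cosine similarity) between rendered and target features.

    This loss is an illustrative choice, not the loss used in the paper.
    """
    cos = np.sum(l2_normalize(rendered) * l2_normalize(targets), axis=-1)
    return float(np.mean(1.0 - cos))


def identify(features, codebook):
    """Assign each feature vector (N, D) to its most similar codebook entry."""
    sim = l2_normalize(features) @ l2_normalize(codebook).T  # (N, C)
    return sim.argmax(axis=-1)


if __name__ == "__main__":
    # Toy data: 3 codebook entries, a 2x2 view with two slots (left / right column).
    codebook = np.eye(3)
    masks = np.array([[[1.0, 0.0], [1.0, 0.0]],
                      [[0.0, 1.0], [0.0, 1.0]]])
    slot_to_code = np.array([2, 0])

    targets, ids = codebook_targets(masks, slot_to_code, codebook)
    rendered = targets + 0.1  # stand-in for features rendered from 3D Gaussians
    print("per-pixel codebook ids:", ids.tolist())
    print("identity loss:", identity_loss(rendered, targets))
    print("identified entries:", identify(rendered.reshape(-1, 3), codebook).tolist())
```

Because the targets in this sketch come from one shared codebook rather than from per-view mask IDs, the same object would receive the same target in every view, which is one way to read the abstract's claim that no explicit multi-view alignment is needed.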