Learning a Particle Dynamics Model with Real-world Videos
2026-05-22 • Computer Vision and Pattern Recognition
AI summary
The authors developed a way to teach computers how objects move by watching real videos instead of relying on synthetic simulation data. They represent a scene as particles derived from 3D Gaussians, each with a position, an orientation, and a size, and train a model to predict how those particles move and rotate over time. The model learns by comparing rendered images with real video frames, so it does not need labels for each particle's position. They also collected a dataset of about 500 real-world videos to train and test their approach. The goal is to make learned physics simulation models work better in real-world settings.
physics simulation, world models, particle-based dynamics, Gaussian splatting, differentiable rendering, real-world videos, sim-to-real gap, neural object dynamics, unsupervised learning, point clouds
Authors
Chanho Kim, Suhas V. Sumukh, Li Fuxin
Abstract
Data-driven learning approaches for physics simulation, sometimes referred to as world models, have emerged as promising alternatives to traditional physics simulators due to their differentiable nature. Prior work has demonstrated impressive results in predicting the motions of rigid and non-rigid objects in complex scenes involving multiple interacting bodies. However, these models are typically trained in simulated environments because obtaining perfect state information such as complete scene point clouds and point correspondences over time is challenging in real-world settings. This reliance on synthetic data can limit their applicability when the sim-to-real gap is large. In this work, we aim to overcome these limitations by introducing a novel framework for training neural object dynamics models directly from unlabeled real-world videos. Specifically, we propose to learn a particle-based dynamics model compatible with a Gaussian splatting framework, which operates on dense particles derived from Gaussians (i.e., particles with scales and rotations) and predicts their position and rotation changes over time. The model is trained via rendering supervision, enabling learning from real-world videos without requiring particle-level labeled states. Our model operates directly on dense Gaussians without relying on heuristically subsampled anchor points. To enable this study, we also present a real-world dataset consisting of about 500 videos capturing diverse object interactions.
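To make the state update described in the abstract concrete, the following is a minimal sketch of how a predicted position change and rotation change might be applied to one Gaussian particle, together with a simple image-space loss of the kind rendering supervision could use. It is not the authors' implementation. The names `GaussianParticle`, `step`, and `mse` are hypothetical, the quaternion convention `(w, x, y, z)` is an assumption, holding the scale fixed is an assumption based on the abstract mentioning only position and rotation changes, and the mean squared error stands in for whatever rendering loss the paper actually uses. The dynamics network and the differentiable renderer are omitted entirely.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # assumed convention: (w, x, y, z)


@dataclass
class GaussianParticle:
    """Hypothetical container for one particle derived from a 3D Gaussian."""
    position: Vec3
    rotation: Quat  # unit quaternion
    scale: Vec3     # per-axis extent; held fixed in this sketch (assumption)


def quat_multiply(a: Quat, b: Quat) -> Quat:
    """Hamilton product a * b."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def quat_normalize(q: Quat) -> Quat:
    """Rescale q to unit length so it remains a valid rotation."""
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return (w, x, y, z)


def step(p: GaussianParticle, delta_position: Vec3, delta_rotation: Quat) -> GaussianParticle:
    """Apply a predicted position change and rotation change to one particle.

    In a full system the two deltas would come from a learned dynamics model;
    here they are plain inputs.
    """
    x, y, z = (c + d for c, d in zip(p.position, delta_position))
    # Left-multiply so the predicted change is applied on top of the current orientation.
    new_rotation = quat_normalize(quat_multiply(delta_rotation, p.rotation))
    return GaussianParticle((x, y, z), new_rotation, p.scale)


def mse(rendered: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two flattened images of equal length.

    Placeholder for a rendering loss: the updated Gaussians would be rendered
    and compared against the observed video frame.
    """
    if len(rendered) != len(target):
        raise ValueError("images must have the same number of pixels")
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(rendered)
```

In a training loop built along these lines, `step` would be applied to every particle at each time step, the resulting Gaussians would be rendered with a differentiable splatting renderer, and the loss against the real frame would be backpropagated to the dynamics model, so no particle-level labels are needed.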