Human-level 3D shape perception emerges from multi-view learning

2026-02-19

Computer Vision and Pattern Recognition
AI summary

The authors created a new kind of neural network that learns to understand 3D shapes just by looking at multiple pictures of a scene taken from different viewpoints, without being given any information about the objects themselves. The model learns to predict where the cameras were placed and how far away things in the scene are, much like the visual cues humans rely on. When tested on a 3D perception task without any task-specific training, the model matched human accuracy and even predicted fine details of human behavior, such as error patterns and reaction times. This work shows that human-like 3D understanding can emerge from learning simple visual-spatial information from natural scenes. The authors also share all their data, code, and stimuli so others can reproduce the experiments.

3D perception, neural networks, visual-spatial learning, multi-view images, depth estimation, camera localization, zero-shot evaluation, human behavior modeling, visual intelligence, naturalistic data
Authors
Tyler Bonnen, Jitendra Malik, Angjoo Kanazawa
Abstract
Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these 'multi-view' models on a well-established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.
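To make the kind of visual-spatial objective described above concrete, the following is a minimal, hypothetical sketch in PyTorch: a shared image encoder is supervised only with relative camera pose and depth targets derived from multi-view captures of natural scenes, never with object labels. The class names, head sizes, scalar depth summary, and loss weighting are illustrative assumptions for exposition, not the authors' actual architecture or training code.

```python
import torch
import torch.nn as nn

class MultiViewSpatialModel(nn.Module):
    """Illustrative sketch: encode two views of a scene, then predict
    their relative camera pose and a coarse depth summary for one view."""

    def __init__(self, feat_dim=512):
        super().__init__()
        # Any image backbone could sit here; a tiny CNN keeps the sketch self-contained.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Relative camera pose between the two views: 3 translation + 3 rotation parameters.
        self.pose_head = nn.Linear(2 * feat_dim, 6)
        # A single scalar depth summary per image; a real model would predict a dense depth map.
        self.depth_head = nn.Linear(feat_dim, 1)

    def forward(self, view_a, view_b):
        fa, fb = self.encoder(view_a), self.encoder(view_b)
        rel_pose = self.pose_head(torch.cat([fa, fb], dim=-1))
        depth_a = self.depth_head(fa)
        return rel_pose, depth_a


def spatial_loss(model, view_a, view_b, gt_rel_pose, gt_depth_a):
    # Supervision comes only from visual-spatial signals (camera location, depth),
    # never from object identities or category labels.
    rel_pose, depth_a = model(view_a, view_b)
    pose_loss = nn.functional.mse_loss(rel_pose, gt_rel_pose)
    depth_loss = nn.functional.l1_loss(depth_a, gt_depth_a)
    return pose_loss + depth_loss


# Example forward/backward pass on random data (batch of 4 image pairs).
model = MultiViewSpatialModel()
a, b = torch.randn(4, 3, 128, 128), torch.randn(4, 3, 128, 128)
loss = spatial_loss(model, a, b, torch.randn(4, 6), torch.rand(4, 1))
loss.backward()
```

A zero-shot evaluation in the spirit of the abstract would then read out the frozen encoder's responses to experimental stimuli (e.g., comparing feature similarity across candidate views) rather than fine-tuning the network on the 3D perception task itself.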