Modeling Subjective Urban Perception with Human Gaze
2026-05-01 • Computer Vision and Pattern Recognition • Human-Computer Interaction
AI summary
The authors introduce a new dataset called Place Pulse-Gaze that pairs street view images with eye-tracking data and people's perception ratings of urban places. They study how gaze behavior helps predict how people perceive city scenes. Their experiments show that eye movements alone carry signal about how a place feels, and that combining gaze with richer image information improves predictions. This work suggests that understanding how people look at cities is important for better modeling of urban perception.
urban perception · eye-tracking · street view images · Place Pulse-Gaze dataset · gaze behavior · semantic scene representation · visual representation · multimodal learning · subjective perception · urban computing
Authors
Lin Che, Xi Wang, Marc Pollefeys, Konrad Schindler, Martin Raubal, Peter Kiefer
Abstract
Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computational approaches primarily model urban perception directly from street view images, but largely ignore the human perceptual process through which such judgments are formed. In this paper, we introduce Place Pulse-Gaze, an urban perception dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels. Based on this dataset, we propose a Gaze-Guided Urban Perception Framework to study how gaze behavior contributes to the modeling of subjective urban perception. The framework systematically investigates three complementary settings: gaze-only modeling, gaze fusion with explicit semantic scene representations, and gaze fusion with implicit, richer visual representations. Experiments show that gaze alone already carries useful predictive signals for subjective urban perception, and that integrating gaze with scene representations further improves prediction under both explicit semantic and implicit visual representations. Overall, our findings highlight the importance of incorporating human perceptual processes into urban scene understanding and open a direction for gaze-guided multimodal urban computing.
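To make the three settings concrete, the sketch below shows one plausible way to implement gaze-only prediction and gaze-scene fusion. It is an illustrative assumption, not the authors' architecture: the GRU scanpath encoder, the feature dimensions, and the concatenation-based late fusion are all hypothetical choices.

# Minimal sketch (assumptions only, not the paper's implementation) of the
# framework's three settings: gaze-only, gaze + explicit semantic features,
# and gaze + implicit visual features.
import torch
import torch.nn as nn

class GazeEncoder(nn.Module):
    """Encodes a fixation sequence (x, y, duration) into a fixed-size vector."""
    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRU(input_size=3, hidden_size=hidden_dim, batch_first=True)

    def forward(self, fixations: torch.Tensor) -> torch.Tensor:
        # fixations: (batch, num_fixations, 3)
        _, h = self.rnn(fixations)
        return h[-1]  # (batch, hidden_dim)

class GazePerceptionModel(nn.Module):
    """scene_dim = 0   -> Setting 1: gaze-only modeling.
    scene_dim = K   -> Setting 2: explicit semantic features,
                       e.g. K segmentation class ratios (hypothetical).
    scene_dim = 768 -> Setting 3: implicit visual features,
                       e.g. a ViT image embedding (hypothetical)."""
    def __init__(self, scene_dim: int = 0, hidden_dim: int = 64):
        super().__init__()
        self.gaze_encoder = GazeEncoder(hidden_dim)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim + scene_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # scalar score for one perception dimension
        )

    def forward(self, fixations, scene_feats=None):
        z = self.gaze_encoder(fixations)
        if scene_feats is not None:
            # Late fusion by concatenation (an assumption for illustration).
            z = torch.cat([z, scene_feats], dim=-1)
        return self.head(z).squeeze(-1)

# Toy usage: a batch of 4 scanpaths with 20 fixations each, fused with a
# hypothetical 768-d image embedding (Setting 3).
model = GazePerceptionModel(scene_dim=768)
score = model(torch.randn(4, 20, 3), torch.randn(4, 768))
print(score.shape)  # torch.Size([4])

The same module covers all three settings by varying scene_dim, which mirrors the abstract's framing of gaze fusion as a complement to, rather than a replacement for, image-based representations.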