Policy-based Foveated Imaging and Perception
2026-06-01 • Computer Vision and Pattern Recognition
AI summary
The authors developed a smart camera system that decides in real time which parts of an ultra-high-resolution image to capture in full detail, focusing on task-relevant areas while keeping the rest at lower resolution. Because the sensor does not read out every pixel at full quality, the approach reduces bandwidth, latency, and power use. The method uses past observations to predict where to capture detail next, which improves performance on visual perception tasks. The authors tested it in simulation and on a real 200-megapixel dual-stream sensor, and report that it outperforms baselines operating at the same bandwidth and remains practical under tight bandwidth and latency limits.
ultra-high-resolution sensors, foveated imaging, task-aware acquisition, dual-stream sensor architecture, pixel bandwidth, attention policy learning, real-time image processing, spatial downsampling, latency constraints, perception-acquisition loop
Authors
Howard Xiao, Jan Ackermann, Boyang Deng, Gordon Wetzstein
Abstract
Ultra-high-resolution image sensors offer the potential to capture fine spatial details critical for many visual perception tasks, but acquiring and processing all pixels at full resolution is often infeasible under realistic bandwidth, latency, and power constraints. Existing approaches address this challenge through acquisition strategies such as spatial or temporal downsampling, which irrevocably discard information before task relevance can be assessed. In this work, we introduce a real-time, predictive, and task-aware foveated imaging system that operates directly at image acquisition time. Leveraging emerging dual-stream sensor architectures, our method dynamically allocates limited pixel bandwidth to task-relevant regions of interest while maintaining a low-resolution global context. We formulate foveated acquisition as a sensor attention policy-learning problem, in which past observations guide actions that determine future measurements, closing the perception-acquisition loop. Through extensive simulation across multiple perception tasks, we demonstrate that our approach achieves high task performance under strict pixel budgets and significantly outperforms relevant baselines operating at the same bandwidth. We further validate our system on a 200-megapixel dual-stream sensor, capturing real-world videos under realistic bandwidth and latency constraints, demonstrating the practical feasibility of task-driven, acquisition-time foveated imaging.
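The perception-acquisition loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the learned sensor attention policy is replaced here by a hand-written heuristic that centres the region of interest on the low-resolution cell that changed most between frames, and the dual-stream sensor is simulated with NumPy arrays. All function names and parameters (`downsample`, `choose_roi`, `run_loop`, `factor`, `roi_size`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def downsample(frame, factor):
    """Block-average a 2D frame by an integer factor (simulated low-res global stream)."""
    h, w = frame.shape
    h2, w2 = h // factor, w // factor
    trimmed = frame[:h2 * factor, :w2 * factor]
    return trimmed.reshape(h2, factor, w2, factor).mean(axis=(1, 3))

def choose_roi(prev_low, curr_low, current_roi, factor, roi_size, frame_shape):
    """Stand-in policy (assumption, not a learned one): follow the largest change."""
    diff = np.abs(curr_low - prev_low)
    if diff.max() <= 0:
        return current_roi  # nothing changed, keep looking at the same place
    cy, cx = np.unravel_index(np.argmax(diff), diff.shape)
    # Map the low-res cell centre to full-res coordinates, then clamp to the frame.
    y = int(cy) * factor + factor // 2 - roi_size // 2
    x = int(cx) * factor + factor // 2 - roi_size // 2
    y = min(max(y, 0), frame_shape[0] - roi_size)
    x = min(max(x, 0), frame_shape[1] - roi_size)
    return (y, x)

def run_loop(frames, factor=8, roi_size=32):
    """Past observations pick the ROI that is measured in the NEXT frame."""
    h, w = frames[0].shape
    roi = ((h - roi_size) // 2, (w - roi_size) // 2)  # start at the centre
    prev_low = None
    rois, crops = [], []
    for frame in frames:
        # Acquisition: one low-res global view plus one full-res crop.
        low = downsample(frame, factor)
        y, x = roi
        crops.append(frame[y:y + roi_size, x:x + roi_size])
        rois.append(roi)
        # Policy step: decide where the next frame's full-res crop goes.
        if prev_low is not None:
            roi = choose_roi(prev_low, low, roi, factor, roi_size, frame.shape)
        prev_low = low
    pixels_per_frame = low.size + roi_size * roi_size
    return rois, crops, pixels_per_frame

# Toy input: a 128x128 scene where a small bright patch appears in frame 1.
frames = [np.zeros((128, 128)) for _ in range(3)]
for f in frames[1:]:
    f[80:88, 40:48] = 1.0
rois, crops, pixels_per_frame = run_loop(frames)
print(rois, pixels_per_frame, frames[0].size)
```

In this toy setting each frame costs 1,280 pixels (a 16x16 global view plus a 32x32 crop) instead of the 16,384 pixels of a full readout, and the crop reaches the patch one frame after it appears, since each decision only affects the next measurement. A learned policy, as the abstract describes, would replace `choose_roi` and be trained against the downstream perception task.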