Seek to Segment: Active Perception for Panoramic Referring Segmentation

2026-07-02 · Computer Vision and Pattern Recognition

AI summary

The authors introduce a new task called Active Panoramic Referring Segmentation (APRS), where an agent actively looks around a full 360-degree environment to find and segment an object based on a user's description. They propose PanoSeeker, a smart agent that remembers what it has seen using a special memory system called EgoSphere, helping it avoid searching the same places repeatedly. PanoSeeker learns from expert examples and improves using a method that rewards efficient searching. Their tests show PanoSeeker finds and segments objects better and faster than existing methods adapted for this task.

Referring Segmentation, Embodied AI, 360-degree Environment, Active Perception, Vision-Language Model, EgoSphere, Memory-Augmented Agent, Supervised Fine-Tuning, Reinforcement Learning, Search Efficiency
Authors
Song Tang, Shuming Hu, Xincheng Shuai, Henghui Ding, Yu-Gang Jiang
Abstract
Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in continuous 360$^\circ$ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction by $(\Delta\theta, \Delta\phi)$ to explore the 360$^\circ$ environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360$^\circ$ representation, EgoSphere enables the agent to plan efficient and non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask. Furthermore, we curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize PanoSeeker's exploration efficiency. Extensive experiments on our newly established APRS benchmark demonstrate that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.
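
As a rough illustration of the idea of accumulating local views into a single 360$^\circ$ record, the sketch below tracks which viewing directions have already been observed as the view is rotated by $(\Delta\theta, \Delta\phi)$. It is not the authors' EgoSphere or PanoSeeker implementation: the class `CoverageMemory`, the function `step`, the grid resolution, the 90$^\circ$ field of view, and the rectangular angle-space footprint (instead of a true perspective projection) are all assumptions made for illustration, and the memory here stores only a seen/unseen flag rather than visual content.

```python
class CoverageMemory:
    """Hypothetical stand-in for a spherical memory.

    Stores only whether each yaw/pitch cell has been observed,
    not the image content itself.
    """

    def __init__(self, yaw_bins=36, pitch_bins=18):
        self.yaw_bins = yaw_bins
        self.pitch_bins = pitch_bins
        self.seen = [[False] * yaw_bins for _ in range(pitch_bins)]

    def integrate(self, theta, phi, hfov=90.0, vfov=90.0):
        """Mark cells whose centers lie inside the view centered at
        yaw theta and pitch phi (degrees). The footprint is approximated
        as a rectangle in angle space, which is an assumption."""
        for i in range(self.pitch_bins):
            pitch_c = -90.0 + (i + 0.5) * 180.0 / self.pitch_bins
            if abs(pitch_c - phi) > vfov / 2:
                continue
            for j in range(self.yaw_bins):
                yaw_c = (j + 0.5) * 360.0 / self.yaw_bins
                # Signed yaw difference wrapped into [-180, 180).
                d = (yaw_c - theta + 180.0) % 360.0 - 180.0
                if abs(d) <= hfov / 2:
                    self.seen[i][j] = True

    def coverage(self):
        """Fraction of grid cells observed so far."""
        total = self.yaw_bins * self.pitch_bins
        return sum(sum(row) for row in self.seen) / total


def step(theta, phi, d_theta, d_phi):
    """Apply a view change: yaw wraps around, pitch is clamped to [-90, 90]."""
    theta = (theta + d_theta) % 360.0
    phi = max(-90.0, min(90.0, phi + d_phi))
    return theta, phi


if __name__ == "__main__":
    memory = CoverageMemory()
    theta, phi = 0.0, 0.0
    memory.integrate(theta, phi)
    # Rotate by a quarter turn in yaw and record the new view.
    theta, phi = step(theta, phi, 90.0, 0.0)
    memory.integrate(theta, phi)
    print(f"coverage after two views: {memory.coverage():.3f}")
```

In this toy setting, re-integrating a view that was already recorded leaves the coverage unchanged, which is the kind of signal a planner could use to avoid revisiting the same directions.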