See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
2026-04-14 • Computer Vision and Pattern Recognition
AI summary
The authors study how computer assistants can better click exactly where they need to in complex coding software, where tiny mistakes matter. They find that instead of guessing the right spot once, their agent keeps trying and adjusting based on visual feedback until it clicks correctly. This back-and-forth method works much better than older single-try approaches, leading to more precise clicks and higher success in coding tasks. Their work suggests that iterative correction is important for future software helpers.
Graphical User Interface (GUI) • Computer Use Agents (CUAs) • cursor localization • pixel-precision • iterative refinement • closed-loop feedback • coding environments • IDE elements • language instructions • screen actions
Authors
Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso, Yu Hu
Abstract
Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where pixel-level accuracy is required to interact with tightly packed IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of single-step execution, our agent engages in an iterative refinement process, using visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.
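The closed-loop mechanism the abstract describes can be sketched as a simple predict–observe–correct loop. The sketch below is illustrative only: `predict_click`, `refine_until_hit`, the fixed initial offset, and the correction gain are all assumptions standing in for the paper's actual vision-language model queries on real screenshots, where the "feedback" would be the cursor's observed displacement from the target element in the next frame.

```python
# Minimal sketch of multi-turn grounding with visual feedback.
# All names and numbers here are hypothetical stand-ins, not the paper's API.
from dataclasses import dataclass


@dataclass
class ClickAttempt:
    x: float
    y: float


def predict_click(target, prev=None, feedback=None, gain=0.8):
    """Stub 'model': the first guess lands off-target (a stand-in for
    single-shot grounding error in a dense IDE screenshot); later turns
    move a fraction `gain` of the observed displacement back toward it."""
    if prev is None:
        return ClickAttempt(target[0] + 14.0, target[1] - 9.0)
    dx, dy = feedback
    return ClickAttempt(prev.x - gain * dx, prev.y - gain * dy)


def refine_until_hit(target, tol=1.0, max_turns=10):
    """Closed loop: click, read the displacement off the new screenshot
    (the 'visual feedback'), and re-predict until within `tol` pixels."""
    attempt, feedback = None, None
    for turn in range(1, max_turns + 1):
        attempt = predict_click(target, attempt, feedback)
        dx, dy = attempt.x - target[0], attempt.y - target[1]
        if (dx * dx + dy * dy) ** 0.5 <= tol:
            return turn, attempt
        feedback = (dx, dy)
    return max_turns, attempt


turns, final = refine_until_hit(target=(640.0, 360.0))
```

With this toy error model the loop converges in three turns, whereas the first (single-shot) attempt misses by roughly 17 pixels; the paper's contribution is showing that real VLM-driven agents benefit from the same self-correction structure.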