Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
2026-05-12 • Computer Vision and Pattern Recognition
AI summary
The authors studied why computer programs that help users carry out on-screen tasks (like GPT-5.4 and Claude) often struggle with complicated or rare actions. They found that most mistakes happen during these complex interactions because there isn't enough training data for them. To fix this, they created a new test called CUActSpot that checks how well models handle many types of interactions, not just simple clicks. They also built a system to generate large numbers of training examples automatically, and their new model, Phi-Ground-Any-4B, did better than open-source models with far more parameters.
Computer-use agents, GUI operations, Long-tail pattern, Benchmark, Data synthesis, Multimodal interactions, Large language models, Task failure, Training data scarcity, Instruction generation
Authors
Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo
Abstract
Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark, CUActSpot, for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git.
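
To make the abstract's pipeline description concrete, below is a minimal, hypothetical sketch of one renderer-based synthesis step as the abstract outlines it: render a scene, record ground-truth element coordinates, then build a prompt asking an LLM for a matching instruction and action trace. The function names, scene layout, and prompt wording are illustrative assumptions, not the authors' released implementation, and the actual LLM call is left as a placeholder.

```python
# Hypothetical sketch of a renderer-based data-synthesis step (not the paper's code).
import json
import random
from PIL import Image, ImageDraw

def render_gui_scene(width=1280, height=720, n_buttons=5, seed=0):
    """Render a synthetic GUI-like scene and record ground-truth element boxes."""
    rng = random.Random(seed)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    elements = []
    for i in range(n_buttons):
        w, h = rng.randint(80, 200), rng.randint(30, 60)
        x, y = rng.randint(0, width - w), rng.randint(0, height - h)
        draw.rectangle([x, y, x + w, y + h], outline="black", fill="#dddddd")
        label = f"Button {i}"
        draw.text((x + 8, y + 8), label, fill="black")
        elements.append({"label": label, "bbox": [x, y, x + w, y + h]})
    return img, elements

def build_llm_prompt(elements):
    """Prompt asking an LLM for an instruction plus an action trace
    (action type and target coordinates) grounded in the recorded boxes."""
    return (
        "Given these on-screen elements and their pixel bounding boxes:\n"
        + json.dumps(elements, indent=2)
        + "\nWrite a natural-language instruction and the action trace "
          "(action type and target coordinates) that accomplishes it."
    )

if __name__ == "__main__":
    screenshot, elements = render_gui_scene()
    screenshot.save("scene_0000.png")
    prompt = build_llm_prompt(elements)
    # A full pipeline would send `prompt` to an LLM and store the returned
    # (instruction, action trace) pair alongside the screenshot and coordinates.
    print(prompt[:400])
```

Because the renderer records exact element coordinates at generation time, the synthesized instruction-action pairs come with pixel-accurate grounding for free, which is what makes this kind of pipeline attractive for covering rare interaction types.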