AI summary
The authors point out that most computer agents working with graphical user interfaces (GUIs) assume the screen doesn't change unless the agent acts. But in apps like short-video platforms, the content keeps moving, making it harder for agents to decide what to watch and when to stop. They created a new benchmark called LivingScreen that tests agents in this kind of dynamic environment, using a browser-based setup and metrics for both accuracy and efficiency. Their tests show that current top models still fall short of humans, mainly because they watch too much or too little, suggesting that controlling what to observe is a key skill future agents need. They also plan to release the data and code for others to use.
GUI agents, short-video applications, dynamic interfaces, LivingScreen benchmark, observation control, accuracy and efficiency metrics, browser-based environment, agent evaluation
Authors
Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo
Abstract
GUI agents today assume a static screen, where the world is frozen between two actions. However, real interfaces such as short-video applications violate this assumption: their content keeps playing, and a competent user must decide what to watch and for how long. We formalize this task as Living-Screen-Native GUI agents and introduce LivingScreen, the first benchmark instantiating it on short-video platforms, with a faithful browser-based environment, a three-tier task suite, and metrics that jointly score accuracy and information efficiency. Evaluating a wide range of frontier models, we find that none matches human cost-accuracy performance, and that their dominant failure mode is over- and under-observation, pointing to observation control as a missing capability axis for future GUI agents. All data and code will be available at https://github.com/BITHLP/LivingScreen.