Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems

2026-04-09

Information Retrieval, Artificial Intelligence
AI summary

The authors studied how an agent driven by a large language model (LLM) performs tasks in a graphical user interface (GUI) compared to real people. They found that while the agent and the human participants had similar task success and issued broadly similar search queries, they navigated the interface differently. Humans explored more and focused on content, while the agent followed more direct, search-focused paths. This shows that an agent reaching the right answer does not mean it behaves like a human, so it is important to analyze detailed interaction traces when using agents to mimic users.

LLM, GUI agents, task success, query formulation, navigation strategies, user behavior, human-computer interaction, trace-level evaluation, multi-hop search, audio-streaming search
Authors
Maria Movin, Claudia Hauff, Aron Henriksson, Panagiotis Papapetrou
Abstract
LLM-driven GUI agents are increasingly used in production systems to automate workflows and simulate users for evaluation and optimization. Yet most GUI-agent evaluations emphasize task success and provide limited evidence on whether agents interact in human-like ways. We present a trace-level evaluation framework that compares human and agent behavior across (i) task outcome and effort, (ii) query formulation, and (iii) navigation across interface states. We instantiate the framework in a controlled study in a production audio-streaming search application, where 39 participants and a state-of-the-art GUI agent perform ten multi-hop search tasks. The agent achieves task success comparable to participants and generates broadly aligned queries, but follows systematically different navigation strategies: participants exhibit content-centric, exploratory behavior, while the agent is more search-centric and low-branching. These results show that outcome and query alignment do not imply behavioral alignment, motivating trace-level diagnostics when deploying GUI agents as proxies for users in production search systems.
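
To make the idea of comparing navigation across interface states more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each trace is an ordered list of interface-state labels. The state names (`search`, `results`, `album`, and so on), the toy traces, and the two metrics (Jensen-Shannon divergence over state transitions, and mean number of distinct successor states as a rough branching measure) are illustrative assumptions rather than details taken from the paper.

```python
from collections import Counter
from math import log2


def transition_counts(traces):
    """Count (source_state, next_state) pairs over a list of traces."""
    counts = Counter()
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[(src, dst)] += 1
    return counts


def normalize(counts):
    """Turn raw counts into a probability distribution over transitions."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL divergence of a from the mixture m; zero-probability terms contribute 0.
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def mean_branching(traces):
    """Average number of distinct successor states per visited source state."""
    successors = {}
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            successors.setdefault(src, set()).add(dst)
    if not successors:
        return 0.0
    return sum(len(s) for s in successors.values()) / len(successors)


# Hypothetical toy traces (invented for illustration, not study data).
human_traces = [
    ["search", "results", "album", "artist", "album", "track"],
    ["search", "results", "artist", "playlist", "track"],
]
agent_traces = [
    ["search", "results", "track"],
    ["search", "results", "search", "results", "track"],
]

human_dist = normalize(transition_counts(human_traces))
agent_dist = normalize(transition_counts(agent_traces))

divergence = js_divergence(human_dist, agent_dist)
human_branching = mean_branching(human_traces)
agent_branching = mean_branching(agent_traces)
```

In this sketch, a divergence near 0 would indicate that two populations move between interface states in similar proportions, while a larger value signals different navigation habits even when both reach the same target. Other trace-level measures could be substituted without changing the overall structure.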