Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation
2026-04-10 • Multimedia
Multimedia; Computer Vision and Pattern Recognition
AI summary
The authors studied how language models that also see images (vision-language models) can better mimic real users looking at recommendation layouts by matching where users look on the screen. They found that people have unique and stable eye movement patterns that predict what they click on in a carousel-style recommendation. To use this, the authors created a method to adjust the model's attention to match each user's eye fixations, improving how well the model predicts clicks. Their experiments showed that guiding the model to 'look' like the user helps it better simulate real user behavior in recommendations.
large language models, vision-language models, user gaze patterns, eye-tracking, recommendation systems, attention alignment, soft prompts, click prediction, interpretability operators, carousel recommendation
Authors
Lingfeng Huang, Huizhong Guo, Tianjun Wei, Yingpeng Du, Zhu Sun
Abstract
Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than the visual interfaces real users browse. This is a critical gap, since attention over recommendation layouts is both visually driven and highly personalized. We investigate whether aligning a vision-language model's (VLM's) visual attention with user-specific gaze patterns can improve simulation fidelity. Analysis of a real-world eye-tracking dataset collected in a carousel-based recommendation setting reveals that users exhibit stable individual gaze patterns strongly predictive of click behavior. Building on this finding, we propose Fixation-Aligned Tuning for user Emulation (FixATE). Our approach first probes the VLM's internal visual attention via interpretability operators to obtain a slot-level relevance distribution comparable with human fixation, and then learns personalized soft prompts to steer the model's attention toward each user's characteristic fixation pattern. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones demonstrate consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model "see like the user" is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.
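To make the idea of a "slot-level relevance distribution comparable with human fixation" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a 2D patch-level relevance map has already been produced by some probing operator, that each carousel slot covers a rectangular patch region, and that fixation is summarized as per-slot dwell time. The helper names (`normalize`, `slot_relevance`, `kl_divergence`), the toy numbers, and the choice of KL divergence as the alignment gap are all illustrative assumptions; the abstract does not specify the objective used.

```python
import math


def normalize(values, eps=1e-8):
    """Turn non-negative scores into a probability distribution.

    eps is an assumed smoothing constant that keeps every entry positive
    so the logarithm in kl_divergence is always defined.
    """
    smoothed = [v + eps for v in values]
    total = sum(smoothed)
    return [v / total for v in smoothed]


def slot_relevance(relevance_map, slot_boxes):
    """Pool a patch-level relevance map into one probability per slot.

    relevance_map: list of rows of non-negative floats (hypothetical output
                   of an attention-probing operator).
    slot_boxes:    list of (row0, row1, col0, col1) half-open patch ranges,
                   one per carousel slot (hypothetical layout).
    """
    masses = []
    for r0, r1, c0, c1 in slot_boxes:
        masses.append(sum(sum(row[c0:c1]) for row in relevance_map[r0:r1]))
    return normalize(masses)


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy example: a 2x6 relevance map over a carousel with three slots.
relevance_map = [
    [0.1, 0.1, 0.4, 0.4, 0.0, 0.0],
    [0.1, 0.1, 0.4, 0.4, 0.0, 0.0],
]
slot_boxes = [(0, 2, 0, 2), (0, 2, 2, 4), (0, 2, 4, 6)]

model_dist = slot_relevance(relevance_map, slot_boxes)

# Hypothetical per-slot fixation durations (ms) for one user.
user_dist = normalize([200.0, 700.0, 100.0])

# Assumed alignment gap: how far the model's attention is from the user's gaze.
gap = kl_divergence(user_dist, model_dist)
print("model slot distribution:", model_dist)
print("user fixation distribution:", user_dist)
print("alignment gap (KL):", gap)
```

In a setup like the one the abstract describes, a quantity of this kind could serve as the signal that per-user soft prompts are tuned to reduce, with the VLM backbone itself presumably left unchanged; that training loop is outside the scope of this sketch.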