AI summary
The authors explain that current ways of checking if AI language models match what people want usually ask for opinions right after using the AI, assuming preferences don’t change. They argue that people’s feelings about AI decisions can change over time as they see real results, so evaluations should happen over longer periods. To do this, the authors created a system called BITE that collects initial preferences, later reflections, and anonymous behavior data. Their study with 8 users over two weeks showed that preferences can differ between immediate reactions and later thoughts. This suggests that looking at preferences at only one moment is not enough to understand how well AI aligns with user needs.
Keywords
human-AI alignment, large language models, preference elicitation, longitudinal study, behavioral traces, context-situated evaluation, user consent, interaction reflection
Authors
Simret Araya Gebreegziabher, Allison E Sproul, Yinuo Yang, Chaoran Chen, Diego Gómez-Zará, Toby Jia-Jun Li
Abstract
Current human-AI alignment and evaluation methods for large language models (LLMs) often rely on preference signals collected immediately after an interaction. This practice implicitly treats preference as static, even though many LLM-mediated decisions unfold over time and may be judged differently once real-world consequences and outcomes are observed. We therefore argue for a methodological shift from single-moment preference elicitation to longitudinal, context-situated alignment measurement. We present a methodological framework for collecting temporally grounded alignment signals by combining (1) in-situ preference capture, (2) context-triggered follow-up preference reflection, and (3) privacy-preserving behavioral traces that help interpret preference change. As an instantiation of this methodology, we introduce BITE, a browser-based system that detects consequential LLM interactions, prompts reflection at later decision points, and supports progressive, user-controlled consent for sharing behavioral data. In a two-week longitudinal deployment study with 8 participants, our approach surfaced differences between immediate and later user preferences along accuracy, relevance, and other dimensions of LLM output. Our findings highlight the limitations of single-moment preference datasets and underscore the importance of longitudinal methods for alignment evaluation in everyday use.
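To make the three signal types concrete, here is a minimal sketch of how a BITE-style record might be structured. This is not the authors' implementation; all type names, fields, and the consent tiers are illustrative assumptions, written in TypeScript since the system is browser-based.

```typescript
// Hypothetical sketch of a temporally grounded alignment record combining
// the three signal types named in the abstract. Names and fields are
// assumptions for illustration, not BITE's actual schema.

type ConsentLevel = "none" | "aggregate-only" | "full-trace"; // progressive, user-controlled

interface InSituPreference {
  interactionId: string;              // the consequential LLM interaction this rates
  timestamp: number;                  // when the rating was captured (in situ)
  rating: Record<string, number>;     // e.g. { accuracy: 4, relevance: 5 }
}

interface FollowUpReflection {
  interactionId: string;
  triggeredBy: string;                // later decision point that triggered the prompt
  timestamp: number;
  rating: Record<string, number>;     // same dimensions, re-elicited later
}

interface BehavioralTrace {
  interactionId: string;
  consent: ConsentLevel;                            // shared only at the level the user allows
  events?: { kind: string; timestamp: number }[];   // omitted unless consent permits
}

// A longitudinal alignment signal pairs the immediate rating with any later
// reflections, so preference change can be measured per dimension over time.
interface AlignmentSignal {
  initial: InSituPreference;
  reflections: FollowUpReflection[];
  trace?: BehavioralTrace;
}

// Example: drift on one dimension between the immediate rating and the most
// recent reflection (positive = the user rated the output higher later).
function preferenceDrift(signal: AlignmentSignal, dimension: string): number | undefined {
  const latest = signal.reflections.at(-1);
  if (!latest) return undefined;
  return latest.rating[dimension] - signal.initial.rating[dimension];
}
```

Keying every record to a shared interactionId is what would let an immediate rating be compared with later reflections dimension by dimension, which is the kind of immediate-versus-later comparison the study reports.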