AI summary
The authors focus on how smart devices like glasses can better understand what they see by learning from each user's unique visual surroundings, a capability they call Personal Visual Context Learning. They created a benchmark, Personal-VCL-Bench, to measure how well current models use this personal visual information. Their study found that existing large multimodal models are poor at exploiting the different visual cues in a person's environment. To address this, the authors developed the Agentic Context Bank, a system that organizes and selects helpful visual memories at inference time, which improves performance. This work helps guide future development of personalized AI assistants that better understand visual context.
Large Multimodal Models · Wearable Devices · Visual Personalization · Prompt-time Capability · Personal Visual Context Learning · Benchmarking · Context Utilization · Memory Bank · Evidence Selection · Personalized AI Assistants
Authors
Zihui Xue, Ami Baid, Sangho Kim, Mi Luo, Kristen Grauman
Abstract
As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.
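The abstract describes the Agentic Context Bank only at a high level: a self-refining memory bank over the user's visual context, plus query-adaptive evidence selection at prompt time. A minimal sketch of that idea, assuming embedding-based storage and cosine-similarity retrieval (the class, method names, and merge heuristic here are hypothetical illustrations, not the paper's actual implementation):

```python
import math
from dataclasses import dataclass, field

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class ContextBank:
    """Hypothetical memory bank over a user's visual stream: stores
    (embedding, note) entries, drops near-duplicate observations (a
    crude stand-in for the paper's "self-refining" step), and selects
    the top-k most query-relevant entries as evidence."""
    merge_threshold: float = 0.95
    entries: list = field(default_factory=list)  # (vector, note) pairs

    def add(self, vector, note):
        # Self-refinement stand-in: skip observations nearly identical
        # to something already stored, keeping the bank compact.
        for stored_vec, _ in self.entries:
            if cosine(stored_vec, vector) >= self.merge_threshold:
                return False  # treated as a duplicate observation
        self.entries.append((vector, note))
        return True

    def select(self, query_vector, k=3):
        # Query-adaptive evidence selection: rank stored context by
        # similarity to the query and return only the top-k notes.
        ranked = sorted(self.entries,
                        key=lambda e: cosine(e[0], query_vector),
                        reverse=True)
        return [note for _, note in ranked[:k]]

# Toy usage with hand-made 3-d "embeddings".
bank = ContextBank()
bank.add([1.0, 0.0, 0.0], "user's red mug on the kitchen counter")
bank.add([0.99, 0.01, 0.0], "red mug again")  # merged as near-duplicate
bank.add([0.0, 1.0, 0.0], "user's dog in the living room")
evidence = bank.select([0.9, 0.1, 0.0], k=1)  # query about the mug
```

The selected notes would then be prepended to the LMM prompt, so the model reasons only over evidence relevant to the current query rather than the full visual history.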