ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models

2026-04-09Artificial Intelligence

Artificial Intelligence
AI summary

The authors point out that current tests for AI language models mainly check how well they remember facts, but don’t test if these models can use learned skills automatically without thinking. They created ImplicitMemBench, a new test suite that measures this kind of ‘implicit memory’ using ideas from psychology like learning new skills quickly, being influenced by recent experiences, and forming associations. They tested 17 AI models and found that none performed close to humans, showing big gaps especially in areas like ignoring bad choices versus preferring good ones. Their work suggests that improving AI memory needs new designs, not just bigger models.

Implicit MemoryProcedural MemoryPrimingClassical ConditioningNon-declarative MemoryMemory BenchmarkLearning/Priming-Interfere-Test ProtocolLanguage ModelsModel EvaluationCognitive Science
Authors
Chonghan Qin, Xiachong Feng, Weitao Ma, Xiaocheng Feng, Lingpeng Kong
Abstract
Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive-science accounts of non-declarative memory: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (Conditioned Stimulus--Unconditioned Stimulus (CS--US) associations shaping first decisions). Our 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall, with top performers DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%) far below human baselines. Analysis uncovers dramatic asymmetries (inhibition 17.6% vs. preference 75.0%) and universal bottlenecks requiring architectural innovations beyond parameter scaling. ImplicitMemBench reframes evaluation from "what agents recall" to "what they automatically enact".