Lightweight LLM Agent Memory with Small Language Models

2026-04-09 | Artificial Intelligence

AI summary

The authors present LightMem, a memory system designed to help language models remember past information more efficiently when handling long conversations. LightMem uses smaller language models to organize memory into short-term, mid-term, and long-term parts, allowing quick access and effective storage without slowing down the system too much. It also separates quick online memory use from slower offline updates, improving both accuracy and speed. Their experiments show that LightMem boosts performance while keeping search times fast across different model sizes.

Large Language Models, Memory Systems, Retrieval-based Memory, Small Language Models, Short-term Memory, Mid-term Memory, Long-term Memory, Vector Retrieval, Semantic Re-ranking, Latency
Authors
Jiaquan Zhang, Chaoning Zhang, Shuxu Chen, Zhenzhen Huang, Pengcheng Zheng, Zhicheng Wang, Ping Guo, Fan Mo, Sung-Ho Bae, Jie Zou, Jiwei Wei, Yang Yang
Abstract
Although LLM agents can leverage tools for complex tasks, they still need memory to maintain cross-turn consistency and accumulate reusable information in long-horizon interactions. Retrieval-based external memory systems incur low online overhead but suffer from unstable accuracy because their query construction and candidate filtering are limited. In contrast, many other systems rely on repeated large-model calls for online memory operations, which improves accuracy but accumulates latency over long interactions. We propose LightMem, a lightweight agent memory system driven by Small Language Models (SLMs). LightMem modularizes memory retrieval, writing, and long-term consolidation, and separates online processing from offline consolidation to enable efficient memory invocation under bounded compute. We organize memory into short-term memory (STM) for immediate conversational context, mid-term memory (MTM) for reusable interaction summaries, and long-term memory (LTM) for consolidated knowledge, and use user identifiers to support independent retrieval and incremental maintenance in multi-user settings. Online, LightMem operates under a fixed retrieval budget and selects memories via a two-stage procedure: vector-based coarse retrieval followed by semantic consistency re-ranking. Offline, it abstracts reusable interaction evidence and incrementally integrates it into LTM. Experiments show gains across model scales, with an average F1 improvement of about 2.5 on LoCoMo, while keeping median latency low (83 ms retrieval; 581 ms end-to-end).
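The online path described in the abstract (user-scoped memory, a fixed retrieval budget, coarse vector retrieval, then re-ranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `MemoryItem`, `retrieve`, `coarse_k`, and `budget` are hypothetical, the embeddings are hand-written 2-D toy vectors, and `token_overlap` is only a stand-in for the SLM-based semantic consistency scorer, whose details the abstract does not give.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    user_id: str            # scopes retrieval to one user (multi-user setting)
    tier: str               # "STM", "MTM", or "LTM"
    text: str
    embedding: list         # toy vector; a real system would use an encoder


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def token_overlap(query, text):
    """Jaccard overlap of lowercase tokens.

    Assumption: a placeholder for an SLM re-ranker, not the paper's scorer.
    """
    q, t = set(query.lower().split()), set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def retrieve(store, user_id, query, query_emb,
             coarse_k=4, budget=2, rerank=token_overlap):
    """Two-stage retrieval under a fixed budget (hypothetical interface)."""
    # Only this user's memories are eligible.
    candidates = [m for m in store if m.user_id == user_id]
    # Stage 1: coarse retrieval by vector similarity, keep the top coarse_k.
    coarse = sorted(candidates,
                    key=lambda m: cosine(query_emb, m.embedding),
                    reverse=True)[:coarse_k]
    # Stage 2: re-rank the shortlist with the (stand-in) semantic scorer.
    reranked = sorted(coarse, key=lambda m: rerank(query, m.text), reverse=True)
    # Return at most `budget` memories, regardless of store size.
    return reranked[:budget]


store = [
    MemoryItem("alice", "MTM", "alice prefers vegetarian food", [1.0, 0.0]),
    MemoryItem("alice", "LTM", "alice lives in Seoul", [0.9, 0.1]),
    MemoryItem("alice", "STM", "weather is sunny today", [0.0, 1.0]),
    MemoryItem("bob", "LTM", "bob prefers vegetarian food", [1.0, 0.0]),
]

hits = retrieve(store, "alice", "what food does alice prefer", [1.0, 0.0],
                coarse_k=2, budget=1)
for item in hits:
    print(item.tier, item.text)
```

In this sketch the cost of the re-ranking stage depends only on `coarse_k`, and the number of memories passed to the agent depends only on `budget`, which is one plausible reading of how a fixed retrieval budget keeps online work bounded. Offline consolidation into LTM is deliberately left out.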