Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline
2026-06-03 • Artificial Intelligence
AI summary
The authors studied memory systems that large language model (LLM) agents use to handle histories that outgrow their context windows. They evaluated eight existing memory systems plus an agentic harness for search problems, which lets the agent manage flat text-file storage itself through tool calls. Across five scenarios (single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks), the harness achieved the best cross-task ranking, suggesting that active control over storage and retrieval matters more than a passive store behind a fixed pipeline. Building on this, they developed AutoMEM, an agentic memory harness with a self-managed tool interface, which achieved the best cross-scenario generality among the systems evaluated.
LLM agents, memory systems, context window, multi-session chat, agentic trajectory, search problems, storage retrieval, AutoMEM, tool interface, long-horizon tasks
Authors
Zhikai Chen, Jialiang Gu, Junyu Yin, Xianxuan Long, Shenglai Zeng, Xiaoze Liu, Kai Guo, Keren Zhou, Jiliang Tang
Abstract
LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi-session chat or a single trajectory format), and there is little evidence that they generalize across the heterogeneous trajectories agents encounter in deployment. We revisit eight memory systems, plus an agentic harness for search problems, on five scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. The harness, which self-manages flat text-file storage via tool calls, achieves the best cross-task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. We instantiate this insight in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems we evaluate.
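To make the idea of self-managed flat text-file storage concrete, the following is a minimal sketch of what such a tool interface might look like. It is not AutoMEM's actual interface, which the abstract does not specify: the class name `FlatFileMemory`, the tool names `write`, `read`, and `search`, and the substring-based search are all assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path


class FlatFileMemory:
    """Hypothetical flat text-file store that an agent drives through tool calls."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name):
        # Keep only the final path component so a call cannot escape the root.
        return self.root / (Path(name).name + ".txt")

    def write(self, name, text):
        """Append one line to a memory file, creating the file if needed."""
        with self._path(name).open("a", encoding="utf-8") as f:
            f.write(text.rstrip("\n") + "\n")
        return f"appended to {name}"

    def read(self, name):
        """Return the full contents of a memory file, or '' if it is missing."""
        path = self._path(name)
        return path.read_text(encoding="utf-8") if path.exists() else ""

    def search(self, query):
        """Case-insensitive substring search; returns (file, line) pairs."""
        hits = []
        for path in sorted(self.root.glob("*.txt")):
            for line in path.read_text(encoding="utf-8").splitlines():
                if query.lower() in line.lower():
                    hits.append((path.stem, line))
        return hits

    def call(self, tool, **args):
        """Dispatch one tool call, as an agent loop might do with model output."""
        tools = {"write": self.write, "read": self.read, "search": self.search}
        if tool not in tools:
            return f"unknown tool: {tool}"
        return tools[tool](**args)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        memory = FlatFileMemory(tmp)
        memory.call("write", name="user_prefs", text="Prefers metric units")
        memory.call("write", name="task_log", text="Step 3 failed: missing API key")
        print(memory.call("search", query="api key"))
        # prints [('task_log', 'Step 3 failed: missing API key')]
```

In this sketch the agent, not a fixed pipeline, decides which file to write to and when to search, which is the contrast with a passive store that the abstract draws. A real harness would presumably expose richer operations (editing, deleting, listing files) and pass tool schemas to the model.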