AI summary
The authors treat memory management for large language models (LLMs) as a skill that can be learned and improved, similar to how people develop memory expertise. They let the model itself decide how to organize and recall information through file-system-style memory actions, and they improve this skill along two axes: the structure that supports memory (prompts, file schemas, action vocabulary) and the model’s proficiency in using it. Their framework, AutoMem, uses a strong LLM to review complete agent trajectories and revise the memory structure, and it trains the model on its own good memory decisions. On three procedurally generated long-horizon games (Crafter, MiniHack, and NetHack), optimizing memory alone, without modifying the model’s task-action behavior, improved the base agent’s performance by roughly 2x-4x. The authors conclude that memory management is an independently learnable skill that can yield large gains on long-horizon tasks.
Metamemory, Large Language Models (LLMs), Memory management, File-system operations, AutoMem, Long-horizon tasks, Prompt engineering, Training signal, Procedurally generated games, Model proficiency
Authors
Shengguang Wu, Hao Zhu, Yuhui Zhang, Xiaohan Wang, Serena Yeung-Levy
Abstract
Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill. We promote file-system operations to first-class memory actions alongside task actions, letting the model itself decide how to manage its memory. This memory skill improves along two axes: the structure that supports it (prompts, file schemas, action vocabulary), and the proficiency of the model exercising it. Both axes resist manual optimization: episodes in long-horizon tasks run for thousands of steps, and a single memory mistake can stay hidden long before its effects surface, making human review of full trajectories impractical. We introduce AutoMem, a framework that automates both axes. In the first loop, a strong LLM reviews complete agent trajectories and iteratively revises the memory structure that shapes how the agent interacts with its memory files. In the second loop, the agent's own good memory decisions are identified from many episodes and used as a training signal to sharpen the model's memory proficiency directly. Across three procedurally generated long-horizon games (Crafter, MiniHack, and NetHack), optimizing memory alone--without modifying the model's task-action behavior--improved the base agent's performance by roughly 2x-4x, making a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. Our results show that memory management is an independently learnable skill and a high-leverage objective that yields large gains on long-horizon tasks.
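To make the idea of file-system operations as first-class memory actions more concrete, the following Python sketch shows one way an agent's action space could mix memory operations with task actions. The abstract does not specify AutoMem's action vocabulary, file schema, or interfaces, so every name here (`TaskAction`, `MemoryAction`, `MemoryStore`, `run_actions`, the five operation names, and the example file path) is a hypothetical assumption for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class TaskAction:
    """Hypothetical action sent to the environment (e.g. a game move)."""
    name: str


@dataclass(frozen=True)
class MemoryAction:
    """Hypothetical file-style operation on the agent's memory."""
    op: str            # assumed vocabulary: write, append, read, delete, list
    path: str = ""
    content: str = ""


Action = Union[TaskAction, MemoryAction]


@dataclass
class MemoryStore:
    """In-memory stand-in for a directory of memory files."""
    files: Dict[str, str] = field(default_factory=dict)

    def apply(self, action: MemoryAction) -> str:
        """Execute one memory operation and return the text shown to the agent."""
        if action.op == "write":
            self.files[action.path] = action.content
            return f"wrote {action.path}"
        if action.op == "append":
            self.files[action.path] = self.files.get(action.path, "") + action.content
            return f"appended to {action.path}"
        if action.op == "read":
            return self.files.get(action.path, f"<no such file: {action.path}>")
        if action.op == "delete":
            self.files.pop(action.path, None)
            return f"deleted {action.path}"
        if action.op == "list":
            return "\n".join(sorted(self.files))
        raise ValueError(f"unknown memory op: {action.op}")


def run_actions(actions: List[Action], memory: MemoryStore) -> List[str]:
    """Dispatch a mixed sequence of memory and task actions, returning a log."""
    log = []
    for action in actions:
        if isinstance(action, MemoryAction):
            log.append("memory: " + memory.apply(action))
        else:
            # A real agent would step the environment here; this sketch only records it.
            log.append("task: " + action.name)
    return log


if __name__ == "__main__":
    store = MemoryStore()
    steps = [
        MemoryAction("write", "notes/recipes.md", "pickaxe: wood + stone\n"),
        TaskAction("move_north"),
        MemoryAction("read", "notes/recipes.md"),
    ]
    for line in run_actions(steps, store):
        print(line)
```

In this sketch the model's output is parsed into either kind of action, so deciding when to write, read, or reorganize a file is a choice the model makes at each step, just like choosing a move. The two optimization loops described above would then operate on the surrounding prompt and schema and on logged memory decisions, respectively; neither loop is modeled here.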