RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

2026-05-11 · Robotics
AI summary

The authors introduce RoboMemArena, a benchmark for evaluating robotic memory on long, multi-step tasks. It comprises 26 tasks that require a robot to recall past actions and observations, with detailed memory-related annotations. They also develop PrediMem, a system that pairs a high-level planner with memory buffers to track task dynamics. Experiments show that PrediMem outperforms prior methods and yield insights into memory management and the design of memory-based models.

robotic memory, benchmark, multimodal annotations, vision-language model, memory-dependent tasks, trajectory, memory buffers, predictive coding, task dynamics, long-horizon tasks
Authors
Huashuo Lei, Wenxuan Song, Huarui Zhang, Jieyuan Pei, Jiayi Chen, Haodong Yan, Han Zhao, Pengxiang Ding, Zhipeng Zhang, Lida Huang, Donglin Wang, Yan Wang, Haoang Li
Abstract
Memory is a critical component of robotic intelligence, as robots must rely on past observations and actions to accomplish long-horizon tasks in partially observable environments. However, existing robotic memory benchmarks still lack multimodal annotations for memory formation, provide limited task coverage and structural complexity, and remain restricted to simulation without real-world evaluation. We address this gap with RoboMemArena, a large-scale benchmark of 26 tasks, with average trajectory lengths exceeding 1,000 steps per task and 68.9% of subtasks being memory-dependent. The generation pipeline leverages a vision-language model (VLM) to design and compose subtasks, generates full trajectories through atomic functions, and provides memory-related annotations, including subtask instructions and native keyframe annotations, while paired real-world memory tasks support physical evaluation. We further design PrediMem, a dual-system VLA in which a high-level VLM planner manages a memory bank with recent and keyframe buffers and uses a predictive coding head to improve sensitivity to task dynamics. Extensive experiments on RoboMemArena show that PrediMem outperforms all baselines and provides insights into memory management, model architecture, and scaling laws for complex memory systems.
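The abstract describes a memory bank split into a recent buffer and a keyframe buffer that feeds context to a high-level planner. The sketch below illustrates that general idea in minimal Python; all names, buffer sizes, and the merge policy are assumptions for illustration, not the actual PrediMem implementation.

```python
from collections import deque

class MemoryBank:
    """Toy memory bank with a recent buffer and a keyframe buffer.

    Illustrative only: the class name, buffer sizes, and merge policy
    are guesses, not the architecture described in the paper.
    """

    def __init__(self, recent_size=8):
        # Sliding window over the most recent observations.
        self.recent = deque(maxlen=recent_size)
        # Keyframes persist for the whole episode.
        self.keyframes = []

    def add(self, step, obs, is_keyframe=False):
        self.recent.append((step, obs))
        if is_keyframe:
            self.keyframes.append((step, obs))

    def context(self):
        # Merge persistent keyframes with the recent window,
        # deduplicated by step and returned in temporal order,
        # as context for a high-level planner.
        merged = {s: o for s, o in self.keyframes}
        merged.update({s: o for s, o in self.recent})
        return [merged[s] for s in sorted(merged)]

bank = MemoryBank(recent_size=3)
for t in range(6):
    bank.add(t, f"obs{t}", is_keyframe=(t == 1))
print(bank.context())  # keyframe obs1 plus the last 3 observations
```

The split lets old but task-critical frames (e.g. where an object was placed) survive after the sliding window has discarded them, which matches the benchmark's emphasis on memory-dependent subtasks.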