OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics
2026-06-08 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary
The authors built a benchmark for vision-language model (VLM) agents consisting of twelve new Unreal Engine 5 games: seven solo, three player-versus-player, and two cooperative. The benchmark, OmniGameArena, uses unified action interfaces so that commercial VLMs, open-weight VLMs, and specialized game policies can be compared on the same footing. They also introduce the Improvement Dynamics Curve (IDC), a harness in which a tool-using reflector LLM refines a bounded skill prompt over multiple rounds of reflection. IDC measures not only the first attempt but also how scores change across reflection rounds and how the learned skill performs on held-out task variants. Twelve agents are evaluated on the cold-start leaderboard, and four top agents are evaluated under IDC.
vision-language models, game benchmarks, Unreal Engine 5, Solo play, PvP, cooperative play, agent evaluation, Improvement Dynamics Curve, agentic reflection, skill refinement
Authors
Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang, Wei Huang, Yitang Li, Fan Zhang, Zeyu Hu, Lingting Zhu, Xin Wang, Xiaojuan Qi
Abstract
Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.
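The reflection loop described in the abstract can be pictured with a minimal Python sketch. This is not the authors' harness: the abstract does not specify its API, so `run_idc`, `IDCResult`, the `play`, `reflect`, and `held_out_play` callables, and the character-based bound on the skill prompt are all hypothetical names and assumptions chosen for illustration. The sketch only shows the shape of the two observables: a per-round score curve starting from a cold start, and a held-out score for the final skill.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class IDCResult:
    # Score per round; index 0 is the cold-start (no skill) score.
    round_scores: List[float] = field(default_factory=list)
    # Score of the final skill on held-out task variants.
    held_out_score: float = 0.0
    final_skill: str = ""


def run_idc(
    play: Callable[[str], float],            # hypothetical: runs the agent with a skill prompt, returns a score
    reflect: Callable[[str, float], str],    # hypothetical: reflector proposes a revised skill prompt
    held_out_play: Callable[[str], float],   # hypothetical: same as play, but on held-out variants
    rounds: int,
    max_skill_chars: int = 200,              # assumed form of the "bounded" skill prompt
) -> IDCResult:
    result = IDCResult()
    skill = ""                               # cold start: empty skill prompt
    score = play(skill)
    result.round_scores.append(score)
    for _ in range(rounds):
        # The reflector sees the current skill and score; its output is truncated to the bound.
        skill = reflect(skill, score)[:max_skill_chars]
        score = play(skill)
        result.round_scores.append(score)
    result.final_skill = skill
    result.held_out_score = held_out_play(skill)
    return result


# Toy stand-ins (assumptions, not real agents): score grows with skill length.
def toy_play(skill: str) -> float:
    return float(len(skill))


def toy_reflect(skill: str, score: float) -> str:
    return skill + "tip;"


def toy_held_out(skill: str) -> float:
    return 0.5 * len(skill)


if __name__ == "__main__":
    out = run_idc(toy_play, toy_reflect, toy_held_out, rounds=3, max_skill_chars=10)
    print(out.round_scores, out.held_out_score)
```

In this toy run the score curve flattens once the skill prompt reaches its bound, which is the kind of per-round behaviour a single first-attempt score cannot show.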