EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents
2026-06-09 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors introduce EEVEE, a way for large language models to learn prompts at test time when tasks from many different sources arrive in a single mixed stream. Unlike earlier methods, which work on one dataset at a time, EEVEE uses a router to group similar tasks into clusters and choose a suitable prompt for each cluster. The router and the prompts are trained together in alternating phases so that the system handles varied inputs better. Tests show EEVEE helps the model perform more reliably across mixed tasks while keeping its efficiency and its ability to learn on a single benchmark.
Large Language Models, Prompt Learning, Test-time Learning, Multi-dataset, Task Clustering, Router Mechanism, Co-evolution Strategy, Cross-dataset Interference, Benchmarking, Model Robustness
Authors
Weixian Xu, Shilong Liu, Mengdi Wang
Abstract
In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability: real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns each cluster to a suitable prompt configuration. The router and prompts are optimized via a router-prompt co-evolution strategy, which interleaves router learning and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, respectively, surpassing the SOTA methods GEPA and ACE by up to 37.2% and 48.2%, respectively.
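The router and interleaved learning phases described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words `embed` function, the cosine threshold, the `Router` class, the `co_evolve` loop, and the per-input "lesson" strings are all hypothetical stand-ins chosen only to show the overall shape of the idea, namely clustering a mixed stream and keeping one prompt per cluster while alternating between a router phase and a prompt phase.

```python
import math
from collections import Counter

BASE_PROMPT = "You are a helpful agent."  # hypothetical starting prompt


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class Router:
    """Assigns inputs to task clusters by similarity to cluster centroids."""

    def __init__(self, threshold=0.2):  # threshold is an illustrative choice
        self.threshold = threshold
        self.centroids = []  # one Counter per cluster

    def route(self, text, learn=False):
        vec = embed(text)
        sims = [cosine(vec, c) for c in self.centroids]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else None
        if learn:
            # Open a new cluster when nothing is similar enough.
            if best is None or sims[best] < self.threshold:
                self.centroids.append(Counter())
                best = len(self.centroids) - 1
            self.centroids[best].update(vec)  # move centroid toward the input
        return best


def co_evolve(stream, rounds=2):
    """Alternate router and prompt phases over (input, lesson) pairs."""
    router, prompts = Router(), {}
    for _ in range(rounds):
        # Router phase: prompts stay fixed, clusters are refined.
        for text, _ in stream:
            router.route(text, learn=True)
        # Prompt phase: router stays fixed, each cluster's prompt is updated.
        for text, lesson in stream:
            cid = router.route(text)
            prompt = prompts.setdefault(cid, BASE_PROMPT)
            if lesson not in prompt:
                prompts[cid] = prompt + "\n- " + lesson
    return router, prompts


# A mixed stream of two task types; lessons are made-up feedback strings.
stream = [
    ("solve the equation x + 2 = 5", "show algebra steps"),
    ("translate this sentence into french", "keep the tone"),
    ("solve the equation 3 x = 9", "show algebra steps"),
    ("translate this sentence into german", "keep the tone"),
]
router, prompts = co_evolve(stream)
```

In this sketch, keeping a separate prompt per cluster means feedback from translation inputs never edits the prompt used for equation solving, which is one simple reading of how routing could reduce cross-dataset interference.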