Reward-Based Online LLM Routing via NeuralUCB
2026-03-31 • Machine Learning • Computation and Language
AI summary
The authors study how to choose which large language model should answer a query, balancing response quality against inference cost using a bandit method called NeuralUCB. In a simulated online setting, this policy consistently earned higher utility reward than routing randomly or always choosing the cheapest model. Compared with always choosing the highest-quality model, it substantially reduced inference cost while keeping reward competitive. The authors also note open challenges in distinguishing between candidate models (action discrimination) and in managing exploration.
NeuralUCB • large language models • model routing • cost-aware decision making • supervised routing • partial-feedback methods • inference cost • utility reward • exploration-exploitation
Authors
Ming-Hua Tsai, Phat Tran
Abstract
This study investigates the use of NeuralUCB for cost-aware large language model (LLM) routing. Existing routing approaches can be broadly grouped into supervised routing methods and partial-feedback methods, each with different tradeoffs in efficiency and adaptivity. We implement a NeuralUCB-based routing policy and evaluate it on RouterBench under a simulated online setting. Experimental results show that the proposed method consistently outperforms random and min-cost baselines in utility reward. Compared with the max-quality reference, our method achieves substantially lower inference cost while maintaining competitive reward. These findings suggest that NeuralUCB is a promising approach for cost-aware LLM routing, while also highlighting remaining challenges in action discrimination and exploration.
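The abstract names NeuralUCB as the routing policy but gives no implementation details. Below is a minimal, self-contained sketch of how a NeuralUCB-style routing loop can work: a small neural network predicts the utility reward of each candidate model for the current query, and an exploration bonus based on the network's parameter gradients encourages trying under-explored models. All names (`TinyNet`, `neural_ucb_route`, `run_demo`) and the toy environment are illustrative assumptions, not the paper's code; the bonus uses the gradient-based form from the NeuralUCB literature, simplified by dropping width-normalization and batch-retraining heuristics.

```python
import numpy as np

rng = np.random.default_rng(0)


class TinyNet:
    """One-hidden-layer tanh MLP with a scalar output (toy reward model)."""

    def __init__(self, d, m=16):
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(d), (m, d))
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(m), m)

    def forward(self, x):
        return self.W2 @ np.tanh(self.W1 @ x)

    def grads(self, x):
        # Gradient of the scalar output w.r.t. all parameters, flattened.
        h = np.tanh(self.W1 @ x)
        dh = self.W2 * (1.0 - h ** 2)        # backprop through tanh
        return np.concatenate([np.outer(dh, x).ravel(), h])

    def sgd_step(self, x, y, lr=0.05):
        # One SGD step on squared error between prediction and observed reward.
        err = self.forward(x) - y
        g = err * self.grads(x)
        m, d = self.W1.shape
        self.W1 -= lr * g[: m * d].reshape(m, d)
        self.W2 -= lr * g[m * d:]


def neural_ucb_route(contexts_fn, reward_fn, n_models, d, T=500, gamma=1.0, lam=1.0):
    """Run T rounds of NeuralUCB-style routing; return the average reward."""
    net = TinyNet(d)
    p = net.W1.size + net.W2.size
    Z_inv = np.eye(p) / lam                  # inverse regularized gradient covariance
    total = 0.0
    for t in range(T):
        xs = contexts_fn(t)                  # one feature vector per candidate model
        scores = []
        for a in range(n_models):
            g = net.grads(xs[a])
            # Optimism in the face of uncertainty: mean prediction + gradient bonus.
            bonus = gamma * np.sqrt(max(g @ Z_inv @ g, 0.0))
            scores.append(net.forward(xs[a]) + bonus)
        a = int(np.argmax(scores))           # route the query to the optimistic model
        r = reward_fn(t, a)                  # observe reward only for the chosen model
        g = net.grads(xs[a])
        Zg = Z_inv @ g                       # Sherman–Morrison rank-1 update of Z_inv
        Z_inv -= np.outer(Zg, Zg) / (1.0 + g @ Zg)
        net.sgd_step(xs[a], r)
        total += r
    return total / T


def run_demo(T=500):
    # Toy environment: 3 candidate models, one-hot contexts, fixed utility rewards.
    xs = [np.eye(3)[a] for a in range(3)]
    utilities = [0.9, 0.1, 0.1]              # model 0 is clearly best
    return neural_ucb_route(lambda t: xs, lambda t, a: utilities[a],
                            n_models=3, d=3, T=T)


if __name__ == "__main__":
    print(f"average reward over 500 rounds: {run_demo():.3f}")
```

In this toy run the policy explores all three models early (their gradient bonuses start large and shrink as each is pulled), then settles on the high-utility model, so the average reward approaches the best arm's utility; the min-cost and random baselines mentioned in the abstract would correspond to fixed or uniform choices over the same arms.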