Adaptive Simulation Experiment for LLM Policy Optimization
2026-04-09 • Machine Learning
AI summary
The authors study how to choose the policy that governs a large language model (LLM) deployed in operations management. They treat the LLM as a stochastic simulator, whose outputs vary from run to run, and use pairwise comparisons to identify the best policy from a finite set of candidates. They consider two policy spaces: an unstructured one with no parametric assumption, and a structured one in which the data are generated from a preference model. They develop an adaptive procedure, LLM-PO, that identifies the optimal policy with the desired statistical guarantee while asymptotically attaining the fundamental data requirements. Numerical experiments show that LLM-PO consistently outperforms benchmark methods and improves LLM performance.
Large Language Models, Operations Management, Policy Optimization, Stochastic Simulation, Pairwise Comparison, Adaptive Experimentation, Preference Model, Convex Optimization, Statistical Guarantee, Sampling Proportions
Authors
Mingjie Hu, Siyang Gao, Jian-qiang Hu, Enlu Zhou
Abstract
Large language models (LLMs) have significant potential to improve operational efficiency in operations management. Deploying these models requires specifying a policy that governs response quality, shapes user experience, and influences operational value. In this research, we treat LLMs as stochastic simulators and propose a pairwise comparison-based adaptive simulation experiment framework for identifying the optimal policy from a finite set of candidates. We consider two policy spaces: an unstructured space with no parametric assumption, and a structured space in which the data are generated from a preference model. For both settings, we characterize the fundamental data requirements for identifying the optimal policy with high probability. In the unstructured case, we derive a closed-form expression for the optimal sampling proportions, together with a clear operational interpretation. In the structured case, we formulate a regularized convex program to compute the optimal proportions. We then develop an adaptive experimental procedure, termed LLM-PO, for both policy spaces, and prove that it identifies the optimal policy with the desired statistical guarantee while asymptotically attaining the fundamental data requirements. Numerical experiments demonstrate that LLM-PO consistently outperforms benchmark methods and improves LLM performance.
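To make the problem setup concrete, the following is a minimal sketch of selecting the best policy from a finite set using noisy pairwise comparisons. It is not LLM-PO: it samples pairs in a fixed round-robin order rather than adaptively, and it does not reproduce the paper's optimal sampling proportions or stopping rule. The Bradley-Terry form of the comparison probability, the latent quality scores `thetas`, and the function names are all assumptions made for illustration; the abstract does not specify the preference model.

```python
import math
import random


def simulate_comparison(theta_i, theta_j, rng):
    """Return True if policy i wins one noisy comparison against policy j.

    Assumption: win probability follows a Bradley-Terry model on
    hypothetical latent quality scores theta_i and theta_j.
    """
    p_i_wins = 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))
    return rng.random() < p_i_wins


def select_best_uniform(thetas, n_rounds, seed=0):
    """Non-adaptive baseline: cycle through all pairs, pick the top win rate.

    Returns the index of the selected policy and the empirical win rates.
    """
    rng = random.Random(seed)
    k = len(thetas)
    wins = [0] * k
    plays = [0] * k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]

    for t in range(n_rounds):
        i, j = pairs[t % len(pairs)]  # fixed round-robin allocation
        if simulate_comparison(thetas[i], thetas[j], rng):
            wins[i] += 1
        else:
            wins[j] += 1
        plays[i] += 1
        plays[j] += 1

    # Empirical win rate of each policy over all of its comparisons.
    rates = [w / p if p else 0.0 for w, p in zip(wins, plays)]
    best = max(range(k), key=lambda a: rates[a])
    return best, rates


if __name__ == "__main__":
    # Three hypothetical policies; the third has the highest latent quality.
    best, rates = select_best_uniform([0.0, 0.5, 3.0], n_rounds=3000)
    print("selected policy index:", best)
    print("empirical win rates:", [round(r, 3) for r in rates])
```

A uniform allocation like this spends as many comparisons on clearly inferior policies as on close contenders. An adaptive procedure of the kind the abstract describes instead concentrates comparisons where they are most informative for identifying the optimal policy.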