EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms
2026-06-02 • Machine Learning
Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
AI summary
The authors address reward overoptimization in fine-tuning large language models with reinforcement learning from human feedback (RLHF): under sustained optimization, the learned reward stops tracking real quality and begins to mislead training. They propose EvalStop, a scheduling primitive that terminates a job after k consecutive declines in its evaluation score, frees its GPUs, and keeps the best checkpoint. In simulations mixing reward-hacking and healthy training runs, EvalStop improves job completion time by 9% and cuts wasted compute by 22% relative to the SRTF-Est scheduler on RLHF-heavy workloads. It composes with every base scheduler tested, and its detection quality stays stable under moderate noise in the evaluation scores.
Large Language Models; Fine-tuning; Reinforcement Learning from Human Feedback (RLHF); Reward Overoptimization; Scheduler; Early Stopping; Checkpointing; Job Completion Time (JCT); Evaluation Metrics; Compute Efficiency
Authors
Guilin Zhang, Chuanyi Sun, Shahryar Sarkani, John M. Fossaceca
Abstract
Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p<0.05). Trivial fixed-progress and loss-plateau competitors either incur 65% FPR on healthy RLHF or miss over half of true hacking cases. Gains compose across every base scheduler tested (9-25% JCT) and detection quality stays stable under eval noise (precision at least 91% at noise std <= 0.05) and hacking base rate (precision at least 89% across 20-80% hacking fractions).
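The stopping rule described in the abstract (terminate on k consecutive eval-score declines while preserving the best checkpoint) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `EvalStopMonitor`, the helper `run_evalstop`, the default `k = 3`, and the reading of "decline" as "lower than the immediately preceding eval score" are all assumptions made for illustration. GPU release and delegation to a base scheduler are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class EvalStopMonitor:
    """Hypothetical per-job monitor: signal a stop after k consecutive eval declines."""

    k: int = 3  # assumed patience; the abstract does not fix a value
    best_score: float = float("-inf")
    best_checkpoint: Optional[str] = None
    prev_score: Optional[float] = field(default=None, init=False)
    declines: int = field(default=0, init=False)

    def observe(self, score: float, checkpoint: str) -> bool:
        """Record one evaluation; return True if the job should be terminated."""
        # Always remember the best checkpoint seen so far.
        if score > self.best_score:
            self.best_score = score
            self.best_checkpoint = checkpoint
        # Assumption: a "decline" means strictly lower than the previous eval score.
        if self.prev_score is not None and score < self.prev_score:
            self.declines += 1
        else:
            self.declines = 0  # any non-decline resets the streak
        self.prev_score = score
        return self.declines >= self.k


def run_evalstop(scores: List[float], k: int = 3) -> Tuple[Optional[int], Optional[str]]:
    """Feed a score trace to the monitor; return (stop step or None, best checkpoint)."""
    monitor = EvalStopMonitor(k=k)
    for step, score in enumerate(scores):
        if monitor.observe(score, f"ckpt-{step}"):
            return step, monitor.best_checkpoint
    return None, monitor.best_checkpoint


# Illustrative trace (made-up numbers): the score peaks, wobbles, then falls steadily.
trace = [0.60, 0.66, 0.71, 0.69, 0.70, 0.68, 0.65, 0.61]
stop_step, best = run_evalstop(trace, k=3)
print(stop_step, best)
```

In this made-up trace, the single dip at step 3 is forgiven because the score recovers at step 4, which resets the streak. The job is flagged only after three consecutive declines, and the checkpoint from the peak is the one retained. A larger k makes the rule more tolerant of eval noise but delays termination.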