LLM Zeroth-Order Fine-Tuning is an Inference Workload
2026-05-27 • Machine Learning
AI summary
The authors study zeroth-order (ZO) fine-tuning of large language models, which replaces backpropagation with forward evaluations of the objective. They observe that existing implementations run ZO algorithms inside conventional training loops even though most of the work is repeated inference-style scoring, so they move that scoring phase onto a serving runtime (vLLM). On OPT-13B SST-2, this makes a 20k-step LoZO run 8.13x faster than the official baseline while keeping accuracy high (0.922 final evaluation accuracy). The same reorganization speeds up models from OPT-1.3B to OPT-13B and extends to a MeZO-style variant, suggesting a practical path toward scheduling lightweight adaptation as an inference-like workload rather than as a separate training job.
zeroth-order optimization, fine-tuning, large language models, backpropagation, inference workload, LoRA, runtime optimization, OPT models, adapter tuning, dynamic adaptation
Authors
Zelin Li, Caiwen Ding
Abstract
Zeroth-order (ZO) fine-tuning is attractive for large language models because it replaces backpropagation with forward objective evaluations. Existing implementations nevertheless execute ZO algorithms inside conventional training loops, even though their dominant work is repeated scoring under nearby parameter states. This creates a workload-runtime mismatch: the algorithm asks for structured inference-style scoring, while the system exposes a sequence of fragmented training-loop steps. We show that LLM ZO fine-tuning is an inference-dominated workload and execute its repeated scoring phase through a serving runtime. On OPT-13B SST-2, the resulting vLLM execution path completes the 20k-step LoZO run in 0.51 estimated training hours versus 4.15 hours for the official LoZO baseline under the matched LoRA-only setting, an 8.13x speedup, while reaching 0.922 final evaluation accuracy and 0.931 final full-validation accuracy. In core-step scaling experiments across OPT-1.3B to OPT-13B, the same runtime reorganization gives 2.34x to 7.72x speedups. A MeZO-style high-rank factorized experiment shows that the same runtime paradigm can track a MeZO-like loss trajectory while running up to 2.55x faster. More broadly, representing ZO updates as dynamic adapter states suggests a practical path toward inference-time training, where lightweight adaptation can be scheduled as an inference-like workload rather than as a separate training job.
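To make "forward objective evaluations" concrete, the following is a minimal sketch of a generic two-point zeroth-order step in the MeZO style. It is not the authors' vLLM execution path or the LoZO algorithm. The toy quadratic `loss`, the helper names `zo_step` and `run_zo`, and all hyperparameter values are assumptions chosen only for illustration. Each step scores the objective at two nearby parameter states and needs no gradient computation, which is why the dominant work resembles repeated inference.

```python
import random


def loss(theta):
    # Hypothetical toy objective standing in for a forward-only model score.
    return sum((x - 1.0) ** 2 for x in theta)


def zo_step(theta, lr, eps, seed):
    # Draw the perturbation from a seed, so only the seed (not the full
    # perturbation vector) would need to be stored between evaluations.
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in theta]

    # Two forward evaluations at nearby parameter states.
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]

    # Scalar finite-difference estimate of the directional derivative.
    g = (loss(plus) - loss(minus)) / (2.0 * eps)

    # Move along the sampled direction, scaled by the scalar estimate.
    return [t - lr * g * zi for t, zi in zip(theta, z)]


def run_zo(dim=8, steps=2000, lr=0.01, eps=1e-3, seed=0):
    theta = [0.0] * dim
    initial = loss(theta)
    for step in range(steps):
        theta = zo_step(theta, lr, eps, seed + step)
    return initial, loss(theta)


initial_loss, final_loss = run_zo()
print(f"initial loss: {initial_loss:.3f}, final loss: {final_loss:.3e}")
```

In this sketch the two `loss` calls per step are the only expensive operations. In an LLM setting each would be a batched forward pass, which is the part the abstract proposes to run through a serving runtime.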