ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control
2026-04-22 • Machine Learning
Machine Learning · Computer Vision and Pattern Recognition
AI summary
The authors address a limitation in reinforcement learning (RL) for generative models, where usually only one combined reward is used, making it hard to balance different goals. They introduce ParetoSlider, a method that trains one model to understand various trade-offs between competing objectives at the same time. This allows users to choose how to balance goals during use, without needing to retrain the model for each preference. Their approach works well across different model architectures and offers better control compared to traditional methods that fix rewards beforehand.
Reinforcement Learning · Generative Models · Multi-Objective RL · Pareto Front · Diffusion Models · Reward Scalarization · Flow Matching · Preference Conditioning
Authors
Shelly Golan, Michael Finkelson, Ariel Bereslavsky, Yotam Nitzan, Or Patashnik
Abstract
Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of "early scalarization" collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals, such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.
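To make the contrast between early scalarization and preference conditioning concrete, here is a minimal sketch. It is not the paper's implementation; the function names and two-objective setup (prompt adherence vs. source fidelity) are illustrative assumptions. The key difference is that the baseline fixes the weight vector once, while the preference-conditioned training loop draws a fresh weight vector each step and passes it to the model as a conditioning signal.

```python
import random

def scalarize(rewards, weights):
    """Collapse multiple rewards into one scalar via a weighted sum.
    With a fixed `weights`, this is the "early scalarization" baseline:
    the trade-off is baked in at training time."""
    return sum(w * r for w, r in zip(weights, rewards))

def preference_conditioned_step(reward_fns, sample):
    """One hypothetical training step in the ParetoSlider style:
    sample a random preference over two objectives, condition the
    model on it, and optimize the correspondingly weighted reward.
    (Model update itself is elided; this only shows the signal.)"""
    # Draw a random trade-off point, e.g. prompt adherence (alpha)
    # vs. source fidelity (1 - alpha) in image editing.
    alpha = random.random()
    weights = (alpha, 1.0 - alpha)
    # In the full method, `weights` would also be fed to the diffusion
    # model as a conditioning input, so at inference time the user can
    # pick any trade-off without retraining.
    rewards = tuple(fn(sample) for fn in reward_fns)
    return weights, scalarize(rewards, weights)
```

At inference, the user simply supplies a weight vector of their choice as the conditioning input, selecting a point on the learned Pareto front.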