Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

2026-05-11
Computer Vision and Pattern Recognition

AI summary

The authors study reinforcement-learning post-training of text-to-image models, focusing on reward hacking, where a model exploits flaws in an imperfect reward function instead of genuinely improving. They show that the prompt-level standard-deviation normalization used in GRPO can miscalibrate policy updates, and they propose Super-Linear Advantage Shaping (SLAS), which revisits the update from an information-geometry perspective. By weighting the Fisher-Rao metric with the advantage, SLAS reshapes the local policy space so that informative, high-advantage updates are amplified while misleading, low-advantage ones are suppressed, making training more stable and effective. Their experiments show that SLAS outperforms previous methods in training speed, robustness, and quality of generated images, especially on out-of-domain benchmarks.

Post-training methods · Reinforcement learning · Text-to-image (T2I) models · Reward hacking · Group Relative Policy Optimization (GRPO) · Information geometry · Fisher-Rao metric · Advantage weighting · Batch normalization · Policy optimization
Authors
Haoyuan Sun, Jing Wang, Yuxin Song, Yu Lu, Bo Fang, Yifu Luo, Jun Yin, Pengyu Zeng, Miao Zhang, Tiantian Zhang, Xueqian Wang, Shijian Lu
Abstract
Recently, post-training methods based on reinforcement learning, most notably Group Relative Policy Optimization (GRPO), have emerged as a robust paradigm for further advancing text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than achieving genuine performance gains. In this work, we identify that prompt-level normalization can miscalibrate updates: directly removing the prompt-level standard-deviation term yields an optimal policy-ascent direction that is linear in the advantage, but such a linear update still limits the separation of genuine signal from noise. To mitigate these issues, we propose Super-Linear Advantage Shaping (SLAS), which revisits the functional update from an information-geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.
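
To make the normalization and shaping choices concrete, here is a minimal numerical sketch. It is not the authors' implementation: the function names, the toy reward values, and the power-law form `sign(A) * |A|**p` with exponent `p` are illustrative assumptions standing in for the weighting the paper derives from the advantage-weighted Fisher-Rao metric.

```python
# Minimal sketch (not the paper's code) contrasting GRPO-style prompt-level
# advantage normalization with mean-centering and a super-linear shaping step.
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO: normalize rewards within one prompt group by their
    mean and standard deviation. The std term can miscalibrate updates
    when the within-group reward spread is dominated by noise."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def mean_centered_advantages(rewards):
    """Dropping the prompt-level std term (as analyzed in the paper) gives
    an update linear in the advantage, but genuine signal and noise are
    still scaled identically."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def slas_shaped_advantages(rewards, p=2.0):
    """Super-linear shaping sketch: grow faster than linearly with |A| so
    high-advantage directions are amplified and low-advantage (likely
    illusory) ones are shrunk. The form sign(A) * |A|**p is an assumed
    stand-in for the geometry-derived weighting in the paper."""
    a = mean_centered_advantages(rewards)
    return np.sign(a) * np.abs(a) ** p

def batch_normalize(advantages):
    """Batch-level normalization across all prompt groups, keeping the
    shaped advantages on a stable scale under varying reward magnitudes."""
    a = np.asarray(advantages, dtype=float)
    return (a - a.mean()) / (a.std() + 1e-8)

if __name__ == "__main__":
    # One prompt group: one clearly better sample plus small reward noise.
    group = [0.82, 0.49, 0.51, 0.50]
    print("GRPO        :", np.round(grpo_advantages(group), 3))
    print("mean-center :", np.round(mean_centered_advantages(group), 3))
    print("SLAS-shaped :", np.round(batch_normalize(slas_shaped_advantages(group)), 3))
```

On the toy group, the three near-tied samples receive almost identical shaped advantages while the clearly better sample stands out, illustrating how a super-linear map separates a genuine reward gap from within-group noise before batch-level rescaling.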