Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation

2026-02-16

Computation and Language · Information Retrieval
AI summary

The authors propose a reinforcement learning method that teaches large language models to generate lists of news search queries reflecting a user's interests, inferred from multiple types of cross-domain signals. They frame query-list generation as policy optimization and observe consistent improvements as inference-time sampling and model capacity increase, exhibiting scaling-like behavior. To make the approach practical, they distill the large model into a compact one with little loss in performance. Offline experiments and online A/B tests in a real-world system show better interest modeling and improved news recommendations.

news recommendation, reinforcement learning, large language models, cross-domain signals, policy optimization, GRPO, on-policy distillation, model scaling, A/B testing, user interest modeling
Authors
Mengdan Zhu, Yufan Zhao, Tao Di, Yulan Yan, Liang Zhao
Abstract
News recommendation plays a critical role in online news platforms by helping users discover relevant content. Cross-domain news recommendation further requires inferring users' underlying information needs from heterogeneous signals that often extend beyond direct news consumption. A key challenge lies in moving beyond surface-level behaviors to capture deeper, reusable user interests while maintaining scalability in large-scale production systems. In this paper, we present a reinforcement learning framework that trains large language models to generate high-quality lists of interest-driven news search queries from cross-domain user signals. We formulate query-list generation as a policy optimization problem and employ GRPO with multiple reward signals. We systematically study two compute dimensions, inference-time sampling and model capacity, and empirically observe consistent improvements with increased compute that exhibit scaling-like behavior. Finally, we perform on-policy distillation to transfer the learned policy from a large, compute-intensive teacher to a compact student model suitable for scalable deployment. Extensive offline experiments, ablation studies, and large-scale online A/B tests in a production news recommendation system demonstrate consistent gains in both interest modeling quality and downstream recommendation performance.
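To make the GRPO setup concrete, the sketch below shows the group-relative advantage computation that characterizes GRPO: several query lists are sampled per prompt, each is scored, and each reward is normalized against its own sampled group rather than a learned value function. The function names, the reward-blending scheme, and the choice of signals are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of GRPO-style advantage estimation, assuming one
# scalar reward per sampled completion. Names here are hypothetical.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of completions sampled for the
    same prompt: advantage_i = (r_i - mean(r)) / (std(r) + eps).
    GRPO uses these in place of critic-based value estimates."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

def combine_rewards(signals, weights):
    """Blend multiple reward signals (the paper mentions multiple
    reward signals; a weighted sum is one plausible combination,
    assumed here for illustration)."""
    return sum(w * s for w, s in zip(weights, signals))

# Example: three sampled query lists for one user prompt, each scored
# by two hypothetical signals (e.g., relevance and diversity).
per_sample_signals = [(0.9, 0.4), (0.5, 0.8), (0.2, 0.3)]
rewards = [combine_rewards(s, weights=(0.7, 0.3)) for s in per_sample_signals]
advantages = group_relative_advantages(rewards)
```

A completion scoring above its group mean gets a positive advantage and is reinforced; one below the mean is discouraged, so no separate value model is needed.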