Sparse Contrastive Learning for Content-Based Cold Item Recommendation
2026-04-14 • Information Retrieval
AI summary
The authors address the problem of recommending new items that have no prior user interactions in collaborative filtering systems. Instead of trying to fit new items into the existing user-item embedding space, they use item content alone to predict item similarity based on user preferences. They propose a new training method called SEMCo that improves how relevant items are identified by focusing learning on important examples while ignoring less useful ones. Their approach performs better than previous cold-start techniques and may help make recommendations fairer across items.
collaborative filtering · cold-start problem · embedding space · content-based recommendation · item similarity · sampled softmax · α-entmax activation · knowledge distillation · ranking accuracy · recommender systems
Authors
Gregor Meehan, Johan Pauwels
Abstract
Item cold-start is a pervasive challenge for collaborative filtering (CF) recommender systems. Existing methods often train cold-start models by mapping auxiliary item content, such as images or text descriptions, into the embedding space of a CF model. However, such approaches can be limited by the fundamental information gap between CF signals and content features. In this work, we propose to avoid this limitation with purely content-based modeling of cold items, i.e. without alignment with CF user or item embeddings. We instead frame cold-start prediction in terms of item-item similarity, training a content encoder to project into a latent space where similarity correlates with user preferences. We define our training objective as a sparse generalization of sampled softmax loss with the $\alpha$-entmax family of activation functions, which allows for sharper estimation of item relevance by zeroing gradients for uninformative negatives. We then describe how this Sampled Entmax for Cold-start (SEMCo) training regime can be extended via knowledge distillation, and show that it outperforms existing cold-start methods and standard sampled softmax in ranking accuracy. We also discuss the advantages of purely content-based modeling, particularly in terms of equity of item outcomes.
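To illustrate the sparsity property the abstract relies on, here is a minimal NumPy sketch of sparsemax, the $\alpha = 2$ member of the entmax family (Martins & Astudillo, 2016). The scores and item indices are hypothetical, and this is not the paper's implementation; it only demonstrates how a sparse activation assigns exactly zero probability to low-scoring sampled negatives, so they contribute no gradient to the loss, whereas softmax would keep every negative active.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (alpha-entmax with alpha=2): Euclidean projection of the
    score vector z onto the probability simplex. Unlike softmax, it can
    assign exactly zero probability to low-scoring entries."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]            # scores in descending order
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum    # which sorted entries stay nonzero
    k_max = k[support][-1]                 # support size
    tau = (cumsum[k_max - 1] - 1) / k_max  # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

# Hypothetical similarity scores: index 0 is the positive item,
# the remaining entries are sampled negatives.
scores = np.array([2.0, 1.6, 0.1, -1.5])
p = sparsemax(scores)
# The two low-scoring negatives receive probability exactly 0, so their
# gradients vanish in the contrastive objective; softmax would instead
# give every negative a small but nonzero weight.
```

The same thresholding idea generalizes to intermediate values of $\alpha$ (e.g. 1.5-entmax), which interpolate between dense softmax ($\alpha = 1$) and sparsemax.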