Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models
2026-03-12 • Machine Learning
AI summary
The authors propose a new way to fine-tune language models that helps them learn better by focusing on the overall sequences they produce rather than just predicting the next token. Their method, called energy-based fine-tuning (EBFT), uses a strided sampling technique to generate many possible continuations in parallel and improves the model based on features extracted from these sequences. They show that this approach matches a reinforcement-learning baseline (RLVR) and outperforms standard supervised fine-tuning, especially in tasks like coding and translation. The method also has a theoretical foundation linking it to known energy-based statistical models.
Cross-entropy, Language model, Fine-tuning, Sequence-level learning, Energy-based models, Policy gradient, Teacher forcing, Rollouts, Feature matching, KL divergence
Authors
Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li, Nicolo Fusi, Yilun Du, Sham M. Kakade, Carles Domingo-Enrich
Abstract
Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose energy-based fine-tuning (EBFT), which uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, batches feature extraction over these rollouts, and uses the resulting embeddings to perform an on-policy policy-gradient update. We present a theoretical perspective connecting EBFT to KL-regularized feature-matching and energy-based modeling. Empirically, across Q&A coding, unstructured coding, and translation, EBFT matches RLVR and outperforms SFT on downstream accuracy while achieving a lower validation cross-entropy than both methods.
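The core idea in the abstract, matching sequence-level feature statistics via an on-policy policy-gradient update, can be sketched on a toy problem. The sketch below is an illustrative assumption, not the paper's implementation: it uses a categorical "policy" over five candidate sequences, hypothetical 3-d feature vectors `phi` standing in for semantic embeddings, and a per-rollout reward equal to the negative squared distance between a rollout's features and target feature statistics `mu_target`, optimized with a REINFORCE-style gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a categorical policy over 5 candidate sequences, each with a
# hypothetical 3-d feature vector phi(y) (stand-in for a semantic embedding).
phi = rng.normal(size=(5, 3))   # features of each candidate sequence
logits = np.zeros(5)            # policy parameters (softmax logits)
mu_target = phi[2]              # target feature statistics (assumed known)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr, n_rollouts = 0.5, 256
for step in range(200):
    p = softmax(logits)
    ys = rng.choice(5, size=n_rollouts, p=p)       # on-policy rollouts
    # Per-rollout reward: negative squared feature distance to the target,
    # so maximizing reward matches the model's features to mu_target.
    r = -((phi[ys] - mu_target) ** 2).sum(axis=1)
    r = r - r.mean()                               # baseline for variance
    # REINFORCE gradient of E[r]: d/d logit_k = E[r * (1{y=k} - p_k)]
    grad = np.array([
        (r * ((ys == k).astype(float) - p[k])).mean() for k in range(5)
    ])
    logits += lr * grad                            # gradient ascent

best = int(np.argmax(softmax(logits)))
print(best)  # the policy concentrates on the feature-matching sequence
```

The baseline subtraction and batched rollouts loosely mirror the abstract's description of batching feature extraction over many rollouts before a single policy-gradient update; the paper's strided block-parallel sampling over nested prefixes and KL regularization are not modeled here.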