CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
2026-05-27 • Artificial Intelligence
AI summary
The authors introduce Contrastive Reflection (CORE), a method that helps language models improve their reasoning from fewer training examples and fewer attempts. CORE compares past reasoning attempts, identifies what distinguished the successful ones from the unsuccessful ones, and distills those differences into short natural-language descriptions. In tests on four reasoning tasks, CORE improved faster than the approaches it was compared against while using fewer model rollouts and fewer prompt tokens. The authors suggest that these natural-language insights offer a simpler and more interpretable way for models to get better at reasoning.
language models, reasoning tasks, non-parametric learning, Contrastive Reflection (CORE), reasoning traces, parametric methods, prompt optimization, model rollouts, self-improvement, context efficiency
Authors
Linas Nasvytis, Simon Jerome Han, Ben Prystawski, Satchel Grant, Noah D. Goodman, Judith E. Fan
Abstract
Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
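The abstract describes CORE only at a high level, so the following is a minimal Python sketch of the general idea of contrasting successful and unsuccessful traces, not the authors' implementation. Every name here (`Trace`, `contrastive_pairs`, `reflection_prompt`, `update_insights`, `build_prompt`, the `reflect` callable standing in for a language-model call, the binary reward, and the `max_insights` cap) is an assumption introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Trace:
    problem_id: str
    reasoning: str
    reward: float  # assumption: 1.0 if the verifier accepted the answer, else 0.0


def contrastive_pairs(traces: List[Trace]) -> List[Tuple[Trace, Trace]]:
    """Pair every successful trace with every failed trace on the same problem."""
    by_problem: Dict[str, List[Trace]] = {}
    for trace in traces:
        by_problem.setdefault(trace.problem_id, []).append(trace)
    pairs: List[Tuple[Trace, Trace]] = []
    for group in by_problem.values():
        wins = [t for t in group if t.reward > 0]
        losses = [t for t in group if t.reward <= 0]
        pairs.extend((w, l) for w in wins for l in losses)
    return pairs


def reflection_prompt(success: Trace, failure: Trace) -> str:
    """Hypothetical prompt asking a model to explain the contrast."""
    return (
        "Compare two attempts at the same problem.\n"
        f"Successful attempt:\n{success.reasoning}\n\n"
        f"Unsuccessful attempt:\n{failure.reasoning}\n\n"
        "State in one sentence the strategy or constraint that explains the difference."
    )


def update_insights(
    traces: List[Trace],
    reflect: Callable[[str], str],  # stand-in for a language-model call
    insights: List[str],
    max_insights: int = 8,  # assumed cap to keep the prompt compact
) -> List[str]:
    """Return a new insight list extended with one insight per contrastive pair."""
    merged = list(insights)
    for success, failure in contrastive_pairs(traces):
        insight = reflect(reflection_prompt(success, failure)).strip()
        if insight and insight not in merged:
            merged.append(insight)
    return merged[-max_insights:]


def build_prompt(question: str, insights: List[str]) -> str:
    """Prepend stored insights to a new problem; no weights are updated."""
    if not insights:
        return question
    bullets = "\n".join(f"- {i}" for i in insights)
    return f"Insights from earlier attempts:\n{bullets}\n\nProblem: {question}"
```

In this sketch the learned state is just a short list of strings, which is one way to read the abstract's point about compact, interpretable storage. How CORE actually selects traces, phrases reflections, or prunes insights is not stated in the abstract.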