Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning

2026-06-03, Computer Vision and Pattern Recognition

AI summary

The authors introduce CalorieBench-80K, a large-scale benchmark of food images annotated with calorie labels and dietary advice, including step-by-step (Chain-of-Thought) reasoning for calorie estimates. They also propose Food-R1, a single model trained on multiple food-related tasks at once. The model is first instruction-tuned on the reasoning annotations and then refined with reinforcement fine-tuning using Group Relative Policy Optimization (GRPO). In experiments on CalorieBench-80K and other benchmarks, Food-R1 outperforms the baselines it is compared against across food-related tasks. The code, model weights, and benchmark annotations are released for others to use.

Vision-Language Models, supervised fine-tuning, Chain-of-Thought, calorie estimation, multi-task learning, reinforcement fine-tuning, Group Relative Policy Optimization, food image analysis, benchmark dataset, instruction tuning
Authors
Yu Zhu, Yongkang Li, Wenjie Zhu, Haoyi Jiang, Wenyu Liu, Wei Yang, Bin Li, Xinggang Wang
Abstract
Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To address these issues, we introduce CalorieBench-80K, a large-scale benchmark with curated calorie labels and dietary advice annotations. To the best of our knowledge, it is the first food image benchmark to incorporate Chain-of-Thought (CoT) annotations for calorie reasoning. We also propose Food-R1, a unified food VLM trained in a multi-task learning paradigm to equip the model with broad capabilities. Food-R1 undergoes CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) using Group Relative Policy Optimization (GRPO) to improve reasoning and performance. Experiments on CalorieBench-80K and representative benchmarks show that Food-R1 consistently outperforms strong baselines across food-related tasks. The code, model weights, and benchmark annotations are available at the project repository.
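
The abstract does not spell out how GRPO scores sampled answers, so the following is only a minimal sketch of the general group-relative idea, not the authors' implementation. For one prompt, several answers are sampled, each receives a scalar reward, and each reward is normalized by the mean and standard deviation of its own group. The reward function `calorie_reward`, the use of a population standard deviation, and all numbers are assumptions made for illustration; the rewards actually used for Food-R1 are not described here.

```python
from statistics import mean, pstdev


def calorie_reward(predicted_kcal: float, true_kcal: float) -> float:
    """Hypothetical reward (an assumption, not from the paper):
    1 minus the relative error, clipped to the range [0, 1]."""
    if true_kcal <= 0:
        raise ValueError("true_kcal must be positive")
    rel_err = abs(predicted_kcal - true_kcal) / true_kcal
    return max(0.0, 1.0 - rel_err)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward by the mean and std of its own sampling group.

    The population standard deviation is a choice made for this sketch;
    eps guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers for one image whose label is 500 kcal (made-up numbers).
predictions = [450.0, 500.0, 700.0, 1200.0]
rewards = [calorie_reward(p, 500.0) for p in predictions]
advantages = group_relative_advantages(rewards)
```

In this sketch, answers closer to the labeled value receive a positive advantage and answers further away receive a negative one, so no separately learned value model is needed to provide a baseline.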