TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning
2026-03-04 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Computation and Language
AI summary
The authors created a new method called TaxonRL to help computers better tell apart very similar-looking animal species by thinking step-by-step about their family, genus, and species. They used a type of learning called reinforcement learning with rewards that guide the model to break down the problem into smaller parts, making its decisions easier to understand. Tested on a bird dataset, their method was more accurate than humans and also worked well with primates and marine animals. This shows that making models reason in a hierarchical way can improve fine-detail recognition and make the reasoning clearer.
vision-language models, reinforcement learning, taxonomic reasoning, hierarchical classification, contrastive learning, Group Relative Policy Optimization, Birds-to-Words dataset, fine-grained visual discrimination, cross-domain generalization
Authors
Maximilian von Klinski, Maximilian Schall
Abstract
Traditional vision-language models struggle with contrastive fine-grained taxonomic reasoning, particularly when distinguishing between visually similar species within the same genus or family. We introduce TaxonRL, a reinforcement learning approach using Group Relative Policy Optimization with intermediate rewards that decomposes the reasoning process into hierarchical taxonomic predictions. Our method incentivizes models to explicitly reason about species-level, genus-level, and family-level features before making final classifications. This structured approach is designed not only to boost accuracy but also to yield a transparent, verifiable decision-making process. On the challenging Birds-to-Words dataset, TaxonRL achieves 91.7% average accuracy, exceeding human performance (77.3%) while generating interpretable reasoning traces. We demonstrate strong cross-domain generalization, showing substantial gains in primate and marine species verification. Our results establish that enforcing structured, hierarchical reasoning provides a powerful and transferable framework for fine-grained visual discrimination.
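
To illustrate how intermediate rewards could feed into Group Relative Policy Optimization, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the reward design, so the per-level weights, the dictionary layout of a parsed model response, and the function names `taxon_reward` and `group_relative_advantages` are all assumptions made for illustration. The only part taken from the general GRPO formulation is that each sampled response's reward is normalized against the mean and standard deviation of its sampling group.

```python
from statistics import mean, pstdev

# Hypothetical weights: the source does not state how intermediate
# and final rewards are balanced.
LEVEL_WEIGHTS = {"family": 0.1, "genus": 0.1, "species": 0.1}
FINAL_WEIGHT = 0.7


def taxon_reward(pred: dict, gold: dict) -> float:
    """Partial credit for each correct taxonomic level, plus credit
    for the final answer (assumed reward shape, not the paper's)."""
    reward = sum(
        weight
        for level, weight in LEVEL_WEIGHTS.items()
        if pred.get(level) is not None and pred.get(level) == gold.get(level)
    )
    if pred.get("answer") is not None and pred.get("answer") == gold.get("answer"):
        reward += FINAL_WEIGHT
    return reward


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize rewards within one group of sampled responses:
    advantage_i = (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical gold labels and parsed responses for one prompt.
    gold = {"family": "Paridae", "genus": "Poecile",
            "species": "Poecile atricapillus", "answer": "different"}
    group = [
        dict(gold),                                      # everything correct
        {**gold, "species": "Poecile carolinensis"},     # wrong species only
        {"family": "Paridae", "answer": "same"},         # family only
    ]
    rewards = [taxon_reward(p, gold) for p in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under these assumed weights, a response that gets the family and genus right but the species wrong still earns more than one that only identifies the family, which is the kind of graded signal that intermediate rewards are meant to provide.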