MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
2026-04-20 • Artificial Intelligence
Artificial Intelligence · Digital Libraries · Information Retrieval · Machine Learning
AI summary
The authors introduce MathNet, a large collection of Olympiad-level math problems drawn from competitions across many countries and languages, designed to test how well AI models can solve math problems and retrieve related ones. The dataset contains over 30,000 expert-authored problems with solutions, along with expert-curated pairs of equivalent and similar problems for evaluating retrieval. Testing state-of-the-art models, the authors find that both solving and retrieving math problems remain difficult. They also show that when models use retrieved problems to help solve new ones, solution quality depends heavily on retrieval quality. MathNet is the largest and most diverse resource for evaluating mathematical problem solving and retrieval in AI.
Mathematical problem solving · Multimodal datasets · Multilingual datasets · Math retrieval · Generative models · Embedding models · Olympiad-level math · Retrieval-augmented generation · Benchmark datasets · Mathematical reasoning
Authors
Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei, Sultan Albarakati, William T. Freeman, Antonio Torralba
Abstract
Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models remain challenged (78.4% accuracy for Gemini-3.1-Pro and 69.3% for GPT-5), while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.