MathDuels: Evaluating LLMs as Problem Posers and Solvers

2026-04-23

Computation and Language · Software Engineering
AI summary

The authors created MathDuels, a new benchmark that measures the math skills of language models by having them both write and solve math problems against each other. Unlike older tests that only check whether models can solve fixed problems, MathDuels lets models create harder problems to challenge others, with a separate verification step that filters out unclear or unfair problems. The authors use a statistical method to jointly estimate how good each model is at solving problems and how tough its authored problems are. Their experiments show that writing good math problems and solving them are partly separate skills, and that this approach separates models more sharply than older tests. The benchmark keeps getting harder as new models join, and results are tracked on a public leaderboard.

language models, benchmark, self-play, adversarial prompting, problem generation, Rasch model, solver ability, problem difficulty, meta-prompting, difficulty amplification
Authors
Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik
Abstract
As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models show that authoring and solving capabilities are partially decoupled, and that dual-role evaluation exposes capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.
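The Rasch model mentioned in the abstract posits that the probability a solver answers an item correctly is a logistic function of the gap between solver ability and item difficulty. The sketch below is a minimal illustration of that joint estimation, not the authors' actual pipeline: it fits abilities and difficulties by gradient ascent on the Bernoulli log-likelihood over a toy response matrix. All function and variable names here are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_rasch(responses, n_solvers, n_items, lr=0.005, epochs=300):
    """Jointly estimate solver abilities (theta) and item difficulties (b)
    under P(correct) = sigmoid(theta_s - b_i), by gradient ascent on the
    log-likelihood. responses: list of (solver, item, correct) triples."""
    theta = [0.0] * n_solvers   # solver abilities
    b = [0.0] * n_items         # problem difficulties
    for _ in range(epochs):
        g_theta = [0.0] * n_solvers
        g_b = [0.0] * n_items
        for s, i, y in responses:
            p = sigmoid(theta[s] - b[i])
            g_theta[s] += y - p   # d(loglik)/d(theta_s)
            g_b[i] -= y - p       # d(loglik)/d(b_i)
        for s in range(n_solvers):
            theta[s] += lr * g_theta[s]
        for i in range(n_items):
            b[i] += lr * g_b[i]
        # The model is identifiable only up to a shift, so anchor the
        # difficulty scale at mean zero each epoch.
        mean_b = sum(b) / n_items
        b = [x - mean_b for x in b]
    return theta, b

# Toy data: solver 1 is stronger than solver 0, item 1 harder than item 0.
random.seed(0)
true_theta, true_b = [-1.0, 1.0], [-0.5, 0.5]
responses = [
    (s, i, int(random.random() < sigmoid(true_theta[s] - true_b[i])))
    for s in range(2) for i in range(2) for _ in range(200)
]
theta, b = fit_rasch(responses, n_solvers=2, n_items=2)
print(theta, b)
```

In the paper's setup, each item would be a problem authored by one participant, so averaging the estimated difficulties of a model's authored problems gives an author-quality score alongside its solver ability.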