Aletheia tackles FirstProof autonomously

2026-02-24, Artificial Intelligence

Artificial Intelligence, Computation and Language, Machine Learning
AI summary

The authors evaluated Aletheia, a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the challenge's allowed timeframe, Aletheia autonomously solved 6 of the 10 problems (2, 5, 7, 8, 9, 10) according to majority expert assessment; the experts were not unanimous on Problem 8. The authors explain how they interpreted FirstProof, disclose details of their experiments and evaluation, and make the raw prompts and outputs available online.

mathematics research agent, AI problem solving, FirstProof challenge, Gemini 3 Deep Think, autonomous solving, expert assessment, machine learning, benchmark evaluation
Authors
Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V. Le, Thang Luong
Abstract
We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as our evaluation. Raw prompts and outputs are available at https://github.com/google-deepmind/superhuman/tree/main/aletheia.