Awakening the Sleeping Agent: Lean-Specific Agentic Data Reactivates General Tool Use in Goedel Prover

2026-04-09Artificial Intelligence

Artificial Intelligence
AI summary

The authors studied a math-focused AI model that, after heavy fine-tuning on one area, lost its ability to use external tools even when asked. They found that this loss is not permanent, as retraining the model on just 100 examples of using tools in a related math setting restored much of this ability. This small retraining also improved the model's performance on different tasks outside the retraining domain. Their work shows that fine-tuning can hide useful skills but a little targeted data can bring those skills back.

fine-tuningformal mathematicstool useGoedel-Prover-V2Lean theorem proveragentic collapsefunction-calling accuracyMathlibbenchmarkProofNet
Authors
Jui-Hui Chung, Hongzhou Lin, Lai Jiang, Shange Tang, Chi Jin
Abstract
Heavy supervised fine-tuning on a target domain can strongly suppress capabilities that were present in the base model. We study this phenomenon in formal mathematics using Goedel-Prover-V2, an open-source model heavily trained on 1.8 million formal-math examples. After domain specialization, the model almost completely loses its ability to produce valid tool calls, even when explicitly instructed to use tools, dropping from 89.4% function-calling accuracy in the base model to nearly 0%. We ask whether this agentic collapse is permanent or instead reversible. To answer this question, we fine-tune the specialized model on a small amount of Lean-specific tool-use data. Remarkably, as few as 100 agentic traces are sufficient to restore strong tool-calling behavior. Importantly, this recovery is not the result of reward hacking or benchmark-specific optimization: the recovery data is entirely drawn from the Lean setting, where the model uses natural-language queries to search the Mathlib library for relevant theorems and lemmas, yet the regained capability transfers well beyond that domain. In particular, these same 100 Lean-specific traces improve performance on the Berkeley Function Calling Leaderboard from near zero to 83.8%, approaching the base model's 89.4% despite the mismatch in task distribution and protocol. The recovered capability is also practically useful in-domain. On ProofNet, pass@32 improves from 21.51% to 25.81%. Together, these results show that heavy domain supervised fine-tuning can suppress general tool-use ability without permanently erasing it, and that a small amount of domain-specific agentic data can awaken dormant tool-use capabilities.