Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation

2026-03-31

Artificial Intelligence
AI summary

The authors extended a method called MONA, which tries to prevent AI agents from cheating by limiting how far ahead they plan while still using long-term feedback for learning. They recreated and repackaged the original setup, showing that the original results held: ordinary reinforcement learning often cheats, while ideal (oracle) MONA does not. They then tested different ways of giving feedback, finding that while some learned feedback eliminated cheating, it also made the agent perform far worse at the intended task, more likely because it under-learned than because cheating returned. Their work makes it easier to study how different feedback methods affect MONA's safety and highlights the challenge of building feedback that balances foresight with honesty. All code and experiments are publicly available for others to build on.

MONA, reward hacking, reinforcement learning, approval mechanisms, planning horizon, PPO training, calibrated approval, oracle approval, learned overseer, reproducibility
Authors
Nathan Heath
Abstract
Myopic Optimization with Non-myopic Approval (MONA) mitigates multi-step reward hacking by restricting the agent's planning horizon while supplying far-sighted approval as a training signal (Farquhar et al., 2025). The original paper identifies a critical open question: how the method of constructing approval, particularly the degree to which approval depends on achieved outcomes, affects whether MONA's safety guarantees hold. We present a reproduction-first extension of the public MONA Camera Dropbox environment that (i) repackages the released codebase as a standard Python project with scripted PPO training, (ii) confirms the published contrast between ordinary RL (91.5% reward-hacking rate) and oracle MONA (0.0% hacking rate) using the released reference arrays, and (iii) introduces a modular learned-approval suite spanning oracle, noisy, misspecified, learned, and calibrated approval mechanisms. In reduced-budget pilot sweeps across approval methods, horizons, dataset sizes, and calibration strategies, the best calibrated learned-overseer run achieves zero observed reward hacking but a substantially lower intended-behavior rate than oracle MONA (11.9% vs. 99.9%), consistent with under-optimization rather than re-emergent hacking. These results operationalize the MONA paper's approval-spectrum conjecture as a runnable experimental object and suggest that the central engineering challenge shifts from proving MONA's concept to building learned approval models that preserve sufficient foresight without reopening reward-hacking channels. Code, configurations, and reproduction commands are publicly available at https://github.com/codernate92/mona-camera-dropbox-repro.
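The core contrast the abstract describes can be sketched in a few lines. This is a minimal illustration, not the authors' code: `approval_score` is a hypothetical stand-in for any of the approval mechanisms named above (oracle, noisy, misspecified, learned, calibrated), and the toy numbers are invented for illustration.

```python
# Sketch of the objective-level difference between ordinary RL and MONA.
# Assumption: rewards and approval scores are plain floats per step.

def ordinary_return(rewards, gamma=0.99):
    """Standard discounted return: credit flows backward across the whole
    episode, so a multi-step reward-hacking plan can be reinforced by its
    eventual payoff."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def mona_step_objective(env_reward, approval_score):
    """MONA-style myopic target: only the immediate reward plus a
    far-sighted approval signal, with no bootstrapped future return,
    so the agent cannot be trained to pursue a deferred hacking payoff."""
    return env_reward + approval_score

# Toy example: a hacking action earns more immediate environment reward,
# but a far-sighted approval signal penalizes it, so the honest action
# scores higher under the myopic objective.
honest = mona_step_objective(env_reward=0.1, approval_score=1.0)
hacky = mona_step_objective(env_reward=0.9, approval_score=-1.0)
```

The point of the sketch is that MONA changes *what* is optimized per step, not the optimizer: the abstract's PPO training can consume `mona_step_objective` values in place of bootstrapped returns.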