AI summary
The authors created GeoCodeBench, a benchmark that tests how well AI models can write code for 3D vision tasks, which are central to advanced research. They collected real problems from recent papers, focused on core 3D geometry components, and built tests that automatically check whether the code works. Evaluating several AI models showed that most have a low success rate, with the best reaching only 36.6% correctness, indicating this remains a hard area. They also found that giving a model only part of a paper (up to the Method section) led to better results than providing the entire paper, pointing to difficulties in understanding long, complex scientific texts. This benchmark helps measure and guide AI progress specifically on 3D scientific coding.
AI-assisted coding, 3D geometric vision, benchmark, unit tests, GPT-5, geometric transformations, algorithm implementation, scientific comprehension, Method section, code evaluation
Authors
Wenyi Li, Renkai Luo, Yue Yu, Huan-ang Gao, Mingju Gao, Li Yuan, Chaoyou Fu, Hao Zhao
Abstract
AI-assisted coding has rapidly reshaped software practice and research workflows, yet today's models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, research in our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: we first use a tool to propose candidate functions from official repositories, then carefully screen them by hand to select core 3D geometric components. For every target, we generate diverse, edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains only a 36.6% pass rate, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these axes, but research-oriented tasks are markedly harder. Context ablations further show that "more paper text" is not always better: truncating the input at the Method section statistically significantly outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension. Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.
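To make the task format concrete: each benchmark problem pairs a function stub for a core 3D geometric component with auto-generated, edge-case unit tests. The sketch below is an illustrative example of what such a task and its tests might look like; the function name, docstring, and tests are hypothetical and not drawn from the actual benchmark.

```python
# Illustrative sketch of a GeoCodeBench-style fill-in-the-function task:
# a core 3D geometry routine scored by automatically generated unit tests.
# All names and tests here are hypothetical examples, not benchmark items.
import numpy as np

def quaternion_to_rotation_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The quaternion is normalized first, so non-unit inputs are handled.
    """
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)  # edge case: non-unit quaternions
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# Edge-case unit tests of the kind the benchmark generates automatically.
def test_identity_quaternion():
    # The identity quaternion must map to the identity rotation.
    assert np.allclose(quaternion_to_rotation_matrix([1, 0, 0, 0]), np.eye(3))

def test_unnormalized_input_still_yields_rotation():
    # A non-unit quaternion must still produce a proper rotation matrix.
    R = quaternion_to_rotation_matrix([2.0, 1.0, -0.5, 0.3])
    assert np.allclose(R @ R.T, np.eye(3))
    assert np.isclose(np.linalg.det(R), 1.0)

test_identity_quaternion()
test_unnormalized_input_still_yields_rotation()
```

Because tests like these check mathematical invariants (orthonormality, unit determinant, known fixed points) rather than string-matching a reference solution, scoring can be fully automatic and reproducible, as the benchmark requires.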