Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
2026-07-01 • Software Engineering
Software Engineering, Artificial Intelligence
AI summary
The authors studied three coding benchmarks that test how well automated coding agents improve software performance by comparing their patches to official ones. They found that many official benchmark patches don't consistently show improvements across different machines, and scoring rules can unfairly affect which agents seem better. They also discovered that most tasks are already matched or beaten by at least one public submission, meaning some rankings might be misleading. Their work helps clarify which benchmark results are more reliable and highlights hidden gaps in performance evaluation.
coding benchmarks, performance optimization, runtime performance, benchmark scoring rules, software patches, runtime instability, leaderboards, Google Cloud machines, task validity, automated coding agents
Authors
Zhi Chen, Zhensu Sun, Yuling Shi, David Lo, Lingxiao Jiang
Abstract
Repository-level performance-optimization benchmarks such as GSO, SWE-Perf and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and official reference patches. Their leaderboard scores are increasingly used as evidence of coding-agent progress, but those scores can conflate runtime instability, benchmark-specific scoring rules, and how many tasks are already solved by at least one public submission. We audit these issues across the three benchmarks. First, we replay the official reference patches for 740 code optimization tasks across four common types of Google Cloud machines. Most benchmark tasks can be replayed, but their reference patches satisfy the original benchmark validity rules in every cross-machine replay for only 39/102 GSO tasks, 11/140 SWE-Perf tasks, and 411/498 SWE-fficiency tasks; SWE-Perf is especially fragile because many reference patches produce close-to-zero runtime changes. Second, we show that public submission rankings depend strongly on the benchmark scoring rule. Among eight public submissions shared by GSO and SWE-fficiency, the official rankings disagree on 9 of 28 pairwise submission comparisons, and SWE-fficiency's leaderboard scoring rule assigns the worst ten tasks overly high score weights of 58.5%-82.8%. Third, looking across 10 public submissions for each task, we find that at least one submission matches or beats the reference patch on 85.3% (384/450) of replay-valid GSO and SWE-fficiency tasks, and beats the unoptimized base code on 99.8% (449/450). Our study complements leaderboard scores by identifying tasks with more reliable performance signals, quantifying per-task score contributions, and exposing the remaining performance gaps that are hidden by aggregate rankings.
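The pairwise comparison count in the abstract follows from the number of shared submissions: eight submissions give 8 × 7 / 2 = 28 unordered pairs. The sketch below shows one plausible way to count how many of those pairs two leaderboards order differently. It is an illustration only, not the authors' evaluation code: the function name `pairwise_disagreements`, the submission labels, and the rank values are all hypothetical, and ties are assumed to count as agreement.

```python
from itertools import combinations


def pairwise_disagreements(rank_a, rank_b):
    """Count submission pairs that two rankings order differently.

    rank_a, rank_b: dicts mapping submission name -> rank (1 = best).
    Only submissions present in both rankings are compared.
    Returns (disagreeing_pairs, total_pairs).
    """
    shared = sorted(set(rank_a) & set(rank_b))
    disagreements = 0
    total = 0
    for x, y in combinations(shared, 2):
        total += 1
        # A negative product means the two rankings put x and y in
        # opposite orders. Ties (product == 0) count as agreement here,
        # which is an assumption of this sketch.
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0:
            disagreements += 1
    return disagreements, total


# Hypothetical placeholder data: eight submissions, with the second
# ranking swapping only the top two positions of the first.
names = [f"s{i}" for i in range(1, 9)]
rank_a = {name: position for position, name in enumerate(names, start=1)}
rank_b = dict(rank_a)
rank_b["s1"], rank_b["s2"] = rank_b["s2"], rank_b["s1"]

result = pairwise_disagreements(rank_a, rank_b)
print(result)  # (1, 28)
```

With real leaderboard data, the two dictionaries would hold each benchmark's official rank for the shared submissions; the abstract reports 9 disagreeing pairs out of 28 for GSO and SWE-fficiency.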