Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

2026-04-28

Computation and Language; Software Engineering
AI summary

The authors address the challenge of improving the setups (called harnesses) that help coding AI agents work better with tools and code environments. They propose a new method called Agentic Harness Engineering (AHE) that adds clear ways to track and understand each change made to these setups, so the AI can learn from its edits without just guessing. Their approach observes changes, outcomes, and decisions at three matched levels to make harness improvements more reliable. Testing shows that AHE improves coding-agent performance beyond existing human-made and automated methods, and these improvements carry over to different models without extra tweaks.

coding agents, harness engineering, Agentic Harness Engineering (AHE), observability, trajectory inspection, component editing, pass@1, Terminal-Bench 2, self-declared prediction, benchmark transfer
Authors
Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui
Abstract
Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-token trajectories, and edits whose effects are hard to attribute in the next round's outcomes. We introduce Agentic Harness Engineering (AHE), a framework that automates harness-level evolution by instrumenting the three stages of any engineering loop (component editing, trajectory inspection, and decision making) with matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it achieves the highest aggregate success rate while using 12% fewer tokens than the seed harness, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. These results position observability-driven evolution as a practical pathway to keep coding-agent harnesses continually improving.
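The "falsifiable contract" described in the abstract — each edit carries a self-declared prediction that is later checked against the next round's task-level outcomes — can be sketched as a small verification loop. This is an illustrative reconstruction, not the paper's implementation; all class names, fields, and the sign-match verification rule are assumptions.

```python
from dataclasses import dataclass

@dataclass
class HarnessEdit:
    component: str          # file-level harness component (component observability)
    diff: str               # the proposed change, kept revertible
    prediction: str         # self-declared expected effect (decision observability)
    predicted_delta: float  # expected change in pass@1, e.g. +0.02

@dataclass
class EditRecord:
    edit: HarnessEdit
    observed_delta: float = 0.0
    verified: bool = False

def verify(record: EditRecord, pass1_before: float, pass1_after: float) -> bool:
    """Check the edit's self-declared prediction against the next round's
    task-level outcomes. Here the contract is taken to hold when the
    observed pass@1 change has the same sign as the predicted change
    (a simplifying assumption for this sketch)."""
    record.observed_delta = pass1_after - pass1_before
    record.verified = (record.observed_delta >= 0) == (record.edit.predicted_delta >= 0)
    return record.verified

# Hypothetical usage: an edit predicting +2pp that actually yields +1.5pp
# passes verification; a falsified edit would be reverted, which the
# file-level component representation makes cheap.
edit = HarnessEdit("tools/search.md", "...", "faster repo search raises pass@1", 0.02)
rec = EditRecord(edit)
print(verify(rec, 0.697, 0.712))
```

The point of the sketch is the pairing: an edit without a prediction cannot be falsified, and a prediction without outcome verification is just a guess, so the loop only keeps edits whose declared effect survives contact with the next round's results.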