WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
2026-05-11 • Computation and Language
AI summary
The authors created WildClawBench, a new benchmark that measures how well language and vision-language AI agents complete complex, real-world tasks with command-line tools, rather than relying on simple synthetic tests. The tasks take around 8 minutes each and involve many tool calls, running inside realistic environments rather than simulations. The authors found that even the best current models struggled, with scores mostly under 60%, showing that handling long, realistic tasks is still very challenging. They also release all their tasks and code so others can repeat the evaluation.
large language models, vision-language models, command-line interface (CLI), agent benchmarks, long-horizon tasks, Docker containers, hybrid grading, OpenClaw, semantic verification, reproducible evaluation
Authors
Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang
Abstract
Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching the harness alone shifts a single model's score by up to 18 points. These results show that long-horizon work in native runtimes remains far from solved for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
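To make the hybrid grading idea concrete, the sketch below shows one way the three signals named in the abstract (deterministic rule-based checks, environment-state audits of side effects, and an LLM/VLM judge score) could be aggregated into a verdict. The function and field names, the gating order, and the judge threshold are all illustrative assumptions, not the authors' actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class GradeReport:
    """Hypothetical container for the three hybrid-grading signals."""
    rule_checks_passed: bool   # deterministic, rule-based checks on outputs
    state_audit_passed: bool   # container side effects match expected state
    judge_score: float         # semantic score from an LLM/VLM judge in [0, 1]


def grade_task(report: GradeReport, judge_threshold: float = 0.5) -> bool:
    """Combine the signals into a single pass/fail verdict.

    Assumed aggregation: deterministic checks and state audits act as hard
    gates, and the semantic judge acts as a soft gate with a threshold. The
    paper's actual weighting and combination rule may differ.
    """
    if not (report.rule_checks_passed and report.state_audit_passed):
        return False
    return report.judge_score >= judge_threshold


# Example with hypothetical values:
print(grade_task(GradeReport(True, True, 0.8)))   # True: all gates satisfied
print(grade_task(GradeReport(True, False, 0.9)))  # False: state audit failed
```

Under this reading, a task is credited only when the agent both produced the right artifacts (rule checks), left the environment in the expected state (audit), and the judge deems the result semantically correct; any single failed gate fails the task.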