ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

2026-06-01 · Artificial Intelligence

Artificial Intelligence, Computation and Language, Emerging Technologies, Multiagent Systems
AI summary

The authors introduce ClinEnv, a benchmark that tests how well large language models (LLMs) can act as attending physicians during hospital stays by making step-by-step decisions. Unlike earlier tests, ClinEnv requires models to gather information by querying specialized agents before committing to medications, procedures, and diagnoses. The study shows that even the best models struggle with treatment choices and tend to ask repetitive questions, and that the quality of the final decisions is largely decoupled from the quality of the information gathering. ClinEnv exposes this gap by measuring both decision accuracy and the information-gathering process.

LLM, clinical decision-making, interactive benchmark, inpatient simulation, decision stages, information gathering, F1 score, medical management, ontology matching, diagnosis
Authors
Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo, Xukai Zhao, Jinzhuo Wang, May Dongmei Wang
Abstract
Clinical practice is not the selection of an answer from enumerated options: a physician gathers heterogeneous information incrementally and commits to sequential, irreversible decisions under uncertainty. Static benchmarks cannot probe these properties, and existing interactive medical benchmarks each compromise on at least one of them. We present ClinEnv, an interactive benchmark that evaluates LLMs as attending physicians over real inpatient admissions under a paradigm we term Longitudinal Inpatient Simulation. Each case is automatically constructed into an ordered sequence of decision stages; at every stage the model must actively query four specialized agents before committing to medications, procedures, and diagnoses. ClinEnv scores both what the model decides, through deterministic ontology-grounded matching, and how it gathers information. Across seven models, the strongest reaches only 0.31 decision F1, and outcome quality is sharply decoupled from process quality. Difficulty concentrates in management decisions and later stages: models recover discharge diagnoses far more reliably than management actions (0.51 vs. 0.17 F1) and continue to issue redundant queries as cases progress. ClinEnv makes this information-acquisition gap, invisible to outcome-only evaluation, directly measurable.
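
To make the two kinds of score in the abstract concrete, the following is a minimal sketch, not ClinEnv's actual scorer. It assumes that decisions have already been normalized to ontology codes, so that matching reduces to exact set comparison, and it assumes that a query counts as redundant when its normalized text repeats an earlier query in the same case. The function names `set_f1` and `redundant_query_rate`, the code strings, and the redundancy definition are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List


def set_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """F1 between two sets of ontology codes using exact matching.

    Assumption: codes are already normalized to a shared ontology, so
    string equality is a deterministic match criterion.
    """
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # nothing to predict and nothing predicted
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def redundant_query_rate(queries: List[str]) -> float:
    """Fraction of queries that repeat an earlier query in the same case.

    Assumption: two queries are "the same" if they are equal after
    lowercasing and collapsing whitespace. This is a crude proxy.
    """
    if not queries:
        return 0.0
    seen = set()
    repeats = 0
    for query in queries:
        key = " ".join(query.lower().split())
        if key in seen:
            repeats += 1
        else:
            seen.add(key)
    return repeats / len(queries)


# Hypothetical codes: one of two predictions is correct, one gold code is missed.
print(set_f1(["dx:A", "dx:B"], ["dx:B", "dx:C"]))  # 0.5

# Hypothetical query log: the second query repeats the first.
print(redundant_query_rate(["Latest labs?", "latest  labs?", "Imaging?"]))
```

Keeping the outcome score and the process score as separate functions mirrors the abstract's point that the two can diverge: a model could reach a given decision F1 while still issuing many repeated queries.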