PhysDox: Benchmarking LLMs on Physical Feasibility Auditing of Physiological Sensing Protocols

2026-06-03 | Human-Computer Interaction

AI summary

The authors created PhysDox, a benchmark that tests whether the instructions in a biomedical experiment protocol are physically possible to carry out. They evaluated six large language models on spotting minor or fatal problems in the protocols and found only moderate success. The study shows that models often confuse whether a protocol is complete with whether it is actually doable, and so miss many implicit practical issues. The authors suggest that spotting these problems requires careful reasoning about what is physically feasible, not just knowing facts or writing longer explanations.

large language models, biomedical protocols, physical feasibility, benchmark, error analysis, constraint violation, macro-F1 score, protocol auditing, attention failure, judgment failure
Authors
He Liu, Boyuan Gu, Shuaiqi Cheng, Haiyang Sun, Siyu You, Xuming Hu
Abstract
Large language models (LLMs) increasingly assist in experimental design, yet fluent protocols often remain physically infeasible. We introduce PhysDox, a physical feasibility auditing benchmark for biomedical protocols comprising a 683-sample expert-curated Gold set and a 5,000-sample Silver set across six sensing domains. We formulate the task as a two-stage evaluation: severity detection, which classifies protocols as valid, minor, or fatal, followed by constraint-level diagnosis of fatal violations. Evaluating six LLMs across four inference strategies yields a peak Stage-1 macro-F1 of only 53.0. Moreover, strong oracle diagnosis collapses during end-to-end evaluation due to correlated cascade errors. Error analysis reveals scaffold bias, where models conflate procedural completeness with physical validity. Consequently, implicit constraints exhibit a 2× higher miss rate than explicit hardware violations, supported by a strong statistical correlation ($\rho = 0.81$, $p < 0.01$). Trace analysis of false negatives exposes a 54%/46% split between attention and judgment failures, ultimately demonstrating that protocol auditing demands calibrated feasibility reasoning rather than factual recall or longer rationales.
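For readers unfamiliar with the Stage-1 metric, the following is a minimal sketch of macro-F1 over the three severity labels named in the abstract (valid, minor, fatal). It is an illustration under stated assumptions, not the authors' evaluation code: the function name `macro_f1`, the toy label lists, and the choice to score a class with no gold or predicted instances as 0.0 are all hypothetical. The score here is on a 0 to 1 scale, whereas the abstract reports values on a 0 to 100 scale.

```python
from collections import Counter

# Severity labels taken from the abstract's Stage-1 task description.
LABELS = ("valid", "minor", "fatal")


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed

    f1_scores = []
    for c in labels:
        # F1 = 2TP / (2TP + FP + FN); assumption: 0.0 when undefined.
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


# Toy data, invented purely for illustration.
gold = ["valid", "minor", "fatal", "fatal", "valid", "minor"]
pred = ["valid", "valid", "fatal", "minor", "valid", "minor"]
score = macro_f1(gold, pred)
print(f"macro-F1 = {score:.3f}")
```

Because macro-F1 averages the per-class scores without weighting by class frequency, a model that handles valid protocols well but misses fatal ones is penalized as heavily as the reverse, which suits a task where the rare fatal class matters most.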