Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA

2026-06-02 | Computation and Language

Computation and Language, Artificial Intelligence
AI summary

The authors created DOSEBENCH, a test made up of 81 tricky questions about how adults can safely take common over-the-counter medicines like acetaminophen and ibuprofen. They checked how well four large language models handled these questions, focusing on important skills like keeping track of timing and following dosing rules. The authors found that the models often had trouble with tracking doses over time and dealing with unclear information, and sometimes gave confident answers that were actually wrong. This shows that these kinds of questions are good for testing how well AI can handle timing, rules, and uncertainty in medical advice.

Large Language Models, Over-the-Counter Medication, Acetaminophen, Ibuprofen, Dosing Guidelines, Temporal Reasoning, Medical Question Answering, Benchmarking, Safety Constraints, Uncertainty Handling
Authors
Maroof Kousar, Yibo Hu
Abstract
Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Yet this common, safety-relevant setting remains underexplored in existing medical QA evaluations, even though correct answers require tracking dose timing, computing rolling 24-hour intake, following product-label constraints, and handling incomplete medication histories. We introduce DOSEBENCH, a focused benchmark of 81 curated OTC dosing scenarios covering adult acetaminophen and ibuprofen use, with manually annotated gold references. We evaluate four LLMs across repeated runs, yielding 1,620 model responses, and score them with metrics for decision correctness, consistency, explanation verifiability, failure types, and confidence-related signals. Our results show that models frequently struggle with rolling-window reasoning and ambiguity-sensitive cases, and that stable or confident-looking responses can still violate dosing constraints. These findings suggest that OTC dosing QA provides a narrow yet practical testbed for evaluating temporal reasoning, constraint following, and safety-relevant uncertainty handling in medical QA.
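To make the rolling-window requirement concrete, the following is a minimal Python sketch of the kind of check a dosing scenario might call for. It is not the benchmark's reference logic: the `Dose` type, the function names, the `"yes"`/`"no"`/`"uncertain"` labels, and the numeric limits passed in the usage example are all hypothetical placeholders chosen for illustration, not label values taken from the paper.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Dose:
    # Hypothetical record type; taken_at=None models a dose whose time is unknown.
    taken_at: Optional[datetime]
    amount_mg: float


def rolling_24h_intake(doses: List[Dose], now: datetime) -> float:
    """Sum the doses with a known time inside the 24 hours ending at `now`."""
    window_start = now - timedelta(hours=24)
    return sum(
        d.amount_mg
        for d in doses
        if d.taken_at is not None and window_start < d.taken_at <= now
    )


def can_take_another(
    doses: List[Dose],
    now: datetime,
    next_mg: float,
    max_daily_mg: float,    # assumed limit, supplied by the caller
    min_interval_h: float,  # assumed minimum spacing, supplied by the caller
) -> str:
    """Return 'yes', 'no', or 'uncertain' (illustrative labels only)."""
    # Incomplete history: the rolling total cannot be verified.
    if any(d.taken_at is None for d in doses):
        return "uncertain"

    # Spacing constraint: compare against the most recent past dose.
    past_times = [d.taken_at for d in doses if d.taken_at <= now]
    if past_times and now - max(past_times) < timedelta(hours=min_interval_h):
        return "no"

    # Rolling-window constraint: prior intake plus the proposed dose.
    if rolling_24h_intake(doses, now) + next_mg > max_daily_mg:
        return "no"
    return "yes"


# Usage with placeholder numbers (not dosing guidance).
history = [
    Dose(datetime(2026, 1, 1, 8, 0), 1000),
    Dose(datetime(2026, 1, 1, 14, 0), 1000),
    Dose(datetime(2026, 1, 1, 20, 0), 1000),
]
early = can_take_another(history, datetime(2026, 1, 2, 7, 0), 1000, 3000, 6)
later = can_take_another(history, datetime(2026, 1, 2, 9, 0), 1000, 3000, 6)
unknown = can_take_another(
    history + [Dose(None, 500)], datetime(2026, 1, 2, 9, 0), 1000, 3000, 6
)
```

In this example a calendar-day tally would treat the 07:00 question on the second day as a fresh start, whereas the rolling window still contains all three earlier doses. Two hours later the 08:00 dose has aged out of the window and the answer changes, which is the kind of time-dependent shift the abstract refers to as rolling-window reasoning.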