From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

2026-06-02 | Artificial Intelligence

AI summary

The authors created ChemCoTBench-V2, a new test that checks how well language models explain their chemistry answers step by step, not just whether the final answer is right. The test uses clear rules to verify each part of the model's reasoning, avoiding expensive human checks and unreliable judgments from other AI models. It covers several chemistry tasks and highlights cases where a model gives a correct answer but makes mistakes in the reasoning process. Their experiments show that models often produce right answers without sound chemical logic, and the benchmark helps pinpoint the step where the reasoning goes wrong.

large language models, chemistry reasoning, benchmark, molecular optimization, reaction prediction, rule-based verification, intermediate steps, chemical logic, model evaluation, structured reasoning
Authors
Hongyu Guo, Hao Li, He Cao, Gongbo Zhang, Li Yuan
Abstract
Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and human step-level process annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic benchmark for low-cost, auditable evaluation of structured, verifier-addressable chemical reasoning traces. It spans molecular understanding, molecule editing, molecular optimization, and reaction prediction, with 5,620 evaluation samples across 18 reporting tasks. Models must expose key intermediate steps in expert-designed templates, and those steps are checked with deterministic chemistry rules and, for closed-answer tasks, reference traces rather than another LLM judge. Open-ended molecular optimization is evaluated with oracle-verifiable state constraints rather than strict trace matching. The benchmark reports three separate signals: final-answer correctness, template adherence, and step-wise verifier correctness over expert-refined intermediate commitments. Experiments on frontier models reveal a persistent gap between final-answer success and structured-reasoning-state consistency: models often follow the requested format while failing chemical-step checks, or answer correctly with weak supporting reasoning. ChemCoTBench-V2 enables fine-grained model comparison and identifies the concrete step at which the trace first violates the verifier.
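To make the three reported signals concrete, the following is a minimal sketch of how a templated trace could be scored with deterministic rules and how the first violated step could be located. It is not the benchmark's implementation: the `Report` class, the `evaluate` function, the step names, and the toy ethanol task are all hypothetical, and the rules here are simple equality checks rather than real chemistry verifiers, reference traces, or oracle constraints.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Report:
    final_correct: bool             # signal 1: final-answer correctness
    template_ok: bool               # signal 2: template adherence
    step_accuracy: float            # signal 3: fraction of steps passing their rule
    first_violation: Optional[str]  # first step that fails its rule, if any


def evaluate(
    trace: Dict[str, object],
    template: List[str],
    rules: Dict[str, Callable[[object], bool]],
    gold_answer: object,
) -> Report:
    """Score one trace (hypothetical layout: {"steps": {...}, "answer": ...})."""
    steps = trace.get("steps", {})

    # Template adherence: exactly the required steps, in the required order.
    template_ok = list(steps) == template

    # Step-wise check: run each deterministic rule; a missing step counts as a failure.
    passed = 0
    first_violation = None
    for name in template:
        ok = name in steps and rules[name](steps[name])
        if ok:
            passed += 1
        elif first_violation is None:
            first_violation = name

    return Report(
        final_correct=trace.get("answer") == gold_answer,
        template_ok=template_ok,
        step_accuracy=passed / len(template),
        first_violation=first_violation,
    )


# Toy task (assumed for illustration): state the molecular formula of ethanol.
template = ["carbon_count", "hydrogen_count", "oxygen_count"]
rules = {
    "carbon_count": lambda v: v == 2,
    "hydrogen_count": lambda v: v == 6,
    "oxygen_count": lambda v: v == 1,
}

# The trace follows the template and ends on the right answer,
# but its hydrogen count is wrong.
trace = {
    "steps": {"carbon_count": 2, "hydrogen_count": 5, "oxygen_count": 1},
    "answer": "C2H6O",
}

report = evaluate(trace, template, rules, gold_answer="C2H6O")
print(report)
```

In this toy case the final answer and the template check both pass while the step-wise check fails at `hydrogen_count`, which is the kind of gap between answer success and reasoning-state consistency that the abstract describes.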