When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors

2026-06-30

Computation and Language, Artificial Intelligence
AI summary

The authors studied mistakes made by large language models when they incorrectly use or leave out numbers from tables during reasoning. They found that these errors happen in all tested models, regardless of size. To fix this, they introduced a separate system that checks and filters out errors, improving the models' final answers. They also built a smaller model that can detect these table referencing errors well and help bigger models avoid mistakes.

large language models, table tasks, data referencing errors, intermediate reasoning, critic model, filtering, rejection sampling, F1 score, in-distribution, out-of-distribution
Authors
Yuqing Yang, Qi Zhu, Zhen Han, Boran Han, Zhengyuan Shen, Shuai Wang, Vassilis N. Ioannidis, Huzefa Rangwala
Abstract
While large language models (LLMs) perform well on table tasks, they still make data referencing errors (DREs), i.e., they cite table values incorrectly or omit them, despite understanding the table structure. Beyond final-answer accuracy, DREs directly compromise the correctness and reliability of intermediate reasoning steps. Yet prior studies have offered only limited, small-scale analyses. In this work, we present the first systematic evaluation of tabular data referencing errors across different models and tasks. Our results show that DREs occur across all tested models (1.7B to 20B parameters). Furthermore, we demonstrate that using data referencing checks as a critic significantly improves answer accuracy, by up to 12.0%, through critic-based filtering and rejection sampling. Finally, we train a lightweight 4B-parameter critic model that achieves an average F1 score of 78.2% in detecting both in-distribution and out-of-distribution DREs and effectively assists inference for larger models.