Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

2026-05-26

Computation and Language, Computer Vision and Pattern Recognition
AI summary

The authors created a new way to test how well computer models understand charts by changing the charts but keeping the questions the same. They made a tool called Chartographer that can turn charts into code, tweak them, and check if models still answer questions correctly. Their tests showed that many models struggle to handle these changed charts, even if they got the original ones right. This means models might rely too much on knowing the specific chart instead of truly understanding the visual information.

chart question-answering, visual reasoning, counterfactual charts, Chartographer, vision-language models, benchmarking, generalizability, executable code, dataset, model evaluation
Authors
Yifan Jiang, Dae Yon Hwang, Jesse C. Cresswell, Freda Shi
Abstract
Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to answer correctly, but models can often reach solutions through shortcuts or prior familiarity with a chart based on their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts where the chart-question task remains fixed, but the underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find failures are most prevalent when updated charts require novel visual reasoning pathways.
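
The counterfactual idea in the abstract can be illustrated with a minimal Python sketch. It is not the Chartographer implementation: the chart is reduced to a hypothetical dictionary of bar heights, and the names `make_counterfactual`, `answer_tallest_bar`, and `generalization_rate` are assumptions made here for illustration. A real pipeline would render each variant to an image and query a VLM. In this sketch the `predict` callable receives the chart data directly as a stand-in for that step.

```python
import random
from typing import Callable, Dict, List, Optional

# Hypothetical chart spec: category label -> bar height.
ChartData = Dict[str, float]


def make_counterfactual(data: ChartData, seed: int) -> ChartData:
    """Keep the categories fixed, resample the values under a fixed seed."""
    rng = random.Random(seed)  # local RNG: the same seed reproduces the same variant
    lo, hi = min(data.values()), max(data.values())
    return {label: round(rng.uniform(lo, hi), 1) for label in data}


def answer_tallest_bar(data: ChartData) -> str:
    """Executable QA logic for the fixed question 'Which category has the tallest bar?'."""
    return max(data, key=data.get)


def generalization_rate(
    predict: Callable[[ChartData], str],
    original: ChartData,
    seeds: List[int],
    qa: Callable[[ChartData], str],
) -> Optional[float]:
    """Fraction of counterfactuals answered correctly, given a correct original answer.

    Returns None when the original chart is answered incorrectly, because the
    quantity of interest is conditioned on original-chart success.
    """
    if predict(original) != qa(original):
        return None
    variants = [make_counterfactual(original, s) for s in seeds]
    correct = sum(predict(v) == qa(v) for v in variants)
    return correct / len(variants)


original = {"A": 3.0, "B": 7.5, "C": 5.2}
seeds = list(range(10))

# A stand-in "model" that has memorized the original answer.
memorizer = lambda chart: "B"
# A stand-in "model" that actually reads the chart.
oracle = answer_tallest_bar

memorizer_rate = generalization_rate(memorizer, original, seeds, answer_tallest_bar)
oracle_rate = generalization_rate(oracle, original, seeds, answer_tallest_bar)
```

Because the answer for every variant is recomputed by the same QA function, no manual relabeling is needed, and a predictor that relies on familiarity with the original chart is scored only on the variants where its memorized answer happens to remain correct.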