Evaluating Reasoning Fidelity in Visual Text Generation
2026-06-03 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition • Artificial Intelligence • Computation and Language
AI summary
The authors examined whether recent text-to-image models can express detailed reasoning when a complex solution must be written out as text inside a generated image. They tested long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Although the rendered text looks clear, the models often produce semantic errors, logical inconsistencies, and incorrect intermediate steps, whereas text-only models perform strongly on the same tasks. This points to a substantial gap between rendering legible text and faithfully conveying reasoning through it.
text-to-image models, reasoning fidelity, visual text generation, multi-step reasoning, semantic errors, logical inconsistencies, factual knowledge probing, context understanding, text rendering
Authors
Jiajun Hong, Jiawei Zhou
Abstract
Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.