When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
2026-04-09 • Computer Vision and Pattern Recognition
AI summary
The authors present NUMINA, a method that helps text-to-video systems better match the number of objects described in a prompt. NUMINA works without retraining by identifying mismatches between the prompt and the video layout, then guiding the system to fix these issues. It improves counting accuracy in generated videos and keeps visual quality stable over time. Their experiments show that NUMINA makes video outputs more accurate in object numbers while maintaining good alignment with the text.
text-to-video diffusion, numerical alignment, self-attention, cross-attention, latent layout, CountBench, CLIP alignment, temporal consistency, prompt layout, object counting
Authors
Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen, Dingkang Liang, Xiang Bai
Abstract
Text-to-video diffusion models have enabled open-ended video synthesis, but they often struggle to generate the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively. Furthermore, CLIP alignment improves while temporal consistency is maintained. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
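The "identify" step described in the abstract can be illustrated with a minimal sketch: binarize a cross-attention map for an object token into a latent layout, count its connected components, and flag a mismatch against the number stated in the prompt. This is an illustrative approximation under assumed inputs (a single `[H, W]` attention map per object token); the paper's actual head selection, layout refinement, and attention-modulation details are not reproduced here.

```python
import numpy as np


def count_instances(attn_map: np.ndarray, thresh: float = 0.5) -> int:
    """Binarize a cross-attention map and count 4-connected components,
    used here as a proxy for the number of object instances in the layout.
    (Illustrative; not the paper's exact counting procedure.)"""
    mask = attn_map >= thresh * attn_map.max()
    visited = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not visited[i, j]:
                count += 1
                stack = [(i, j)]  # flood-fill this component
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and mask[y, x] and not visited[y, x]:
                        visited[y, x] = True
                        stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])
    return count


def needs_guidance(attn_map: np.ndarray, prompted_count: int) -> bool:
    """Identify step: return True when the latent layout disagrees
    with the object count specified in the prompt."""
    return count_instances(attn_map) != prompted_count
```

When `needs_guidance` fires, the framework would then refine the layout and reweight cross-attention during regeneration so the decoded video matches the prompted count.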