Let ViT Speak: Generative Language-Image Pre-training
2026-05-01 • Computer Vision and Pattern Recognition
AI summary
The authors introduce GenLIP, a simpler way to train Vision Transformers to understand images and link them directly to language by predicting words from visual data. Unlike other methods, their approach uses one model to handle both images and text together without extra parts, making it easier to scale and train. They trained GenLIP on a large dataset and found it performs as well or better than existing models while using less data. Additionally, training with images at different sizes helped the model improve on tasks needing fine detail, like reading text in images or understanding charts.
Vision Transformer (ViT), Multimodal Large Language Models (MLLMs), Generative Pretraining, Language Modeling Objective, Autoregressive Models, Contrastive Learning, OCR (Optical Character Recognition), Chart Understanding, Recap-DataComp-1B, Pretraining
Authors
Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, Yujie Zhong, Yingchen Yu, Qi She, Yao Zhao, Yunchao Wei
Abstract
In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
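To make the objective concrete, here is a minimal NumPy sketch of the kind of generative language-image objective the abstract describes: a stub vision encoder produces visual tokens, a single causal transformer layer processes the joint sequence of visual and text tokens, and a standard next-token cross-entropy loss is applied only at the text positions. All dimensions, weight initializations, and the one-layer attention stand-in are illustrative assumptions, not the paper's actual architecture or training configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative assumptions, not GenLIP's real config).
D = 16          # embedding dimension
V = 32          # text vocabulary size
N_VIS = 4       # number of visual tokens from the vision encoder
N_TXT = 5       # number of caption tokens

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Stub "ViT": in the real method these would be patch embeddings processed
# by the vision transformer; here we just draw random visual token vectors.
visual_tokens = rng.normal(size=(N_VIS, D))

# Caption token ids and a text embedding table.
text_ids = rng.integers(0, V, size=N_TXT)
text_embed = rng.normal(size=(V, D))

# One transformer over the joint sequence [visual tokens; text tokens].
seq = np.concatenate([visual_tokens, text_embed[text_ids]], axis=0)
L = seq.shape[0]

# Single-head self-attention weights; one toy layer stands in for the stack.
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
W_head = rng.normal(size=(D, V)) * 0.1   # LM head over the text vocabulary

# Causal mask: position i attends to positions <= i, so every text token
# sees all visual tokens (which come first) plus earlier text tokens.
mask = np.tril(np.ones((L, L), dtype=bool))
scores = (seq @ Wq) @ (seq @ Wk).T / np.sqrt(D)
scores = np.where(mask, scores, -1e9)
hidden = softmax(scores, axis=-1) @ (seq @ Wv)

# Standard language-modeling loss, only at positions that predict text:
# the first caption token is predicted from the last visual token, i.e.
# from the image alone; later tokens also condition on earlier text.
logits = hidden @ W_head
pred_positions = np.arange(N_VIS - 1, N_VIS - 1 + N_TXT)
probs = softmax(logits[pred_positions], axis=-1)
nll = -np.log(probs[np.arange(N_TXT), text_ids]).mean()
print(f"language-modeling loss over caption tokens: {nll:.3f}")
```

Because the text decoder is the same transformer that consumes the visual tokens, no contrastive batch construction or separate decoder head is needed; the loss is ordinary next-token cross-entropy over the caption.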