LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

2026-04-22 · Computer Vision and Pattern Recognition

AI summary

The authors present LLaDA2.0-Uni, a language model that understands and generates both text and images within a single, unified system. Images are discretized into tokens the model can process alongside text, and outputs are produced via masked diffusion with optimized decoding. A multi-stage training pipeline gives the model strong performance in multimodal understanding as well as image generation and editing. Overall, this work demonstrates a way to build versatile AI models that handle mixed data types efficiently.

Keywords: discrete diffusion, large language model, multimodal understanding, visual tokenizer, Mixture of Experts (MoE), diffusion decoder, masked diffusion, image generation, model distillation, foundation models
Authors
Inclusion AI, Tiwei Bie, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng, Long Cui, Kai Gan, Zhicheng Huang, Zhenzhong Lan, Haoquan Li, Jianguo Li, Tao Lin, Qi Qin, Hongjun Wang, Xiaomei Wang, Haoyuan Wu, Yi Xin, Junbo Zhao
Abstract
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
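The block-level masked diffusion the abstract describes can be illustrated with a minimal sketch: decoding starts from an all-masked block of tokens and, over a few steps, commits the most confident predictions in parallel until the block is filled. This is a generic masked-diffusion illustration, not the authors' implementation; `toy_model`, the `MASK` id, and the confidence-based unmasking schedule are all assumptions made for the example.

```python
import math
import random

MASK = -1  # hypothetical mask-token id, for illustration only


def toy_model(tokens, vocab_size=16):
    """Stand-in for the dLLM backbone: returns a (token, confidence)
    guess for every position. A real model would predict logits over
    the vocabulary conditioned on the full bidirectional context."""
    preds = []
    for t in tokens:
        if t == MASK:
            preds.append((random.randrange(vocab_size), random.random()))
        else:
            preds.append((t, 1.0))  # already-decoded tokens are kept
    return preds


def sample_block(block_len=8, steps=4):
    """Block-level masked-diffusion decoding sketch: at each step,
    unmask the highest-confidence predictions among the still-masked
    positions, so several tokens are committed in parallel."""
    tokens = [MASK] * block_len
    for step in range(steps):
        preds = toy_model(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # commit an equal share of the remaining positions each step
        k = math.ceil(len(masked) / (steps - step))
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            tokens[i] = preds[i][0]
    return tokens
```

Because every position within the block is predicted jointly at each step, this style of decoder trades the strictly sequential generation of autoregressive models for a small, fixed number of parallel refinement steps per block.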