Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
2026-03-06 • Computer Vision and Pattern Recognition
AI summary
The authors created Omni-Diffusion, a new type of language model that can understand and generate text, speech, and images all together. Unlike most existing models that use a step-by-step approach, their model uses a different method called discrete diffusion to handle various types of data at once. This lets it work well on tasks involving multiple types of inputs, sometimes even more than two at a time. Their tests show that Omni-Diffusion matches or beats other models designed for similar multimodal tasks.
multimodal large language models · autoregressive architecture · discrete diffusion models · mask-based modeling · multimodal tokens · joint distribution · text generation · speech processing · image understanding · foundation models
Authors
Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu
Abstract
While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
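The core mechanism the abstract describes, a single mask-based discrete diffusion model over a joint sequence of text, speech, and image tokens, can be illustrated with a minimal sketch. Everything below (the shared `MASK_ID`, the per-token masking rate `t`, the masked-position cross-entropy, and the toy token ids) is an illustrative assumption about how such models are typically trained, not the authors' actual implementation:

```python
import numpy as np

MASK_ID = 0  # assumption: a reserved [MASK] token shared across all modalities

def corrupt(tokens, t, rng):
    """Forward (noising) process of masked discrete diffusion: each token is
    independently replaced by [MASK] with probability t (the noise level)."""
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < t
    return np.where(mask, MASK_ID, tokens), mask

def masked_ce_loss(logits, targets, mask):
    """Training-objective sketch: cross-entropy on masked positions only,
    i.e. the denoiser is supervised to reconstruct what was masked out."""
    z = logits - logits.max(axis=-1, keepdims=True)          # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(logp, targets[..., None], axis=-1)[..., 0]
    return (nll * mask).sum() / max(mask.sum(), 1)

rng = np.random.default_rng(0)
# a toy interleaved sequence of fabricated text / image / speech token ids
seq = np.array([5, 17, 203, 204, 205, 88, 901, 902])
noisy, mask = corrupt(seq, t=0.5, rng=rng)
logits = rng.standard_normal((len(seq), 1024))  # stand-in for the model's output
loss = masked_ce_loss(logits, seq, mask)
```

Because the masking and the loss are modality-agnostic, the same objective covers understanding (mask one modality, condition on the others) and generation (start from an all-masked span and iteratively unmask), which is what makes the single backbone "any-to-any."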