MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation

2026-06-24 • Computer Vision and Pattern Recognition

AI summary

The authors introduce MIMFlow, a new method that combines Masked Image Modeling (MIM) and Normalizing Flows (NFs) to improve image generation. They use a VAE encoder to extract important high-level image features, letting the Normalizing Flow focus on simpler semantic information instead of pixel details. This approach helps the model better capture overall image structure while still accurately reconstructing fine details. Their experiments on ImageNet show improved performance compared to similar models, using fewer tokens to represent images.

Normalizing Flows, Masked Image Modeling, Variational Autoencoder, Semantic Latent Space, Image Generation, Density Estimation, Pixel Reconstruction, FID Score, Linear Probing

Authors

Yang Chen, Xiaowei Xu, Shuai Wang, Xinwen Zhang, Qiushi Guo, Tiezheng Ge, Limin Wang

Abstract

Normalizing Flows (NFs) are powerful generative models capable of exact density estimation and sampling. However, their strict invertibility often forces the model to exhaust its capacity on low-level pixel details, hindering the capture of high-level semantic structures. While Masked Image Modeling (MIM) has excelled in representation learning, its integration into generative pipelines has remained largely modular and disjointed. In this paper, we propose MIMFlow, a unified end-to-end framework that jointly optimizes latent semantics, pixel reconstruction, and generative flow. By employing a VAE encoder to infer a semantic latent from masked images, MIMFlow achieves a principled decoupling of the generative task: the Normalizing Flow focuses on modeling a simplified, low-frequency semantic manifold, while a specialized decoder handles high-frequency synthesis. This design effectively resolves the inherent capacity bottleneck of NFs, allowing the model to prioritize global structural coherence over redundant noise. Empirical results on ImageNet 256×256 show that MIMFlow-L reaches 71.3% linear probing accuracy and an FID of 2.50. Despite using only 128 tokens (50% fewer than standard models), it yields a 32.8% performance gain over similar-scale NF baselines. Our code is available at https://github.com/MCG-NJU/MIMFlow.
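
The "exact density estimation" property mentioned above follows from the change-of-variables formula: for an invertible map z = f(x), log p(x) = log p_Z(f(x)) + log|det ∂f/∂x|. The sketch below illustrates this with a toy elementwise affine flow in plain Python. It is not MIMFlow's architecture; the function names (`forward`, `inverse`, `log_likelihood`) and the parameters `shift` and `log_scale` are hypothetical and chosen only for illustration. According to the abstract, MIMFlow applies its flow to a semantic latent rather than to raw pixels, so `x` here would stand in for such a latent vector.

```python
import math

def forward(x, shift, log_scale):
    """Map x to the latent z = (x - shift) / exp(log_scale), elementwise.

    Returns z and log|det dz/dx|, which for an elementwise affine map
    is simply -sum(log_scale).
    """
    z = [(xi - s) * math.exp(-ls) for xi, s, ls in zip(x, shift, log_scale)]
    log_det = -sum(log_scale)
    return z, log_det

def inverse(z, shift, log_scale):
    """Exact inverse of forward: x = z * exp(log_scale) + shift."""
    return [zi * math.exp(ls) + s for zi, s, ls in zip(z, shift, log_scale)]

def standard_normal_log_prob(z):
    """Log density of z under an isotropic standard normal prior."""
    return sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)

def log_likelihood(x, shift, log_scale):
    """Exact log p(x) via change of variables: prior term plus log-determinant."""
    z, log_det = forward(x, shift, log_scale)
    return standard_normal_log_prob(z) + log_det
```

Because the map is invertible, sampling is the reverse path: draw z from the standard normal prior and call `inverse`. Practical flows stack many such invertible layers with learned, input-dependent parameters, but the likelihood bookkeeping stays the same.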
