Channel-wise Vector Quantization

2026-05-25

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors propose Channel-wise Vector Quantization (CVQ), an image tokenizer that quantizes each channel of the feature map instead of each patch. An image is then represented as levels of visual detail, capturing structure first and details later, similar to how an artist works. On top of CVQ, they build a Channel-wise Autoregressive (CAR) model that generates images by predicting channels one after another. Their experiments show improved image reconstruction over conventional vector quantization and strong results on text-to-image benchmarks.

Vector Quantization, Image Tokenization, Feature Map Channels, Autoregressive Models, Text-to-Image Generation, Codebook, Image Reconstruction, Patch-wise Tokens, Visual Details, Next-Channel Prediction
Authors
Wei Song, Tianhang Wang, Yitong Chen, Tong Zhang, Zuxuan Wu, Ming Li, Jiaqi Wang, Kaicheng Yu
Abstract
We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual details, rather than as a grid of spatial patches. Based on CVQ, we introduce a new visual autoregressive framework with "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively enriched visual details. Specifically, it first sketches global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a 16K+ codebook size without any bells and whistles, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.
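
To make the contrast between patch-wise and channel-wise tokens concrete, the following is a minimal sketch, not the authors' implementation. It assumes a feature map of shape C x H x W, a plain nearest-neighbor lookup under squared L2 distance, and randomly initialized codebooks; the function names, the sizes, and the choice to flatten each channel into a single H*W vector are illustrative assumptions that the abstract does not specify.

```python
import random

def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared L2 distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist2(vector, codebook[k]))

def patchwise_tokens(feat, codebook):
    """Conventional VQ: one token per spatial position; each vector has C entries."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    return [[nearest_code([feat[c][i][j] for c in range(C)], codebook)
             for j in range(W)] for i in range(H)]

def channelwise_tokens(feat, codebook):
    """Channel-wise VQ (assumed form): one token per channel; each vector has H*W entries."""
    return [nearest_code([v for row in channel for v in row], codebook)
            for channel in feat]

# Hypothetical toy sizes: 8 channels, 4x4 spatial grid, 32 codebook entries.
random.seed(0)
C, H, W, K = 8, 4, 4, 32
feat = [[[random.gauss(0, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
patch_codebook = [[random.gauss(0, 1) for _ in range(C)] for _ in range(K)]
channel_codebook = [[random.gauss(0, 1) for _ in range(H * W)] for _ in range(K)]

patch_ids = patchwise_tokens(feat, patch_codebook)        # H x W grid of indices
channel_ids = channelwise_tokens(feat, channel_codebook)  # C indices, one per channel
```

Under these assumptions the patch-wise tokenizer yields H*W tokens arranged on the spatial grid, while the channel-wise tokenizer yields a sequence of C tokens, each summarizing the whole image at one channel. A "next-channel prediction" model would then generate that length-C sequence one index at a time.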