TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

2026-04-08
Computer Vision and Pattern Recognition
AI summary

The authors introduce TC-AE, a Vision Transformer (ViT)-based autoencoder for compressing images at high ratios while preserving quality. They show that simply increasing the number of channels in the compressed (latent) representation can cause that representation to collapse, hurting generation, so they focus instead on how image tokens (small patches of the image) are handled. They split the token-to-latent compression into two stages and train the tokens with self-supervision so they better capture image structure. This keeps reconstructions clear even under heavy compression, without more complicated architectures or extra training tricks.

Vision Transformer (ViT), autoencoder, image compression, latent representation, tokenization, self-supervised learning, patch size, generative models, structural information, token-to-latent compression
Authors
Teng Li, Ziyuan Huang, Cong Chen, Yangfu Li, Yuanhuiyi Lyu, Dandan Zheng, Chunhua Shen, Jun Zhang
Abstract
We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations. First, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Second, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generative-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.
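The interplay the abstract describes between patch size, token count, and a fixed latent budget can be made concrete with a little arithmetic. The sketch below is a hypothetical illustration, not the paper's implementation: the image size, ViT token width, and latent budget are all assumed values chosen only to show that halving the patch size quadruples the token count, so the token-to-latent step must compress far more aggressively when the latent budget stays fixed.

```python
# Hypothetical numbers (assumptions, not taken from the paper):
IMAGE_SIZE = 256            # input resolution
TOKEN_DIM = 768             # ViT token width
LATENT_BUDGET = 4 * 4 * 32  # fixed total latent dimensions

for patch in (32, 16, 8):
    # A ViT splits the image into (IMAGE_SIZE / patch)^2 non-overlapping patches.
    tokens = (IMAGE_SIZE // patch) ** 2
    # Ratio of total token dimensions to the fixed latent budget:
    # how much the token-to-latent stage must compress.
    ratio = tokens * TOKEN_DIM / LATENT_BUDGET
    print(f"patch={patch:2d}  tokens={tokens:5d}  token-to-latent compression={ratio:7.1f}x")
```

Under these assumed numbers, shrinking the patch from 32 to 8 grows the token count from 64 to 1024 while the latent budget is unchanged, which is the "aggressive token-to-latent compression" the abstract identifies as the bottleneck and splits into two stages.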