Taming Outlier Tokens in Diffusion Transformers

2026-05-06

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
AI summary

The authors study unusual "outlier" tokens in Diffusion Transformers (DiTs), which are used to generate images. They find that these tokens appear both in the encoder, which represents the image, and in the denoiser, which creates it, especially in the middle layers. Simply masking out these tokens does not help, so the problem runs deeper and is tied to corrupted local patch details. To address this, the authors design a method called Dual-Stage Registers (DSR) that reduces these outlier effects, leading to better image generation on multiple datasets. Their work shows that managing outlier tokens is an important ingredient in improving image-generating transformers.

Diffusion Transformers, Vision Transformers, outlier tokens, Representation Autoencoder, Dual-Stage Registers, image generation, token norm, test-time registers, denoiser, patch semantics
Authors
Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu, Liang-Chieh Chen, Zhe Gan, Chen Wei
Abstract
We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: trained registers when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.
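The abstract's "masking high-norm tokens" baseline can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not the authors' implementation: it flags tokens whose L2 norm exceeds the mean norm by a chosen multiple of the standard deviation, which is one simple way to detect the high-norm outliers described above (the threshold rule and the name `find_outlier_tokens` are assumptions for illustration).

```python
import numpy as np

def find_outlier_tokens(tokens: np.ndarray, k: float = 2.0) -> np.ndarray:
    """Flag tokens whose L2 norm exceeds mean + k * std of all token norms.

    tokens: (num_tokens, dim) array of ViT/DiT token embeddings.
    Returns a boolean mask of shape (num_tokens,), True for outliers.
    Note: this simple norm threshold is an illustrative assumption,
    not the detection rule used in the paper.
    """
    norms = np.linalg.norm(tokens, axis=-1)
    threshold = norms.mean() + k * norms.std()
    return norms > threshold

# Toy example: nine ordinary tokens plus one artificially high-norm token.
rng = np.random.default_rng(0)
toks = rng.normal(size=(10, 16))
toks[3] *= 25.0  # inject a synthetic high-norm "outlier" token
mask = find_outlier_tokens(toks)
```

Per the abstract, simply zeroing or dropping the tokens such a mask flags does not improve generation quality, which is what motivates the register-based intervention instead.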