From SRA to Self-Flow: Data Augmentation or Self-Supervision?

2026-07-02 · Computer Vision and Pattern Recognition

AI summary

The authors studied how a method called Self-Flow improves training of diffusion transformer models by using interactions between tokens with different noise levels. They tested whether the improvement really comes from these interactions or from another factor called data augmentation. Their experiments showed that blocking token interactions did not hurt performance and sometimes helped, suggesting the benefits come mainly from data augmentation. They also found that their new technique, Attention Separation, acts like data augmentation by splitting images into multiple parts for training. Combining these ideas, the authors improved training results on the ImageNet dataset.

diffusion transformer, representation alignment, self-alignment, Self-Flow, SRA, dual-time scheduling, data augmentation, attention mechanism, ImageNet
Authors
Dengyang Jiang, Mengmeng Wang, Harry Yang, Jingdong Wang
Abstract
Representation alignment has become an effective way to accelerate diffusion transformer training and improve generation quality. Recent self-alignment methods, such as SRA and Self-Flow, further remove the dependency on external pretrained encoders by constructing alignment within the diffusion model itself. However, the mechanism behind the improvement from SRA to Self-Flow, namely dual-time scheduling, remains under-examined: Self-Flow attributes its gain to interactions between tokens at different noise levels, where cleaner tokens help infer noisier ones. In this work, we revisit this explanation and ask whether the gain instead comes from data augmentation along the noise dimension. To disentangle these factors, we introduce Attention Separation, which preserves the same dual-timestep input as Self-Flow while blocking attention between tokens assigned to different noise levels. Surprisingly, removing such interaction does not degrade performance and can even improve it, suggesting that the improvement from SRA to Self-Flow mainly comes from data augmentation. Furthermore, we show that Attention Separation itself provides an augmentation effect: it splits a single image into multiple effective training parts, which expands the training data. Based on these observations, we combine self-representation alignment with dual-timestep and attention-separation augmentation, and demonstrate the effectiveness of this design on ImageNet.
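To make the idea of blocking attention between noise levels concrete, the following is a minimal sketch, not the authors' implementation. It assumes each token is assigned to one of two noise-level groups and that attention is allowed only between tokens in the same group. The helper names (`assign_noise_groups`, `separation_mask`, `masked_attention_weights`), the 50/50 split ratio, and the random attention scores are all hypothetical choices for illustration; the abstract does not specify them.

```python
import math
import random


def assign_noise_groups(num_tokens, ratio, rng):
    """Hypothetical helper: put a fraction `ratio` of tokens in group 1
    (second timestep) and the rest in group 0 (first timestep)."""
    k = round(num_tokens * ratio)
    chosen = set(rng.sample(range(num_tokens), k))
    return [1 if i in chosen else 0 for i in range(num_tokens)]


def separation_mask(groups):
    """mask[i][j] is True only when tokens i and j share a noise-level group."""
    return [[gi == gj for gj in groups] for gi in groups]


def masked_attention_weights(scores, mask):
    """Row-wise softmax over the allowed positions; blocked positions get 0."""
    weights = []
    for row, allowed in zip(scores, mask):
        # The diagonal is always allowed, so this max is never over an empty set.
        m = max(s for s, a in zip(row, allowed) if a)
        exps = [math.exp(s - m) if a else 0.0 for s, a in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy example: 8 tokens, half assigned to each noise level (assumed ratio).
rng = random.Random(0)
num_tokens = 8
groups = assign_noise_groups(num_tokens, 0.5, rng)
mask = separation_mask(groups)

# Stand-in attention logits; a real model would compute q @ k^T / sqrt(d).
scores = [[rng.uniform(-1.0, 1.0) for _ in range(num_tokens)]
          for _ in range(num_tokens)]
weights = masked_attention_weights(scores, mask)
```

Under this mask, the two groups never exchange information, so each group behaves like an independent, shorter token sequence drawn from the same image. That is one way to read the abstract's remark that Attention Separation splits a single image into multiple effective training parts.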