DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation
2026-06-03 • Computer Vision and Pattern Recognition
AI summary
The authors address the slow inference of video diffusion transformers, which stems in part from using a fixed number of denoising steps for every frame. They propose DSA, a method that adds a lightweight confidence head to the model to estimate how much refinement each frame needs and adjusts the number of steps accordingly: fewer for easy frames, more for hard ones. This approach speeds up video generation without extra video data or major architectural changes. Their experiments show that DSA generates video in real time (22.63 FPS on H100 GPUs) with VBench quality comparable to or better than recent methods.
video diffusion transformers, autoregressive models, denoising steps, confidence-guided adaptive computation, distribution-matching distillation, inference latency, real-time video generation, VBench quality, adaptive sampling, H100 GPUs
Authors
Thanh-Tung Le, Yunhan Zhao, Menglei Chai, Zhengyang Shen, Zhe Cao, Danhang Tang, Xiaohui Xie, Deying Kong
Abstract
Video diffusion transformers have achieved state-of-the-art visual quality, but their high inference cost remains a major bottleneck for real-time applications. Recent distillation frameworks produce autoregressive (AR) video diffusion models with reduced latency, yet these models still use a fixed number of denoising steps per frame, wasting computation on predictable frames and under-refining challenging ones. We present DSA, a confidence-guided adaptive computation framework for AR video diffusion. DSA introduces a lightweight confidence head, trained jointly with the generator under a distribution-matching distillation objective, to estimate per-frame denoising reliability. At inference, this confidence signal dynamically adjusts the number of diffusion steps: simple frames terminate early for speed, while complex frames receive additional refinement. Our method requires no extra video data, no heuristics, and little architectural modification. Experiments show that DSA achieves real-time autoregressive video generation, reaching 22.63 FPS with sub-second latency on H100 GPUs, while maintaining competitive or superior VBench quality compared to recent autoregressive and bidirectional video diffusion models. Our results demonstrate that confidence-guided adaptive sampling provides an effective and practical path toward interactive video generation.
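The inference-time behavior described in the abstract, where a confidence signal decides when a frame stops denoising, can be illustrated with a minimal sketch. This is not the authors' implementation: the function `denoise_frame_adaptive`, the parameters `threshold`, `min_steps`, and `max_steps`, and the toy `toy_denoise_step` and `toy_confidence_head` stand-ins are all hypothetical names and values chosen for illustration. In the actual method the confidence head is a learned module trained jointly with the generator; here it is replaced by a simple hand-written function so the example runs on its own.

```python
from typing import Callable, List, Tuple

Frame = List[float]


def denoise_frame_adaptive(
    frame: Frame,
    denoise_step: Callable[[Frame, int], Frame],
    confidence_head: Callable[[Frame], float],
    threshold: float = 0.9,  # hypothetical stopping threshold
    min_steps: int = 1,      # hypothetical lower bound on steps
    max_steps: int = 4,      # hypothetical step budget per frame
) -> Tuple[Frame, int]:
    """Run denoising steps until the confidence estimate clears the threshold
    or the step budget is exhausted. Returns the frame and the steps used."""
    steps_used = 0
    for step in range(max_steps):
        frame = denoise_step(frame, step)
        steps_used += 1
        # Early exit: a confident (easy) frame stops refining here.
        if steps_used >= min_steps and confidence_head(frame) >= threshold:
            break
    return frame, steps_used


def toy_denoise_step(frame: Frame, step: int) -> Frame:
    # Toy stand-in for a generator step: halve the residual noise.
    return [0.5 * x for x in frame]


def toy_confidence_head(frame: Frame) -> float:
    # Toy stand-in for a learned head: confidence rises as residual shrinks.
    mean_abs = sum(abs(x) for x in frame) / len(frame)
    return 1.0 / (1.0 + mean_abs)


# An "easy" frame starts near the clean signal; a "hard" frame starts far away.
_, easy_steps = denoise_frame_adaptive(
    [0.1, -0.1], toy_denoise_step, toy_confidence_head
)
_, hard_steps = denoise_frame_adaptive(
    [2.0, -2.0], toy_denoise_step, toy_confidence_head
)
print(easy_steps, hard_steps)
```

In this toy setup the easy frame exits after its first step, while the hard frame never reaches the threshold and uses the full budget. The `max_steps` cap is a design choice of the sketch that bounds worst-case per-frame latency; how the real system bounds or schedules steps is not specified in the abstract.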