TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

2026-05-29 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors study how text-to-video models create videos with many events over time. They find special points in the video creation process where the text instructions shape different video parts, from big layout to small details. Using this, they make TunerDiT, a new method that helps control event boundaries and blends nearby event meanings without extra training. They also make a set of test prompts called Meve to check multi-event video generation. Their method improves how well videos match the text, especially as the number of events grows.

text-to-video generation, diffusion models, video diffusion transformers, denoising trajectory, event partitioning, prompt fusion, multi-event video, training-free methods, video consistency, text alignment
Authors
Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai, Lei Zhang, Xun Xiao, Thomas Seidl, Daniel Cremers, Volker Tresp
Abstract
Text-to-video (T2V) generation remains challenging for long-horizon videos that contain multiple events. Inspired by the intrinsics of the diffusion process, we probe video diffusion transformers (DiTs) and uncover intrinsic turning points in the DiT denoising trajectory at which the influence of the conditioning text shifts from global layout to fine-grained details. Building on this finding, we present TunerDiT, a simple yet effective progressive steering method for multi-event generation that requires no additional training. TunerDiT comprises two steering handles: (1) Event-Partitioned Masking, which enforces event boundaries while allowing cross-event transition bands; and (2) Cross-Event Prompt Fusion, which injects neighboring event semantics for late-stage refinement. We also contribute Meve, a self-curated prompt suite for benchmarking multi-event generation. Compared with other training-free methods, TunerDiT achieves state-of-the-art performance across 8 metrics and offers a tunable trade-off between video consistency and event separation. The improvement in text alignment grows with the event count, suggesting that the method scales to videos with more events.
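To make the idea of event boundaries with transition bands concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how frames are assigned to events or how the bands are defined, so the function name `event_partition_mask`, the uniform split of frames across events, and the `band` parameter (measured in frames) are all hypothetical.

```python
def event_partition_mask(num_frames, num_events, band=0):
    """Build a boolean visibility mask of shape [num_frames][num_events].

    mask[f][e] is True when frame f may attend to the prompt of event e.
    Assumption (not from the paper): frames are split uniformly across
    events, and each event's window is widened by `band` frames on both
    sides to form a transition band shared with its neighbors.
    """
    if not 1 <= num_events <= num_frames:
        raise ValueError("need 1 <= num_events <= num_frames")
    if band < 0:
        raise ValueError("band must be non-negative")

    # Hypothetical uniform boundaries: event e owns frames
    # bounds[e] .. bounds[e + 1] - 1 before the band is applied.
    bounds = [round(e * num_frames / num_events) for e in range(num_events + 1)]

    return [
        [bounds[e] - band <= f < bounds[e + 1] + band for e in range(num_events)]
        for f in range(num_frames)
    ]


if __name__ == "__main__":
    # Print one row per frame; 'x' marks an event prompt the frame can see.
    for f, row in enumerate(event_partition_mask(8, 2, band=1)):
        print(f, "".join("x" if visible else "." for visible in row))
```

With `band=0` each frame sees exactly one event prompt, which corresponds to strict event separation. With 8 frames, 2 events, and `band=1`, frames 3 and 4 see both prompts, which is one simple way to picture a cross-event transition band. In a real DiT such a mask would be expanded to token level and applied inside cross-attention; that step is omitted here.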