PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
2026-03-26 • Computer Vision and Pattern Recognition • Artificial Intelligence
AI summary
The authors introduce PackForcing, a method for generating long videos by efficiently managing memory and curbing compounding errors. They divide the video history into three parts: key early frames kept in full detail, heavily compressed middle frames, and recent frames also in full detail to keep motion smooth. Their approach selects only the most relevant context and re-aligns positional encodings to keep videos coherent within a bounded memory budget. This lets their model create high-quality 2-minute videos on a single GPU, even when trained only on short clips. Evaluations show it produces more consistent and dynamic videos than previous methods.
autoregressive models, video diffusion, KV-cache, temporal coherence, 3D convolutions, variational autoencoder (VAE), top-k selection, Rotary Position Embedding (RoPE), zero-shot learning, long-video synthesis
Authors
Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang
Abstract
Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-$k$ context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
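The abstract's three-partition cache strategy can be illustrated with a small sketch. The following is a hypothetical, simplified illustration (not the authors' implementation): `partition_cache`, `compress_mid`, `topk_mid`, and `realigned_positions` are invented names; the dual-branch compressor (3D convolutions plus low-resolution VAE re-encoding) is stood in for by simple average pooling that mimics the 32x token reduction, and relevance scoring uses a plain dot product.

```python
import numpy as np

def partition_cache(tokens, n_sink, n_recent):
    """Split cached history tokens into sink / mid / recent partitions."""
    sink = tokens[:n_sink]                           # early anchor frames, full resolution
    mid = tokens[n_sink:len(tokens) - n_recent]      # middle history, to be compressed
    recent = tokens[len(tokens) - n_recent:]         # latest frames, full resolution
    return sink, mid, recent

def compress_mid(mid, factor=32):
    """Stand-in for the dual-branch compressor: average-pool groups of
    `factor` tokens to mimic the paper's 32x token reduction."""
    n = (len(mid) // factor) * factor
    if n == 0:
        return mid[:0]
    return mid[:n].reshape(-1, factor, mid.shape[-1]).mean(axis=1)

def topk_mid(mid, query, k):
    """Dynamic top-k context selection: keep the k compressed mid tokens
    most relevant to the current query, preserving temporal order."""
    if len(mid) <= k:
        return mid
    scores = mid @ query                             # dot-product relevance (assumption)
    keep = np.sort(np.argsort(scores)[-k:])          # top-k indices, re-sorted in time
    return mid[keep]

def realigned_positions(n_sink, n_mid_kept, n_recent):
    """Temporal RoPE adjustment: after dropping tokens, assign contiguous
    positions so the retained cache has no positional gaps."""
    return np.arange(n_sink + n_mid_kept + n_recent)

# Usage: 2048 history tokens with 64-dim features.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((2048, 64))
sink, mid, recent = partition_cache(tokens, n_sink=64, n_recent=256)
mid_c = compress_mid(mid)                            # 1728 mid tokens -> 54
mid_k = topk_mid(mid_c, rng.standard_normal(64), k=32)
pos = realigned_positions(len(sink), len(mid_k), len(recent))
```

Under these toy numbers the attended context shrinks from 2048 tokens to 64 + 32 + 256 = 352, which is the mechanism behind the bounded KV cache the abstract describes.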