AI summaryⓘ
The authors developed a new video compression method that works well at very low bitrates by combining two techniques: implicit neural representations (INRs) and pre-trained video diffusion models. INRs help by creating compact summaries of video frames, while diffusion models use learned patterns from large datasets to improve quality. They replace traditional video keyframes with these neural summaries to guide the compression process more efficiently. Tests showed their method produces better visual quality than existing video codecs like HEVC and VVC at very low data rates. Their work also reveals that the compression first captures the general scene and objects before improving fine details, leading to visually faithful videos with minimal data.
video compressionimplicit neural representationsvideo diffusion modelsbitratekeyframeLPIPSFIDHEVCVVCperceptual metrics
Authors
Eren Çetin, Lucas Relic, Yuanyi Xue, Markus Gross, Christopher Schroers, Roberto Azevedo
Abstract
We present a perceptually-driven video compression framework integrating implicit neural representations (INRs) and pre-trained video diffusion models to address the extremely low bitrate regime (<0.05 bpp). Our approach exploits the complementary strengths of INRs, which provide a compact video representation, and diffusion models, which offer rich generative priors learned from large-scale datasets. The INR-based conditioning replaces traditional intra-coded keyframes with bit-efficient neural representations trained to estimate latent features and guide the diffusion process. Our joint optimization of INR weights and parameter-efficient adapters for diffusion models allows the model to learn reliable conditioning signals while encoding video-specific information with minimal parameter overhead. Our experiments on UVG, MCL-JCV, and JVET Class-B benchmarks demonstrate substantial improvements in perceptual metrics (LPIPS, DISTS, and FID) at extremely low bitrates, including improvements on BD-LPIPS up to 0.214 and BD-FID up to 91.14 relative to HEVC, while also outperforming VVC and previous strong state-of-the-art neural and INR-only video codecs. Moreover, our analysis shows that INR-conditioned diffusion-based video compression first composes the scene layout and object identities before refining textural accuracy, exposing the semantic-to-visual hierarchy that enables perceptually faithful compression at extremely low bitrates.