Lighting-grounded Video Generation with Renderer-based Agent Reasoning

2026-04-09
Computer Vision and Pattern Recognition

AI summary

The authors created LiVER, a new method that helps generate videos where you can control things like object positions, lighting, and camera movements separately. They built a large dataset to teach their model how to understand and use 3D scene details for video creation. LiVER uses a special training approach and a simple module to include these controls smoothly in the video-making process. It can turn images into videos or modify existing videos while letting users change the 3D scene exactly as they want. They also made a tool that converts easy instructions into the technical 3D controls needed for the model.

diffusion models, video generation, 3D scene properties, object layout, lighting control, camera trajectory, video-to-video synthesis, image-to-video synthesis, conditioning module, temporal consistency
Authors
Ziqi Cai, Taoyu Yang, Zheng Chang, Si Li, Han Jiang, Shuchen Weng, Boxin Shi
Abstract
Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, which restricts the use of these models in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. LiVER conditions video synthesis on explicit 3D scene properties and is supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors.
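
To make the idea of disentangled scene factors concrete, the following is a minimal illustrative sketch of how layout, lighting, and camera trajectory might be held as separate fields of one scene description, so that editing one factor leaves the others untouched. It is not taken from LiVER: every name here (`SceneControls`, `ObjectBox`, `Light`, `CameraPose`, `relight`, `orbit_camera`) is a hypothetical assumption, and the rendering of control signals and the diffusion model itself are out of scope.

```python
import math
from dataclasses import dataclass, replace
from typing import Tuple

Vec3 = Tuple[float, float, float]


# All types below are hypothetical illustrations, not LiVER's actual data format.
@dataclass(frozen=True)
class ObjectBox:
    name: str
    center: Vec3  # box center in world coordinates
    size: Vec3    # box extent along x, y, z


@dataclass(frozen=True)
class Light:
    direction: Vec3
    intensity: float


@dataclass(frozen=True)
class CameraPose:
    position: Vec3
    look_at: Vec3


@dataclass(frozen=True)
class SceneControls:
    layout: Tuple[ObjectBox, ...]
    lighting: Light
    camera_path: Tuple[CameraPose, ...]  # one pose per frame


def relight(scene: SceneControls, direction: Vec3, intensity: float) -> SceneControls:
    """Return a copy of the scene with only the lighting changed."""
    return replace(scene, lighting=Light(direction, intensity))


def orbit_camera(look_at: Vec3, radius: float, height: float,
                 num_frames: int) -> Tuple[CameraPose, ...]:
    """Build a circular camera path around `look_at`, one pose per frame."""
    poses = []
    for i in range(num_frames):
        theta = 2.0 * math.pi * i / num_frames
        position = (
            look_at[0] + radius * math.cos(theta),
            look_at[1] + height,
            look_at[2] + radius * math.sin(theta),
        )
        poses.append(CameraPose(position=position, look_at=look_at))
    return tuple(poses)


scene = SceneControls(
    layout=(ObjectBox("chair", (0.0, 0.5, 0.0), (1.0, 1.0, 1.0)),),
    lighting=Light((0.0, -1.0, 0.0), 1.0),
    camera_path=orbit_camera((0.0, 0.0, 0.0), radius=3.0, height=1.5, num_frames=8),
)

# Editing the lighting produces a new scene; layout and camera path are shared unchanged.
relit = relight(scene, direction=(1.0, -1.0, 0.0), intensity=2.0)
```

Keeping the three factors as independent immutable fields is one simple way to express the kind of separate control the abstract describes; in this sketch a high-level instruction such as "make the light stronger" would map to a single call to `relight`.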