Novel View Synthesis as Video Completion
2026-04-09 • Computer Vision and Pattern Recognition
AI summary
The authors address how to create new views of a scene from just a few pictures taken from different angles, using video-based AI models. Unlike previous methods that use AI trained on single images, they use video models that already understand multiple views, by treating the problem like filling in missing frames in a low frame-rate video. They make changes to these video models so they don’t rely on the order of input images, which is important because the input pictures can come in any order. Their approach, called FrameCrafter, shows promising results with only a little extra training.
novel view synthesis, video diffusion models, multi-view images, camera pose, permutation invariance, low frame-rate video completion, latent encoding, temporal positional embeddings, sparse view, FrameCrafter
Authors
Qi Wu, Khiem Vuong, Minsik Jeon, Srinivasa Narasimhan, Deva Ramanan
Abstract
We tackle the problem of sparse novel view synthesis (NVS) using video diffusion models; given $K$ ($\approx 5$) multi-view images of a scene and their camera poses, we predict the view from a target camera pose. Many prior approaches leverage generative image priors encoded via diffusion models. However, models trained on single images lack multi-view knowledge. We instead argue that video models already contain implicit multi-view knowledge and so should be easier to adapt for NVS. Our key insight is to formulate sparse NVS as a low frame-rate video completion task. However, one challenge is that sparse NVS is defined over an unordered set of inputs, often too sparse to admit a meaningful order, so the models should be $\textit{invariant}$ to permutations of that input set. To this end, we present FrameCrafter, which adapts video models (naturally trained with coherent frame orderings) to permutation-invariant NVS through several architectural modifications, including per-frame latent encodings and removal of temporal positional embeddings. Our results suggest that video models can be easily trained to "forget" about time with minimal supervision, producing competitive performance on sparse-view NVS benchmarks. Project page: https://frame-crafter.github.io/
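The link between temporal positional embeddings and permutation invariance can be illustrated with a toy example. The sketch below is not FrameCrafter's implementation: it assumes a single-head self-attention layer in NumPy with one token per view, and every name in it (`self_attention`, `predict_target`, the random weights and embeddings) is hypothetical. It only shows that, without per-frame positional embeddings, the output for the target view does not depend on the order of the input views, whereas adding position-indexed embeddings breaks that property.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over a set of tokens (no positional term)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
K, D = 5, 8                              # K input views, D-dim token per view (toy sizes)
context = rng.normal(size=(K, D))        # stand-in for per-frame latents of the input views
target = rng.normal(size=(1, D))         # stand-in for the target-view token
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))

def predict_target(context, target, pos=None):
    """Return the attention output at the target token (always placed last)."""
    tokens = np.concatenate([context, target], axis=0)
    if pos is not None:
        tokens = tokens + pos            # position-indexed ("temporal") embeddings
    return self_attention(tokens, w_q, w_k, w_v)[-1]

perm = np.array([2, 0, 4, 1, 3])         # a fixed, non-identity reordering of the inputs

# Without positional embeddings: reordering the input views leaves the output unchanged.
invariant_without_pos = np.allclose(
    predict_target(context, target),
    predict_target(context[perm], target),
)

# With positional embeddings tied to slot index: the same reordering changes the output.
pos = rng.normal(size=(K + 1, D))
invariant_with_pos = np.allclose(
    predict_target(context, target, pos),
    predict_target(context[perm], target, pos),
)

print("invariant without positional embeddings:", invariant_without_pos)
print("invariant with positional embeddings:", invariant_with_pos)
```

In this toy setting the target output is a softmax-weighted sum over the set of tokens, so it is unaffected by reordering unless something ties each token to its slot. How the actual model encodes camera poses and per-frame latents is described in the paper, not here.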