RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation

2026-06-25

Computer Vision and Pattern Recognition
AI summary

The authors observe that existing video diffusion transformers position their tokens only on the camera's sampling grid (u, v, t), which says nothing about the 3D structure of the scene. They introduce RayPE, which injects geometric information about each token's camera ray into the queries and keys of self-attention, so that attention scores reflect the geometric relation between rays. The encoding is normalized and gated so that it stays stable across video datasets whose camera-translation scales differ, and the module is zero-initialized so that training starts from the pretrained model's behavior. The authors report improved camera controllability, cross-frame 3D consistency, and overall video quality.

video diffusion transformers, positional encoding, Plucker coordinates, self-attention, transformer models, 3D geometry, camera rays, SfM (Structure from Motion), SLAM (Simultaneous Localization and Mapping), RMSNorm
Authors
Minghao Yin, Jiahao Lu, Wenbo Hu, Wang Zhao, Shan Ying, Kai Han
Abstract
Modern video diffusion transformers position their tokens through RoPE on the (u, v, t) axes -- a description of the camera's sampling grid that says nothing about the 3D structure of the scene. We observe that the geometric relation between two camera rays is captured by the Plucker reciprocal product, which is bilinear in the two rays -- the same algebraic form as the dot product in Transformer attention. Building on this analogy, we propose RayPE, a positional-encoding extension that injects per-token 6D Plucker coordinates additively into the queries and keys of self-attention, with a query/key flip arrangement under which the symmetric identity configuration coincides exactly with the reciprocal product. Because the injection is additive, the resulting attention score decomposes into a content term, a geometry term, and two content-geometry cross-terms, all of which our experiments find individually necessary. To make the encoding stable across video data with heterogeneous camera-translation scales (SfM, deep SLAM, metric), we further decouple ray direction from moment magnitude, gate the encoding by a learned function of the log-magnitude, and apply RMSNorm to align it with the QKNorm-normalized content branch. The full module adds less than 0.1% additional parameters to a pretrained video DiT, is zero-initialized so that training starts from the pretrained weights, and improves camera controllability, cross-frame 3D consistency, and overall video quality on a four-dataset training mixture.
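
The analogy between the reciprocal product and the attention dot product can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`plucker`, `flip`, `reciprocal_product`, `score_terms`), the moment convention m = o x d, the choice to apply the flip on the key side, and the use of identity maps in place of learned projections are all assumptions made for illustration. The scale decoupling, gating, and RMSNorm steps are omitted.

```python
import numpy as np

def plucker(origin, direction):
    """6D Plucker coordinates (d, m) of a ray, with unit direction d and
    moment m = o x d (assumed convention)."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    m = np.cross(np.asarray(origin, dtype=float), d)
    return np.concatenate([d, m])

def flip(ray):
    """Swap the direction and moment halves: (d, m) -> (m, d)."""
    return np.concatenate([ray[3:], ray[:3]])

def reciprocal_product(r1, r2):
    """Plucker reciprocal product d1.m2 + d2.m1; it vanishes for coplanar lines."""
    return r1[:3] @ r2[3:] + r2[:3] @ r1[3:]

def score_terms(q_c, k_c, r_q, r_k):
    """Four terms of (q_c + r_q) . (k_c + flip(r_k)), using identity maps
    where a real model would use learned projections (assumption)."""
    g_q, g_k = r_q, flip(r_k)
    return {
        "content": q_c @ k_c,
        "geometry": g_q @ g_k,
        "content_x_geometry": q_c @ g_k,
        "geometry_x_content": g_q @ k_c,
    }

# Rays a and b intersect at (1, 0, 0); rays a and c are skew.
a = plucker([0, 0, 0], [1, 0, 0])
b = plucker([1, 1, 0], [0, 1, 0])
c = plucker([0, 0, 1], [0, 1, 0])

# With the flip on one side, a plain dot product is the reciprocal product.
assert np.isclose(a @ flip(b), reciprocal_product(a, b))
assert np.isclose(a @ flip(c), reciprocal_product(a, c))

# Additive injection: the full score is the sum of the four terms.
rng = np.random.default_rng(0)
q_c, k_c = rng.normal(size=6), rng.normal(size=6)  # toy 6-dim content vectors
terms = score_terms(q_c, k_c, a, c)
full_score = (q_c + a) @ (k_c + flip(c))
assert np.isclose(full_score, sum(terms.values()))
```

In this toy setting the geometry term is zero for the intersecting pair (a, b) and nonzero for the skew pair (a, c), which is the kind of ray-to-ray relation that a grid-only encoding on (u, v, t) cannot express.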