SpheRoPE: Zero-Shot Optimization-Free 360 Panorama Generation with Spherical RoPE

2026-06-30
Computer Vision and Pattern Recognition

AI summary

The authors propose a method for generating 360-degree panoramic images and videos with pre-trained diffusion transformers, requiring no additional training and no slow inference-time optimization. They observe that existing models already have some ability to produce panoramas but do not satisfy the constraints of equirectangular projection, the format that flattens the sphere into a rectangular image. To fix this, they replace the model’s rotary position embeddings with a spherical version: low-frequency channels encode 3D coordinates on the sphere, and high-frequency channels are adjusted to be exactly periodic. A complementary classifier-free guidance technique steers the geometry. The approach works across different backbones (Flux.1, Flux.2, LTX-Video) and types of 360 content, and is reported to be competitive with baselines without retraining.

360 panoramic images, diffusion transformers, zero-shot generation, rotary position embeddings, equirectangular projection, spherical manifold, classifier-free guidance, text-to-image generation, semantic distortion, pre-trained models
Authors
Or Hirschorn, Aaron Olender, Eli Alshan, Ianir Ideses, Lior Fritz, Sagie Benaim
Abstract
We present a zero-shot, training-free and optimization-free framework for generating 360 panoramic images and videos by directly injecting spherical priors into pre-trained diffusion transformers. Existing methods either rely on costly fine-tuning on scarce panoramic data that limits generalization, or leverage multi-step optimization that incurs prohibitive inference latency. We observe that contemporary generative models natively exhibit some panoramic priors from large-scale training. However, these emergent capabilities are insufficient, as the models fundamentally fail to satisfy the rigorous topological constraints imposed by equirectangular projection (ERP). We introduce a zero-shot and optimization-free approach that resolves these constraints at inference time. Spherical RoPE replaces standard rotary position embeddings: low-frequency channels are re-parameterized as 3D Cartesian coordinates to natively encode the spherical manifold, while high-frequency channels are harmonically quantized to enforce exact periodicity. Coupled with complementary Semantic Distortion classifier-free guidance (CFG) that explicitly steers geometry, we avoid retraining and inherit the full creative breadth of state-of-the-art models. Our approach generalizes across diverse backbones and 360 generation modalities. We demonstrate this across text-to-panorama using Flux.1, Flux.2, and LTX-Video backbones, achieving competitive performance against baselines, all while remaining training-free. Project page: https://orhir.github.io/SpheRoPE
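
The two ingredients the abstract names for Spherical RoPE can be illustrated with a small sketch. This is not the authors' implementation: the function names, the RoPE base of 10000, the pixel-centre convention, and the nearest-harmonic rounding rule are all assumptions chosen for illustration. The sketch shows (1) how an ERP pixel maps to 3D Cartesian coordinates on the unit sphere, and (2) how snapping a rotary frequency to an integer multiple of 2π/width makes the accumulated phase wrap exactly after one full horizontal turn.

```python
import math

def erp_to_unit_vector(row, col, height, width):
    """Map an ERP pixel centre to a point (x, y, z) on the unit sphere."""
    lon = 2.0 * math.pi * (col + 0.5) / width - math.pi   # longitude in (-pi, pi)
    lat = math.pi / 2.0 - math.pi * (row + 0.5) / height  # latitude in (-pi/2, pi/2)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def rope_frequencies(num_pairs, base=10000.0):
    """Standard RoPE angular frequencies, one per channel pair (assumed base)."""
    return [base ** (-i / num_pairs) for i in range(num_pairs)]

def quantize_to_harmonics(freqs, width):
    """Snap each frequency to the nearest nonzero integer multiple of 2*pi/width.

    Hypothetical rounding rule: the source only states that high-frequency
    channels are 'harmonically quantized to enforce exact periodicity'.
    """
    fundamental = 2.0 * math.pi / width
    return [max(1, round(f / fundamental)) * fundamental for f in freqs]

def wrap_error(freq, width):
    """Phase mismatch (radians) between column 0 and column `width`."""
    phase = freq * width
    return abs(phase - 2.0 * math.pi * round(phase / (2.0 * math.pi)))

if __name__ == "__main__":
    width, height = 64, 32
    raw = rope_frequencies(8)
    snapped = quantize_to_harmonics(raw, width)
    for f_raw, f_snap in zip(raw, snapped):
        print(f"{f_raw:.5f} -> {f_snap:.5f}  "
              f"wrap error {wrap_error(f_raw, width):.3e} -> {wrap_error(f_snap, width):.3e}")
    print(erp_to_unit_vector(0, 0, height, width))
```

Under this rounding rule, every frequency below the fundamental 2π/width collapses onto the same harmonic, so quantization alone cannot distinguish the low-frequency channels. That is consistent with the abstract treating those channels separately via 3D Cartesian coordinates, although the exact way the coordinates enter the embedding is not specified here.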