World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

2026-06-11 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Graphics

AI summary

The authors present World Tracing, a method that predicts an ordered stack of 3D points for each input pixel, capturing both the visible surface and the occluded surfaces behind it. Their model, WT-DiT, is a diffusion transformer that denoises multiple geometry layers jointly, balancing accuracy on visible parts against completion of occluded parts. Compared with prior methods, this approach stays aligned with the input image while still generating complete 3D shapes. It also preserves the correspondence between 2D pixels and 3D points, enabling applications such as text-driven 3D scene editing and geometry-conditioned novel-view video synthesis.

image-to-3D, depth estimation, 3D reconstruction, occlusion, transformer, diffusion models, pixel alignment, geometry completion, novel-view synthesis, text-driven editing

Authors

Hao Zhang, Mohamed El Banani, Jen-Hao Cheng, Paul Zhang, Yi Hua, Ben Mildenhall, Christoph Lassner, Narendra Ahuja, Gengshan Yang

Abstract

Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
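To make the layered representation described above concrete, the following toy sketch builds a per-pixel ordered stack of camera-space points analytically for a single sphere. It is an illustration of the output format only, not the WT-DiT model, which predicts such stacks from an image. The function name `trace_sphere_layers`, the pinhole camera with a centered principal point, and the use of `None` for rays with no intersection are all assumptions made for this example.

```python
import math


def trace_sphere_layers(height, width, focal, center, radius, num_layers=2):
    """Toy layered geometry: layers[v][u] is a front-to-back list of
    camera-space (x, y, z) points where the ray through pixel (u, v)
    intersects a sphere. Missing layers are None (an assumed convention)."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0  # assumed principal point
    layers = []
    for v in range(height):
        row = []
        for u in range(width):
            # Unit ray direction through the pixel; the camera looks along +z.
            d = ((u - cx) / focal, (v - cy) / focal, 1.0)
            norm = math.sqrt(sum(c * c for c in d))
            d = tuple(c / norm for c in d)
            # Solve |t*d - center|^2 = radius^2 for the ray parameter t.
            b = sum(di * ci for di, ci in zip(d, center))
            c = sum(ci * ci for ci in center) - radius * radius
            disc = b * b - c
            hits = []
            if disc > 0:
                root = math.sqrt(disc)
                # Ascending t gives front-to-back order along the ray.
                for t in (b - root, b + root):
                    if t > 0:
                        hits.append(tuple(t * di for di in d))
            hits = hits[:num_layers]
            hits += [None] * (num_layers - len(hits))
            row.append(hits)
        layers.append(row)
    return layers


# Sphere of radius 1 centered 4 units in front of the camera.
layers = trace_sphere_layers(5, 5, 5.0, (0.0, 0.0, 4.0), 1.0)
front, back = layers[2][2]   # center pixel
print(front)  # (0.0, 0.0, 3.0): layer 1, the visible surface
print(back)   # (0.0, 0.0, 5.0): layer 2, the occluded back face
```

A depth estimator in this picture would recover only the first layer. The later layers are the occluded intersections that the abstract describes as generated geometry. Because every point is indexed by its pixel, the 2D-to-3D correspondence holds by construction.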
