Surflo: Consistent 3D Surface Flow Model with Global State
2026-06-11 • Computer Vision and Pattern Recognition
AI summary
The authors present Surflo, a method that compresses multiple unposed RGB images into a single compact latent representation and reconstructs 3D surfaces from it. Unlike previous approaches that produce overlapping or low-resolution outputs, Surflo can generate a flexible number of oriented 3D points without being limited by a fixed grid. It uses flow matching to transport noise samples onto the surface, and it makes nearby points consistent by injecting a photometric gradient during ODE integration at inference time. The approach is about an order of magnitude faster than optimization-based methods and matches or surpasses feed-forward baselines on surface metrics.
3D reconstruction, flow matching, latent representation, point clouds, RGB images, ODE integration, photometric gradients, global latent, surface metrics, feed-forward methods
Authors
Antoine Guédon, Shu Nakamura, Nicolas Dufour, Jiahui Lei, Ko Nishino, Angjoo Kanazawa
Abstract
Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps that grow linearly with input count, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens (one global state) and decodes oriented 3D surface points by independently transporting them from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.
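The decoding idea in the abstract can be illustrated with a toy sketch. This is not Surflo's implementation. It assumes a hand-written velocity field in place of the learned network, a single scalar `radius` standing in for the K latent tokens, and a hypothetical `guidance` callback where the photometric gradient would enter. All function and parameter names are illustrative. It shows only the general pattern: each point is drawn from noise and integrated independently with Euler steps, so the number of output points is a free parameter.

```python
import math
import random


def toy_velocity(point, t, radius):
    """Hypothetical stand-in for a learned velocity network.

    Pulls a point toward the sphere of the given radius; `radius`
    plays the role of the global latent in this toy example.
    """
    norm = math.sqrt(sum(c * c for c in point)) or 1e-12
    return tuple((radius * c / norm - c) / (1.0 - t) for c in point)


def decode_points(radius, num_points, num_steps=16,
                  guidance=None, guidance_scale=0.0, seed=0):
    """Transport `num_points` noise samples onto the toy surface.

    Each point is integrated on its own, so `num_points` can be
    chosen freely for the same latent.
    """
    rng = random.Random(seed)
    dt = 1.0 / num_steps
    points = []
    for _ in range(num_points):
        # Start from Gaussian noise at t = 0.
        x = tuple(rng.gauss(0.0, 1.0) for _ in range(3))
        for i in range(num_steps):
            t = i * dt
            v = toy_velocity(x, t, radius)
            if guidance is not None:
                # Assumed form of guidance: subtract a scaled gradient
                # from the velocity at every integration step.
                g = guidance(x, t)
                v = tuple(vc - guidance_scale * gc for vc, gc in zip(v, g))
            # Explicit Euler step of the ODE dx/dt = v(x, t).
            x = tuple(xc + dt * vc for xc, vc in zip(x, v))
        points.append(x)
    return points


coarse = decode_points(radius=2.0, num_points=200)
dense = decode_points(radius=2.0, num_points=1000)
```

Here `coarse` and `dense` come from the same toy latent and differ only in point count. In a real system the velocity field would be a network conditioned on the latent tokens, and the guidance callback would return a gradient computed from the input images.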