SonoWorld: From One Image to a 3D Audio-Visual Scene

2026-03-30 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Multimedia, Sound

AI summary

The authors introduce a new task, Image2AVScene: generating a 3D audio-visual scene from a single image. Their system, SonoWorld, outpaints the image into a full 360° panorama, lifts it into a navigable 3D scene, and places language-guided sound anchors that match the scene content. It then renders spatial audio (ambisonics) aligned with the scene’s geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study support the approach. The authors also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation.

3D scene generation, 360° panorama, spatial audio, ambisonics, audio-visual rendering, sound anchors, one-shot learning, audio source separation, scene semantics, image outpainting

Authors

Derong Jin, Xiyi Chen, Ming C. Lin, Ruohan Gao

Abstract

Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/
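
To make the "sound anchors rendered as ambisonics" idea concrete, here is a minimal sketch of standard first-order ambisonic encoding for a single point source. It is not the authors' implementation: the function names `anchor_direction` and `encode_foa`, the axis convention (x forward, y left, z up), and the AmbiX channel layout (ACN order W, Y, Z, X with SN3D normalization) are all assumptions chosen for illustration. Areal and ambient sources, distance attenuation, and room acoustics are left out.

```python
import math

def anchor_direction(anchor_xyz, listener_xyz):
    """Return (azimuth_deg, elevation_deg, distance) of a hypothetical sound
    anchor as seen from the listener. Assumed axes: x forward, y left, z up."""
    dx = anchor_xyz[0] - listener_xyz[0]
    dy = anchor_xyz[1] - listener_xyz[1]
    dz = anchor_xyz[2] - listener_xyz[2]
    azimuth = math.degrees(math.atan2(dy, dx))                    # counter-clockwise from front
    elevation = math.degrees(math.atan2(dz, math.hypot(dx, dy)))  # positive is up
    distance = math.dist(anchor_xyz, listener_xyz)
    return azimuth, elevation, distance

def encode_foa(samples, azimuth_deg, elevation_deg):
    """Encode a mono signal into four first-order ambisonic channels.
    Assumed convention: AmbiX (ACN order W, Y, Z, X; SN3D normalization)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional component
        math.sin(az) * math.cos(el),  # Y: left-right component
        math.sin(el),                 # Z: up-down component
        math.cos(az) * math.cos(el),  # X: front-back component
    )
    # One output channel per gain, each a scaled copy of the mono signal.
    return [[g * s for s in samples] for g in gains]

if __name__ == "__main__":
    # A hypothetical anchor two metres to the listener's left.
    az, el, dist = anchor_direction((0.0, 2.0, 0.0), (0.0, 0.0, 0.0))
    w, y, z, x = encode_foa([1.0, 0.5, -0.25], az, el)
    print(f"azimuth={az:.1f} elevation={el:.1f} distance={dist:.1f}")
    print("W:", w)
    print("Y:", y)
```

Because the listener position is an input, re-running `anchor_direction` as the viewpoint moves gives the updated direction for each anchor, which is the basic ingredient of free-viewpoint spatial audio under these assumptions.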
