Official implementation of the CVPR 2026 paper "SonoWorld: From One Image to a 3D Audio-Visual Scene."
TL;DR Given a single input image, SonoWorld generates a 3D audio-visual scene with spatialized sound and scene-level assets.
Dataset and full setup instructions are coming soon.
After installing the required dependencies and preparing any model credentials/checkpoints needed by the selected stages, run:
python generate.py \
--scene_root outputs/example_scene \
--config configs/default.yaml \
--input_image test-inputs/fall.jpgUse --resume to continue a partially completed scene and --force to rerun completed stages.
-
06.17.2026Environment setup instructions -
06.02.2026Released generation code - SonoScene360 dataset
- Rendering code, evaluation tools, interactive viewer, and additional examples
- Support for generation with open-source backbones, specifically HunyuanWorld 1.0 and LLaVA
We tested SonoWorld on an NVIDIA A6000 with GCC 14.2.0, CUDA 12.4.1, and Python 3.12.
Clone the repository and create the environment:
git clone --branch main --single-branch git@github.com:HuMathe/sonoworld.git
cd sonoworld
conda create -n sonoworld python=3.12
conda activate sonoworldInstall PyTorch
pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0 --index-url https://download.pytorch.org/whl/cu128Install SAM3, MMAudio, GeoCalib, and MoGe:
mkdir -p third_party
git clone https://github.com/facebookresearch/sam3.git third_party/sam3
git -C third_party/sam3 checkout 757bbb0206a0b68bee81b17d7eb4877177025b2f
pip install -e third_party/sam3
git clone https://github.com/hkchengrex/MMAudio.git third_party/MMAudio
pip install -e third_party/MMAudio
git clone https://github.com/cvg/GeoCalib.git third_party/GeoCalib
pip install -e third_party/GeoCalib
pip install git+https://github.com/microsoft/MoGe.gitInstall the remaining dependencies:
pip install -r requirements.txt
pip install --force-reinstall "setuptools<82"
conda install -c conda-forge ffmpegIf you find our work useful, please cite:
@article{jin2026sonoworld,
title={SonoWorld: From One Image to a 3D Audio-Visual Scene},
author={Jin, Derong and Chen, Xiyi and Lin, Ming C. and Gao, Ruohan},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}
This project is released under the MIT Licence. See LICENSE.
Third-party code included in this repository may be subject to its own license terms, as noted in the corresponding source files.