Skip to content

HuMathe/sonoworld

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SonoWorld: From One Image to a 3D Audio-Visual Scene (CVPR 2026)

Official implementation of the CVPR 2026 paper "SonoWorld: From One Image to a 3D Audio-Visual Scene."

TL;DR Given a single input image, SonoWorld generates a 3D audio-visual scene with spatialized sound and scene-level assets.

Dataset and full setup instructions are coming soon.

Quick Start

After installing the required dependencies and preparing any model credentials/checkpoints needed by the selected stages, run:

python generate.py \
    --scene_root outputs/example_scene \
    --config configs/default.yaml \
    --input_image test-inputs/fall.jpg

Use --resume to continue a partially completed scene and --force to rerun completed stages.

News and Planned TODOs

  • 06.17.2026 Environment setup instructions
  • 06.02.2026 Released generation code
  • SonoScene360 dataset
  • Rendering code, evaluation tools, interactive viewer, and additional examples
  • Support for generation with open-source backbones, specifically HunyuanWorld 1.0 and LLaVA

Installation

We tested SonoWorld on an NVIDIA A6000 with GCC 14.2.0, CUDA 12.4.1, and Python 3.12.

Clone the repository and create the environment:

git clone --branch main --single-branch git@github.com:HuMathe/sonoworld.git
cd sonoworld
conda create -n sonoworld python=3.12
conda activate sonoworld

Install PyTorch

pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0 --index-url https://download.pytorch.org/whl/cu128

Install SAM3, MMAudio, GeoCalib, and MoGe:

mkdir -p third_party

git clone https://github.com/facebookresearch/sam3.git third_party/sam3
git -C third_party/sam3 checkout 757bbb0206a0b68bee81b17d7eb4877177025b2f
pip install -e third_party/sam3

git clone https://github.com/hkchengrex/MMAudio.git third_party/MMAudio
pip install -e third_party/MMAudio

git clone https://github.com/cvg/GeoCalib.git third_party/GeoCalib
pip install -e third_party/GeoCalib

pip install git+https://github.com/microsoft/MoGe.git

Install the remaining dependencies:

pip install -r requirements.txt
pip install --force-reinstall "setuptools<82"
conda install -c conda-forge ffmpeg

Citation

If you find our work useful, please cite:

@article{jin2026sonoworld,
    title={SonoWorld: From One Image to a 3D Audio-Visual Scene},
    author={Jin, Derong and Chen, Xiyi and Lin, Ming C. and Gao, Ruohan},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2026}
}

Licence

This project is released under the MIT Licence. See LICENSE.

Third-party code included in this repository may be subject to its own license terms, as noted in the corresponding source files.

About

Official implementation of the CVPR 2026 paper "SonoWorld: From One Image to a 3D Audio-Visual Scene."

Resources

License

Stars

Watchers

Forks

Contributors

Languages