GitHub - sv-pp/SceneVersepp: Official implementation of CVPR26 paper "Lifting Unlabeled Internet-level Data for 3D Scene Understanding"

Lifting Unlabeled Internet-level Data for 3D Scene Understanding
CVPR 2026

TL;DR

Annotated 3D scene data is scarce. We build an automated data engine that lifts web videos into structured 3D supervision (instance-level point clouds, object layouts, spatial VQA, and vision-language navigation) and show through experiments that the generated data has strong potential to supplement training data for a broad range of 3D scene understanding tasks.

What's in this repo

This is the public release of the training code and data pipeline from the paper.

Directory	Purpose
`PQ3D/`	3D instance segmentation training
`SpatialLM/`	3D object detection training
`data_processing/`	Video download, frame extraction, camera-pose visualization for the SVPP dataset

Quick start

1. Get the dataset

huggingface-cli download bigai/SceneVersepp --repo-type dataset --local-dir ./svpp
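
For scripted setups, the same download can likely be driven from Python. The sketch below assumes the `huggingface_hub` package is installed (it is the package that provides `huggingface-cli`) and uses its `snapshot_download` function. The helper names `build_download_kwargs` and `download_svpp`, the `metadata_only` option, and its `*data_info.json` pattern are illustrative assumptions, not part of this repository.

```python
from pathlib import Path

REPO_ID = "bigai/SceneVersepp"  # dataset repo named in the CLI command above


def build_download_kwargs(local_dir="./svpp", metadata_only=False):
    """Assemble keyword arguments for huggingface_hub.snapshot_download."""
    kwargs = {
        "repo_id": REPO_ID,
        "repo_type": "dataset",
        "local_dir": str(Path(local_dir)),
    }
    if metadata_only:
        # Assumption: per-scene metadata files are named data_info.json.
        kwargs["allow_patterns"] = ["*data_info.json"]
    return kwargs


def download_svpp(local_dir="./svpp", metadata_only=False):
    """Download the dataset snapshot and return the local folder path."""
    # Imported lazily so build_download_kwargs works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(local_dir, metadata_only))
```

Calling `download_svpp("./svpp")` should mirror the CLI command. Passing `metadata_only=True` is a hypothetical way to inspect the scene metadata before fetching everything else.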

2. Set up the data-processing environment

The scripts in `data_processing/` (video download, frame extraction, pose visualization) use a lightweight environment defined by `requirements.txt`:

conda create -n svpp python=3.10 -y
conda activate svpp
pip install -r requirements.txt
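
To confirm that the environment was set up as intended, a small check along the following lines may help. It is a minimal sketch using only the standard library. The helpers `requirement_names` and `missing_packages` are hypothetical, and the parsing is deliberately naive: it assumes plain `name` or `name==version` style lines and will not handle URL or VCS requirements correctly.

```python
import re
from importlib import metadata


def requirement_names(text):
    """Extract distribution names from requirements.txt-style text (naive)."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Run inside the activated `svpp` environment, `missing_packages(requirement_names(open("requirements.txt").read()))` should return an empty list if the install succeeded.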

The training stacks under PQ3D/ and SpatialLM/ each have their own heavier environments. See their respective READMEs.

3. Process the raw videos

# Download YouTube videos referenced by each scene's data_info.json
python data_processing/download_videos.py ./svpp

# Extract raw and cropped frames into images/ and crop_images/
python data_processing/extract_images.py ./svpp

# (Optional) Visualize camera poses for one scene with Open3D
python data_processing/view_camera_poses.py ./svpp --scene-name bedroom_100_3o5KSzfdOSE
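
After running the steps above, it can be useful to see which scenes have actually been processed. The sketch below makes two assumptions that the README does not state: that each scene is a directory directly under the dataset root containing `data_info.json`, `images/`, and `crop_images/`, and that scene names follow a `<room>_<index>_<YouTube ID>` pattern, inferred only from the example `bedroom_100_3o5KSzfdOSE`. Both helper functions are hypothetical and not part of `data_processing/`.

```python
from pathlib import Path

YOUTUBE_ID_LEN = 11  # standard length of a YouTube video ID


def parse_scene_name(scene_name):
    """Split '<room>_<index>_<video id>' into (room, index, video_id).

    The ID is taken by length, not by splitting on '_', because
    YouTube IDs may themselves contain '_' or '-'.
    """
    if len(scene_name) <= YOUTUBE_ID_LEN + 1 or scene_name[-YOUTUBE_ID_LEN - 1] != "_":
        raise ValueError(f"unexpected scene name: {scene_name!r}")
    video_id = scene_name[-YOUTUBE_ID_LEN:]
    prefix = scene_name[: -YOUTUBE_ID_LEN - 1]
    room, sep, index = prefix.rpartition("_")
    if not sep or not index.isdigit():
        raise ValueError(f"unexpected scene name: {scene_name!r}")
    return room, int(index), video_id


def scene_status(root):
    """Report, per scene directory, which processing outputs exist."""
    status = {}
    # Assumption: data_info.json sits directly inside each scene directory.
    for info in sorted(Path(root).glob("*/data_info.json")):
        scene = info.parent
        status[scene.name] = {
            "frames": (scene / "images").is_dir(),
            "crops": (scene / "crop_images").is_dir(),
        }
    return status
```

For example, `scene_status("./svpp")` returns a dictionary keyed by scene name, which makes it easy to spot scenes whose video download or frame extraction did not complete.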

4. Train

Each training stack is independent and ships with its own README.md:

PQ3D/README.md — segmentation data generation and two-stage training
SpatialLM/README.md — layout generation, pretraining, fine-tuning, inference, and evaluation

Citation

@inproceedings{chen2026lifting,
  title     = {Lifting Unlabeled Internet-level Data for 3D Scene Understanding},
  author    = {Chen, Yixin and Zhang, Yaowei and Yu, Huangyue and He, Junchao and Wang, Yan and Huang, Jiangyong and Shen, Hongyu and Ni, Junfeng and Wang, Shaofei and Jia, Baoxiong and Zhu, Song-Chun and Huang, Siyuan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgements

This repository builds on:
