[ECCV2026] CompoSIA: Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation
Wei Yin2, Weiqiang Ren2, Qian Zhang2, Yinqiang Zheng1†
1The University of Tokyo 2Horizon Robotics 3University of Glasgow
*Equal Contribution ‡Project Lead †Corresponding Author
- Paper
- Inference code
- Model weights
- Training code
CompoSIA is a compositional driving video simulator designed for fine-grained adversarial scenario generation through disentangled control of:
- Structure 🚗: object layout and trajectory placement
- Identity 🎨: appearance editing from a single reference image
- Action 🎮: ego-motion and controllable traffic dynamics
- Disentangled structure, identity, and action control
- Pose-agnostic identity injection
- Hierarchical dual-branch action conditioning
- Scenario generation for planner stress testing
Create a Python environment and install the project dependencies:
conda create -n composia python=3.10 -y
conda activate composia
cd CompoSIA
pip install -r requirements.txt
requirements.txt installs PyTorch 2.7.1 with CUDA 12.8 wheels by default. If your CUDA driver stack is different, install the matching PyTorch build first, then install the remaining dependencies.
The default evaluation path does not require the optional metrics packages. Install them only if you enable the corresponding metric:
# Required only when validation_kwargs.eval_metrics contains "met3r"
pip install git+https://github.com/mohammadasim98/met3r
# Required only when validation_kwargs.eval_metrics contains VBench-related evaluation
pip install vbench
CompoSIA uses the public Wan2.1 T2V 1.3B checkpoint as the base model and the released CompoSIA transformer/VAE weights.
Expected layout:
models/
├── Wan2.1-T2V-1.3B/
│   ├── config.json
│   ├── diffusion_pytorch_model.safetensors
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   ├── Wan2.1_VAE.pth
│   └── google/
│       └── umt5-xxl/
│           ├── special_tokens_map.json
│           ├── spiece.model
│           ├── tokenizer.json
│           └── tokenizer_config.json
├── composia/
│   └── composia-transformer.pt
└── vae/
    └── composia-vae.pkl
Download links:
- Base model: Wan-AI/Wan2.1-T2V-1.3B
- CompoSIA weights: SUDOKISUI/CompoSIA
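Once both downloads are in place, the expected layout above can be checked before launching evaluation. The following is a minimal sketch rather than part of the released code: the helper name `missing_model_files` is hypothetical, and the file list assumes the tokenizer files sit under `google/umt5-xxl/` as the tree suggests.

```python
from pathlib import Path

# Relative paths copied from the expected layout above.
# Assumption: the four tokenizer files live under google/umt5-xxl/.
EXPECTED_FILES = [
    "Wan2.1-T2V-1.3B/config.json",
    "Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors",
    "Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
    "Wan2.1-T2V-1.3B/google/umt5-xxl/special_tokens_map.json",
    "Wan2.1-T2V-1.3B/google/umt5-xxl/spiece.model",
    "Wan2.1-T2V-1.3B/google/umt5-xxl/tokenizer.json",
    "Wan2.1-T2V-1.3B/google/umt5-xxl/tokenizer_config.json",
    "composia/composia-transformer.pt",
    "vae/composia-vae.pkl",
]


def missing_model_files(models_root="models"):
    """Return the expected files that are absent under models_root (hypothetical helper)."""
    root = Path(models_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected model files are present.")
```

Running the script from the repository root lists any file that still needs to be downloaded or moved.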
The released metadata files are hosted in SUDOKISUI/CompoSIA:
mkdir -p nuScenes-metadata-full/nuscenes_mmdet3d-12Hz
huggingface-cli download SUDOKISUI/CompoSIA \
nuscenes_interp_12Hz_infos_val_with_bid.pkl \
--local-dir nuScenes-metadata-full/nuscenes_mmdet3d-12Hz
For images, download nuScenes from the official nuScenes website and unpack it so the sample images are available under:
nuScenes/origin/
└── samples/
    └── CAM_FRONT/
        └── ...
The default config reads:
samples_path: "./nuScenes/origin"
ann_path: "./nuScenes-metadata-full/nuscenes_mmdet3d-12Hz/nuscenes_interp_12Hz_infos_val_with_bid.pkl"
If your nuScenes or metadata files are stored elsewhere, update these two paths in config/wan_unified.yaml.
Run the default evaluation script after preparing weights and data:
CUDA_VISIBLE_DEVICES=0 bash run_eval.sh
The script uses the following defaults, each of which can be overridden through an environment variable:
MODEL_NAME=${MODEL_NAME:-models/Wan2.1-T2V-1.3B}
EVAL_CKPT=${EVAL_CKPT:-models/composia/composia-transformer.pt}
VAE_PATH=${VAE_PATH:-models/vae/composia-vae.pkl}
For the Hugging Face release filenames, run:
CUDA_VISIBLE_DEVICES=0 \
MODEL_NAME=models/Wan2.1-T2V-1.3B \
EVAL_CKPT=models/composia/composia-transformer.pt \
VAE_PATH=models/vae/composia-vae.pkl \
bash run_eval.sh
Generated videos and logs are written under logs/test/validation_res_final/.
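The `${VAR:-default}` form in the script falls back to the default when the variable is unset or empty, which is why exporting only one of the three variables is enough to change a single path. The sketch below mirrors that behaviour in Python for illustration; `resolve_eval_paths` is a hypothetical helper, not a function from the repository.

```python
import os

# Defaults copied from the run_eval.sh excerpt above.
DEFAULTS = {
    "MODEL_NAME": "models/Wan2.1-T2V-1.3B",
    "EVAL_CKPT": "models/composia/composia-transformer.pt",
    "VAE_PATH": "models/vae/composia-vae.pkl",
}


def resolve_eval_paths(env=None):
    """Mimic ${VAR:-default}: use the default when VAR is unset or empty."""
    env = os.environ if env is None else env
    # `or` treats an empty string like a missing value, matching the ":-" form.
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in resolve_eval_paths().items():
        print(f"{name}={value}")
```

For example, `resolve_eval_paths({"VAE_PATH": "/data/vae.pkl"})` changes only the VAE path and keeps the other two defaults.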
The evaluation modes are configured in config/composia_unified_i2v_eval.yaml. By default, this file enables several action, bbox, and identity-editing modes. To run a smaller smoke test, reduce validation_kwargs.max_validation_samples or keep only one entry under validation_kwargs.val_modes.
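The smoke-test reduction described above can be sketched on an already-loaded config. This is an illustration under assumptions: it treats `validation_kwargs.val_modes` as a list and `validation_kwargs.max_validation_samples` as an integer, the helper `make_smoke_config` is hypothetical, and the mode names in the example are placeholders rather than the modes shipped in the YAML file.

```python
import copy


def make_smoke_config(config, max_samples=1):
    """Return a copy of config trimmed to one validation mode and few samples."""
    smoke = copy.deepcopy(config)  # leave the caller's config untouched
    val = smoke["validation_kwargs"]
    val["max_validation_samples"] = max_samples
    val["val_modes"] = val["val_modes"][:1]  # assumption: val_modes is a list
    return smoke


# Hypothetical stand-in for the loaded YAML; the mode names are placeholders.
example = {
    "validation_kwargs": {
        "max_validation_samples": 50,
        "val_modes": ["mode_a", "mode_b", "mode_c"],
    }
}
smoke = make_smoke_config(example)
print(smoke["validation_kwargs"])
```

Editing the YAML file by hand achieves the same result; the copy-based helper is only convenient when the config is built or patched programmatically.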
CompoSIA builds on the open-source video generation and autonomous driving research ecosystem. Our base generative model is built upon Wan2.1, and our implementation benefits from the VideoX-Fun codebase.
We also thank NVIDIA Cosmos for inspiring components of our projection pipeline, and the developers of Hugging Face Diffusers, Accelerate, and Transformers for their model and inference tooling.
Our evaluation and data processing are built around the nuScenes dataset. We also acknowledge MEt3R and VBench for open-source video evaluation tools.
If you find our work useful, please cite it as:
@article{zhan2026composing,
title={Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation},
author={Zhan, Yifan and Chen, Zhengqing and Wang, Qingjie and He, Zhuo and Niu, Muyao and Guo, Xiaoyang and Yin, Wei and Ren, Weiqiang and Zhang, Qian and Zheng, Yinqiang},
journal={arXiv preprint arXiv:2603.12864},
year={2026}
}