I am a third-year Ph.D. student in the i-VisionGroup in the Department of Automation, Tsinghua University, advised by Prof. Jiwen Lu. In 2023, I received my B.S. degree from the Department of Automation, Tsinghua University.
I am interested in computer vision and deep learning. My current research focuses on autonomous driving and vision foundation models.
2026-02: One paper on 3D dense reconstruction was accepted to CVPR 2026.
2025-09: One paper on 3D occupancy prediction was accepted to NeurIPS 2025.
2025-06: One paper on embodied 3D occupancy prediction was accepted to ICCV 2025.
2025-02: One paper on 3D occupancy prediction was accepted to CVPR 2025.
2024-07: One paper on image representation learning was accepted to ECCV 2024.
Publications
* Equal contribution. † Project leader.
DVGT: Driving Visual Geometry Transformer
Sicheng Zuo*, Zixun Xie*, Wenzhao Zheng*†, Shaoqing Xu, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.
[arXiv][Code][Project Page]
DVGT is a universal driving geometry model that reconstructs metric-scaled dense 3D point maps directly from unposed multi-view images, significantly outperforming existing SOTA methods and generalizing across diverse camera setups and driving scenarios.
QuadricFormer: Scene as Superquadrics for 3D Semantic Occupancy Prediction
Sicheng Zuo*, Wenzhao Zheng*†, Xiaoyong Han*, Longchao Yang, Yong Pan, Jiwen Lu
The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.
[arXiv][Code][Project Page]
QuadricFormer proposes geometrically expressive superquadrics as scene primitives, enabling efficient and powerful object-centric representation of driving scenes.
GaussianWorld reformulates 3D occupancy prediction as a 4D occupancy forecasting problem conditioned on the current sensor input and proposes a Gaussian World Model to exploit the scene evolution for perception.
EmbodiedOcc formulates an embodied 3D occupancy prediction task and employs a Gaussian-based framework to accomplish it.
SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding
Han Xiao*, Wenzhao Zheng*, Sicheng Zuo, Peng Gao, Jie Zhou, Jiwen Lu
European Conference on Computer Vision (ECCV), 2024.
[Paper]
SpatialFormer proposes an efficient vision transformer architecture with explicit spatial understanding for generalizable image representation learning.
As the first 2D-projection-based method for 3D semantic occupancy prediction, PointOcc outperforms all other methods by a large margin while running much faster.