Xiaoming Zhao's photo

I am a Research Scientist in the Foundation Model (AFM) team at Apple, working on post-training and RL.

I earned my Ph.D. from the Department of Computer Science at the University of Illinois Urbana–Champaign (UIUC) under the guidance of Prof. Alexander Schwing. Earlier, I received a B.S. degree in Statistics from the University of Science and Technology of China (USTC).

I am passionate about multimodal intelligence, computer vision, and generative models, with the broader goal of unifying perception, generation, and reasoning in visual and multimodal systems.

During my graduate studies, I interned at Apple, Meta Reality Labs, and Google, conducting research related to the above topics.


Email    /    Google Scholar    /    GitHub    /    CV

Products

Image
Third Generation of Apple’s Foundation Models (AFM 3).
Core contributor to post-training efforts, with a primary focus on on-policy distillation.
Worldwide Developers Conference (WWDC), 2026   

We introduce our 3rd-generation foundation models, a full family that spans efficient on-device models to powerful server-scale models.

Publications

Image
Velox: Learning Representations of 4D Geometry and Appearance.
Anagh Malik, Dorian Chan, Xiaoming Zhao, David B. Lindell, Oncel Tuzel, and Jen-Hao Rick Chang
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026   
[Paper] [Website] [bibtex]   

We introduce a framework for learning latent representations of 4D objects which are descriptive, compressive, and accessible (requiring minimal input).
Image
LiTo: Surface Light Field Tokenization.
Jen-Hao Rick Chang*, Xiaoming Zhao*, Dorian Chan, and Oncel Tuzel
(* denotes equal contribution)
International Conference on Learning Representations (ICLR), 2026   
[Paper] [Code] [Website] [bibtex]   
  
We propose a 3D latent representation that jointly models object geometry and view-dependent appearance, enabling high-quality image-to-3D generation.
Image
Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective.
Xiaoming Zhao and Alexander G. Schwing
AAAI Conference on Artificial Intelligence (AAAI), 2026   
[Paper] [bibtex]   

We find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from the data distribution's decision boundaries.
Image
3D Shape Tokenization via Latent Flow Matching.
Jen-Hao Rick Chang, Yuyang Wang, Miguel Angel Bautista, Jiatao Gu, Xiaoming Zhao, Joshua M. Susskind, and Oncel Tuzel
arXiv, 2025   
[Paper] [Website] [bibtex]   

We show, for the first time, that latent 3D representations learned from modeling 3D surface probability densities can scale and perform competitively.
Image
IllumiNeRF: 3D Relighting Without Inverse Rendering.
Xiaoming Zhao, Pratul P. Srinivasan, Dor Verbin, Keunhong Park, Ricardo Martin Brualla, and Philipp Henzler
Neural Information Processing Systems (NeurIPS), 2024   
[Paper] [Results] [Website] [Leaderboard] [bibtex]   
  
IllumiNeRF provides a simpler approach than traditional inverse rendering for 3D relighting: distilling samples from a single-image relighting diffusion model into a latent-variable NeRF.
Image
GoMAvatar: Efficient Animatable Human Modeling From Monocular Video Using Gaussians-on-Mesh.
Jing Wen, Xiaoming Zhao, Zhongzheng Ren, Alexander G. Schwing, and Shenlong Wang
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024   
[Paper] [Code] [Website] [bibtex]   

GoMAvatar introduces the Gaussians-on-Mesh (GoM) representation for real-time, memory-efficient, and high-quality animatable human modeling.
Image
NeRFDeformer: NeRF Transformation From a Single View via 3D Scene Flows.
Zhenggang Tang, Zhongzheng Ren, Xiaoming Zhao, Bowen Wen, Jonathan Tremblay, Stan Birchfield, and Alexander G. Schwing
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024   
[Paper] [Code] [Website] [bibtex]   

NeRFDeformer automatically modifies a NeRF representation based on a single RGB-D observation of a non-rigidly transformed version of the original scene.
Image
Pseudo-Generalized Dynamic View Synthesis From a Video.
Xiaoming Zhao, Alex Colburn, Fangchang Ma, Miguel Angel Bautista, Joshua M. Susskind, and Alexander G. Schwing
International Conference on Learning Representations (ICLR), 2024   
[Paper] [Code] [Website] [bibtex]   

PGDVS provides an analysis framework for generalized dynamic view synthesis and finds that, with consistent depth estimates, scene-specific appearance optimization is NOT required.
Image
Occupancy Planes for Single-View RGB-D Human Reconstruction.
Xiaoming Zhao, Yuan-Ting Hu, Zhongzheng Ren, and Alexander G. Schwing
AAAI Conference on Artificial Intelligence (AAAI), 2023   
[Paper] [Code] [bibtex]   

OPlanes provides more flexibility than voxel grids and leverages correlations better than per-point classification.
Image
Generative Multiplane Images: Making a 2D GAN 3D-Aware.
Xiaoming Zhao, Fangchang Ma, David Güera, Zhile Ren, Alexander G. Schwing, and Alex Colburn
European Conference on Computer Vision (ECCV), 2022 (Oral Presentation)   
[Paper] [Code] [Website] [bibtex]   
  
GMPI is guaranteed to be view-consistent and enables fast training (less than half a day at a resolution of 1024²) and high FPS during inference.
Image
Initialization and Alignment for Adversarial Texture Optimization.
Xiaoming Zhao, Zhizhen Zhao, and Alexander G. Schwing
European Conference on Computer Vision (ECCV), 2022   
[Paper] [Code] [Website] [bibtex]   

Carefully designed initialization and alignment procedures make it possible to benefit from both classical and recent learning-based texture optimization techniques.
Image
Class-agnostic Reconstruction of Dynamic Objects From Videos.
Zhongzheng Ren*, Xiaoming Zhao*, and Alexander G. Schwing
(* denotes equal contribution)
Neural Information Processing Systems (NeurIPS), 2021   
[Paper] [Website] [bibtex]   

REDO enables class-agnostic geometry reconstruction for dynamic objects from RGB-D videos.
Image
The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation.
Xiaoming Zhao, Harsh Agrawal, Dhruv Batra, and Alexander G. Schwing
International Conference on Computer Vision (ICCV), 2021   
[Paper] [Code] [Website] [bibtex]   

A well-trained visual odometry module can be a drop-in replacement for the GPS and compass sensors in PointGoal navigation.
Image
Mitigating Data Scarcity in Protein Binding Prediction Using Meta-Learning.
Yunan Luo*, Jianzhu Ma*, Xiaoming Zhao, Yufeng Su, Yang Liu, Trey Ideker, and Jian Peng
(* denotes equal contribution)
Research in Computational Molecular Biology (RECOMB), 2019   
[Paper] [bibtex]   

Meta-learning and few-shot learning strategies can mitigate data scarcity when characterizing the specificity of less-studied kinases for protein-peptide binding prediction.
Image
Integrating Thermodynamic and Sequence Contexts Improves Protein-RNA Binding Prediction.
Yufeng Su, Yunan Luo, Xiaoming Zhao, Yang Liu, and Jian Peng
PLOS Computational Biology, 2019   
[Paper] [Code] [bibtex]   

A deep learning-based thermodynamic model is introduced for protein-RNA binding prediction.
Image
Stochastic Variance Reduction for Deep Q-Learning.
Wei-Ye Zhao, Xi-Ya Guan, Yang Liu, Xiaoming Zhao, and Jian Peng
arXiv, 2019   
[Paper] [bibtex]

Slides

Image
Harnessing Data Priors to Mitigate 3D Data Scarcity.   
The slides are almost the same as those for my job talk, Harnessing "Dark" Data.   

2024/10: PhD Thesis Defense