Dilin Wang

Dilin is currently Senior Staff at Meta Reality Labs on the 3D Gen AI team. His work focuses on generative AI for 3D content creation, building models and systems that turn images and language into deployable 3D assets, scenes, and interactive worlds.

His research often spans the end-to-end path from data licensing and processing to model architecture, representation, evaluation, and the engineering needed to make these models useful in interactive, agentic systems.

More broadly, he is interested in spatial intelligence: 3D generation, reconstruction, perception, understanding, and reasoning. His work often connects visual inputs, geometry, language, and real deployment constraints.

Earlier, Dilin completed his PhD in computer science at UT Austin, advised by Qiang Liu, with research spanning variational inference and generative modeling.

Recent Work

AssetGen

AssetGen converts visual intent into production-ready 3D assets: mesh, baked normals, color texture, and controlled polygon count. The system is built for settings where generated assets need to be usable immediately in games, simulations, and interactive 3D environments. In third-party blind evaluations, it reaches competitive quality against leading commercial systems and runs in 30 seconds instead of the several minutes common for baselines; a Flash variant supports sub-15-second previews. It also outperforms open-source models such as SAM 3D and Trellis 2. Technical details are in the paper.

WorldGen

WorldGen generates explicit 3D scenes from text: navigable, render-ready, editable, and naturally suited for multiplayer interactive experiences. It combines LLM-driven layout reasoning, procedural generation, diffusion-based 3D generation, and object-aware scene decomposition. The result is a functional 3D environment that can be explored, edited, rendered, and used by agents or players. See the paper for the research version.

Quest Mixed Reality

Before moving to 3D generation, Dilin worked on power-efficient perception for Quest and AR, including ML depth for passthrough, a foundational capability for enabling mixed reality. The work focused on helping hardware understand 3D depth under tight latency, memory, and power constraints.

News and Recent Papers

VLM-3R augments vision-language models with instruction-aligned 3D reconstruction for spatial reasoning from monocular video.

For the complete and most up-to-date publication list, please see Google Scholar.

Contact

Please contact me via email at dilinwang@utexas.edu.