Himangi Mittal

I am a Ph.D. student in the Robotics Institute (RI) at Carnegie Mellon University (CMU), working with Prof. Shubham Tulsiani. My research focuses on building physics-grounded AI agents, video generation via preference-based RL optimization, and differentiable simulation. I am currently collaborating with Microsoft on Computer-Use-Agents (CUA), GUI grounding, Coding Agents, and 4D asset generation research.

I graduated with a Master of Science in Robotics (MSR) from the Robotics Institute at Carnegie Mellon University, where I worked with Prof. Abhinav Gupta and collaborated with Prof. Pedro Morgado at UW-Madison. Before my Master's, I worked as a Research Assistant at CMU with Prof. David Held at the R-Pad Lab, in collaboration with the Pittsburgh-based autonomous driving company Argo AI.

During my Master's at CMU, I worked on self-supervised representation learning methods for multimodal audio-visual videos. As an RA at CMU, I worked on self-supervised algorithms for real-world 3D LiDAR point clouds.

I have served on the organizing committees of WiCV@CVPR 2025, WiCV@CVPR 2024, and the DEI Social Event@CVPR 2024.

Email / CV / Google Scholar / Twitter / GitHub / LinkedIn

News

June 2025: Member of the organizing committee at WiCV@CVPR 2025.
February 2025: Paper accepted at CVPR 2025.
June 2024: Member of the organizing committee at WiCV@CVPR 2024, DEI Social Event, and Challenges/Opportunities for ECRs in Fast Paced AI Social Event!
February 2024: Paper accepted at CVPR 2024.
August 2023: Started my Ph.D. in the Robotics Institute (RI) at Carnegie Mellon University (CMU).
May 2023: Started research internship at Honda Research Institute (HRI), San Jose, California.

Research

See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso, Yu Hu
Arxiv / Code

We study pixel-precise GUI grounding for Computer Use Agents in dense coding interfaces. Instead of single-shot coordinate prediction, our agent iteratively refines cursor localization using visual feedback from previous attempts, enabling self-correction of displacement errors. Evaluated across GPT-5.4, Claude, and Qwen, multi-turn refinement significantly outperforms single-shot models in click precision and task success.

Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation
Su Sun, Cheng Zhao, Himangi Mittal, Gaurav Mittal, Rohith Kukkala, Yingjie Victor Chen, Mei Chen
[CVPR 2026]
Arxiv

Track4DGen is a two-stage framework that integrates foundation point-tracker motion priors into multi-view video diffusion and 4D Gaussian Splatting reconstruction. By enforcing dense feature-level point correspondences during generation and augmenting reconstruction with tracker-derived motion encoding and 4D Spherical Harmonics, it produces temporally stable, text-editable 4D assets surpassing existing baselines.

UniPhy: Learning a Unified Constitutive Model for Inverse Physics Simulation
Himangi Mittal, Peiye Zhuang, Hsin-Ying Lee, Shubham Tulsiani
[CVPR 2025]
Paper / Arxiv / Webpage / Code

UniPhy is a unified latent-conditioned neural model that learns a common latent space encoding the properties of diverse materials. At inference, given motion observations of a system with unknown material parameters, UniPhy infers the material by optimizing the latent through differentiable simulation.

Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
Himangi Mittal, Nakul Agarwal, Shao-Yuan Lo, Kwonjoon Lee
[CVPR 2024]
Paper / Arxiv

We leverage a large video-language model to anticipate action sequences that are plausible in the real world. We build an understanding of the plausibility of an action sequence into the model by introducing two objective functions: a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss.

Learning State-Aware Visual Representations from Audible Interactions
Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta
[NeurIPS 2022]
ECCV 2022 Workshop on Visual Object-oriented Learning meets Interaction (VOLI): Discovery, Representations, and Applications
Sight and Sound Workshop (CVPR 2023)
Paper / Arxiv / Code / Video

We propose a self-supervised algorithm to learn representations from untrimmed, egocentric videos containing audible interactions. Our method uses the audio signals in two unique ways: (1) to identify moments in time that are conducive to better self-supervised learning and (2) to learn representations that focus on the visual state changes caused by audible interactions.

Self-Supervised Point Cloud Completion via Inpainting
Himangi Mittal, Brian Okorn, Arpit Jangid, David Held
[BMVC 2021 - Oral (Selection rate 3.3%)]
Paper / Arxiv / Code / Conference Presentation / Webpage

A self-supervised method for completing partial point clouds in real-world settings such as LiDAR, where ground-truth complete point cloud annotations are unavailable. We achieve this via inpainting: a region of the point cloud is removed and the network is trained to complete the removed region.
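
As a rough illustration of the inpainting setup described above, the sketch below removes a region from a point cloud and keeps the removed points as the training target. It is not the paper's implementation: the spherical region, the `remove_region` helper, and the random toy cloud are all assumptions made for this example.

```python
import random


def remove_region(points, center, radius):
    """Split a point cloud into kept points and points removed inside a sphere.

    The spherical region is an assumption for illustration only.
    """
    kept, removed = [], []
    for p in points:
        squared_dist = sum((a - b) ** 2 for a, b in zip(p, center))
        if squared_dist <= radius ** 2:
            removed.append(p)
        else:
            kept.append(p)
    return kept, removed


# Toy partial point cloud standing in for a LiDAR scan (hypothetical data).
random.seed(0)
cloud = [
    (random.uniform(-1, 1), random.uniform(-1, 1), random.uniform(-1, 1))
    for _ in range(200)
]

# The network would see `network_input` and be trained to fill in
# `removed_region`, so the scan itself provides the supervision.
network_input, removed_region = remove_region(cloud, center=cloud[0], radius=0.5)
```

Because the removed points come from the observed scan, no complete ground-truth shape is needed to form a training pair.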

Just Go with the Flow: Self-Supervised Scene Flow Estimation
Himangi Mittal, Brian Okorn, David Held
[CVPR 2020 - Oral (Selection rate 5.7%)]
RSS 2020 Workshop on Self-Supervised Robot Learning
Paper / Arxiv / Code / Media article 1 / Media article 2 / Project Page / Video / Short Paper

A method for training scene flow estimation using two self-supervised losses, based on nearest neighbors and cycle consistency. These losses allow us to train our method on large unlabeled autonomous driving datasets.
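
The two losses named above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the plain mean-distance formulation, and the hand-written forward and reverse flows are all hypothetical, and a real system would predict the flows with a network.

```python
import math


def apply_flow(points, flow):
    """Translate each point by its per-point flow vector."""
    return [tuple(c + d for c, d in zip(p, f)) for p, f in zip(points, flow)]


def nearest_neighbor_loss(warped, target):
    """Mean distance from each warped point to its closest point in the next frame."""
    return sum(min(math.dist(p, q) for q in target) for p in warped) / len(warped)


def cycle_consistency_loss(original, returned):
    """Mean distance between each point and where it lands after forward then reverse flow."""
    return sum(math.dist(a, b) for a, b in zip(original, returned)) / len(original)


# Toy scene: frame t+1 is frame t shifted by +1 along x (hypothetical data).
frame_t = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
frame_t1 = [(x + 1.0, y, z) for x, y, z in frame_t]

# Hand-written flows standing in for network predictions.
forward_flow = [(1.0, 0.0, 0.0)] * len(frame_t)
reverse_flow = [(-1.0, 0.0, 0.0)] * len(frame_t)

warped = apply_flow(frame_t, forward_flow)
returned = apply_flow(warped, reverse_flow)

nn_loss = nearest_neighbor_loss(warped, frame_t1)
cycle_loss = cycle_consistency_loss(frame_t, returned)
```

Neither loss needs ground-truth flow labels: the first only compares against the next observed frame, and the second only compares the points against themselves after a round trip.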

Interpreting Context of Images using Scene Graphs
Himangi Mittal, Ajith Abraham, Anuja Arora
[International Conference on Big Data Analytics (BDA), 2019]
Paper / ArXiv / Code

Predicts action and spatial relationships between objects detected by YOLO in images, combining VGG-Net-based visual features with Word2Vec-based semantic features.

Anomaly Detection using Graph Neural Networks
Anshika Chaudhary, Himangi Mittal, Anuja Arora
[International Conference on Machine Learning, Big Data, Cloud and Parallel Computing, 2019]
Paper / Code

A method to capture anomalous behavior in a social network based on the degree, betweenness, and closeness of graph nodes, using Graph Neural Networks (GNN) in Keras.

STWalk: Learning Trajectory Representations in Temporal Graphs
Supriya Pandhre, Himangi Mittal, Manish Gupta, Vineeth N. Balasubramanian
[ACM India Joint International Conference on Data Science and Management of Data (CoDS-COMAD), 2018]
Paper / ArXiv / Code

Presents trajectory analysis of spatio-temporal graph nodes using the DeepWalk algorithm in NetworkX (Python), for classification and for detecting changing points of interest using SVMs.

Harnessing emotions for depression detection
Sahana Prabhu Muraleedhara, Himangi Mittal, Rajesh Varagani, Sweccha Jha, Shivendra Singh
[Pattern Analysis and Applications Journal]
Paper

A method for multi-modal depression detection from audio, video, and text using LSTMs. This work leverages emotions to detect early indications of depression.

Academic Service/Volunteer Work

Reviewer Service: ICCV 2021, AAAI 2022, WACV 2022, CVPR 2022, CVPR 2023 (+ Emergency reviewer), ICCV 2023, NeurIPS 2023, Pattern Recognition Journal, WACV 2024 (+ Emergency reviewer), ACCV 2024, CVPR 2024, ICLR 2024, ICML 2024, WACV 2025, CVPR 2025, ICLR 2025, ICML 2025, BMVC 2025, NeurIPS 2025, ICLR 2026, CVPR 2026.
Workshop Service: Member of the organizing committee at WiCV@CVPR 2025, WiCV@CVPR 2024, DEI Social Event, and Challenges/Opportunities for ECRs in Fast Paced AI Social Event!
Meta Reviewer Service: WiCV@CVPR 2025, WiCV@CVPR 2024.
Teaching Assistant: 16-720A: Computer Vision (Fall 2025), 16-824: Visual Learning and Recognition (Spring 2024), 16-825: Learning for 3D Vision (Spring 2023).
Mentor at CMU AI Undergraduate Mentoring Program (Fall 2022, Spring 2023, Fall 2023, Spring 2024, Fall 2024, Spring 2025).
Mentor at Spring 2023 CMU Research Mixer for undergraduate students organized by DPAC Undergraduate Research Working Group.
Volunteer at NeurIPS 2022 High School Outreach Program.

Teaching

Teaching Assistant for 16-720A: Computer Vision (Fall 2025)
Teaching Assistant for 16-824: Visual Learning and Recognition (Spring 2024)
Teaching Assistant for 16-825: Learning for 3D Vision (Spring 2023)
