I am a Ph.D. student in the Robotics Institute (RI) at Carnegie Mellon University (CMU), working with Prof. Shubham Tulsiani.
My research focuses on building physics-grounded AI agents, video generation via preference-based RL optimization, and differentiable simulation.
I am currently collaborating with Microsoft on research in Computer-Use Agents (CUA), GUI grounding, coding agents, and 4D asset generation.
I graduated with a Master of Science in Robotics (MSR) from the Robotics Institute at Carnegie Mellon University, where I worked with Prof. Abhinav Gupta and collaborated with Prof. Pedro Morgado at UW-Madison.
Before my Master's, I worked as a Research Assistant at CMU with Prof. David Held at the R-Pad Lab, in collaboration with Argo AI, a Pittsburgh-based autonomous driving company.
During my Master's at CMU, I worked on self-supervised representation learning for multimodal audio-visual videos. As a Research Assistant at CMU, I worked on self-supervised algorithms for real-world 3D LiDAR point clouds.
We study pixel-precise GUI grounding for Computer Use Agents in dense coding interfaces. Instead of single-shot coordinate prediction, our agent iteratively refines cursor localization using visual feedback from previous attempts, enabling self-correction of displacement errors. Evaluated across GPT-5.4, Claude, and Qwen, multi-turn refinement significantly outperforms single-shot models in click precision and task success.
Track4DGen is a two-stage framework that integrates foundation point-tracker motion priors into multi-view video diffusion and 4D Gaussian Splatting reconstruction. By enforcing dense feature-level point correspondences during generation and augmenting reconstruction with tracker-derived motion encoding and 4D Spherical Harmonics, it produces temporally stable, text-editable 4D assets surpassing existing baselines.
UniPhy is a unified latent-conditioned neural model that learns a common latent space encoding the properties of diverse materials. At inference, given motion observations of a system with unknown material parameters, UniPhy infers the material by optimizing the latent code through differentiable simulation.
We leverage a large video-language model to anticipate action sequences that are plausible in the real world. To teach the model what makes an action sequence plausible, we introduce two objective functions: a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss.
We propose a self-supervised algorithm to learn representations from untrimmed, egocentric videos containing audible interactions.
Our method uses the audio signals in two unique ways: (1) to identify moments in time that are conducive to better self-supervised learning
and (2) to learn representations that focus on the visual state changes caused by audible interactions.
A self-supervised method for completing partial point clouds in real-world settings such as LiDAR, where ground-truth complete point cloud annotations are unavailable.
We achieve this via inpainting: a region of the point cloud is removed, and the network is trained to complete the removed region.
A method for training scene flow estimation using two self-supervised losses, based on nearest neighbors and cycle consistency.
These self-supervised losses allow us to train our method on large unlabeled autonomous driving datasets.
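As a rough illustration of the two losses, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the squared-distance formulation, and the assumption that the backward flow is given per source point are all choices made for this example. In practice a network would predict the flows and the losses would be computed on batched tensors.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _add(a: Point, b: Point) -> Point:
    """Element-wise sum of two 3D vectors."""
    return tuple(ai + bi for ai, bi in zip(a, b))


def _sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_neighbor_loss(
    source: Sequence[Point], flow: Sequence[Point], target: Sequence[Point]
) -> float:
    """Mean squared distance from each warped source point to its
    nearest neighbor in the target cloud (no correspondences needed)."""
    warped = [_add(p, f) for p, f in zip(source, flow)]
    return sum(min(_sq_dist(w, q) for q in target) for w in warped) / len(warped)


def cycle_consistency_loss(
    source: Sequence[Point],
    forward_flow: Sequence[Point],
    backward_flow: Sequence[Point],
) -> float:
    """Mean squared distance between each source point and where it lands
    after applying the forward flow and then the backward flow."""
    total = 0.0
    for p, f, b in zip(source, forward_flow, backward_flow):
        returned = _add(_add(p, f), b)
        total += _sq_dist(returned, p)
    return total / len(source)


# Toy example (hypothetical values): two points moving along the x-axis.
source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)]
forward = [(0.5, 0.0, 0.0), (0.5, 0.0, 0.0)]
backward = [(-0.5, 0.0, 0.0), (-0.4, 0.0, 0.0)]

nn_loss = nearest_neighbor_loss(source, forward, target)
cycle_loss = cycle_consistency_loss(source, forward, backward)
```

Neither loss requires ground-truth flow labels, which is why this kind of objective can be applied to unlabeled data.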
Predicts action and spatial relationships between objects detected by YOLO in images by combining VGG-Net-based visual features with
Word2Vec-based semantic features.
A method for detecting anomalous behavior in a social network based on the degree, betweenness, and closeness of graph nodes, using
Graph Neural Networks (GNNs) in Keras.
Trajectory analysis of spatio-temporal graph nodes using the DeepWalk algorithm with NetworkX (Python), for classification and for detecting
changing points of interest using SVMs.
A method for multimodal depression detection from audio, video, and text using LSTMs. This work leverages emotions to detect early indications of
depression.