Ph.D. Candidate · Texas A&M University

Ujwal Dinesha

I work on reinforcement learning — from post-training large language models to learning policies for restless bandits and robots. Advised by Dr. Srinivas Shakkottai.

Austin, TX · ujwald36@tamu.edu

About

I’m a fourth-year Ph.D. candidate in Computer Engineering at Texas A&M, where I think a lot about how learning agents make decisions under uncertainty. Lately that’s meant chasing two threads in parallel: aligning large language models cheaply at inference time, and learning index policies for restless bandits with only preference feedback.

Before A&M, I picked up an M.S. in Electrical Engineering at Columbia University and a B.E. at the National Institute of Engineering, Mysuru. I’ve also done research internships at Typeface AI, InterDigital, and Roche.

Python · PyTorch · Hugging Face · vLLM · TorchRL · RLHF · Preference Optimization · LLM Post-Training · Inference-Time Alignment · Bandits / RMABs · Multi-Agent RL · Distributed GPU / HPC · Slurm · W&B · LaTeX

Research

Selected publications, each with a short writeup.

arXiv · 2025

PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training

S. C. Bobbili, U. Dinesha, D. Narasimha, S. Shakkottai

TL;DR. Most alignment pipelines stack two fragile pieces — fit a reward model on noisy preference data, then fine-tune the LLM against it. PITA throws both out. We learn a small guidance policy that nudges the base model’s next-token distribution at inference time, directly from preferences, with zero LLM fine-tuning.

The trick is to frame alignment as identifying a latent preference distribution and to solve that problem with stochastic search. The guidance model produces exponentially weighted Q-values that reshape the LLM’s logits on the fly. The result is a much cheaper alignment recipe that sidesteps reward-model instability.
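
A minimal sketch of the logit-reshaping idea, not the PITA implementation. It assumes the guidance model returns one Q-value per candidate token and that the base distribution is reweighted by exp(Q / beta) and renormalized. The names `guided_next_token_probs`, `q_values`, and `beta` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_next_token_probs(base_logits, q_values, beta=1.0):
    """Reweight the frozen model's next-token distribution by exp(q / beta).

    Multiplying each base probability by exp(q / beta) and renormalizing
    is the same as adding q / beta to the logits before the softmax.
    `beta` is an assumed temperature: smaller values trust the guidance more.
    """
    if len(base_logits) != len(q_values):
        raise ValueError("need one Q-value per candidate token")
    shifted = [logit + q / beta for logit, q in zip(base_logits, q_values)]
    return softmax(shifted)


# Toy vocabulary of three tokens. The base model prefers token 0,
# while the (hypothetical) guidance model prefers token 1.
base_logits = [2.0, 1.0, 0.0]
q_values = [0.0, 2.0, 0.0]
probs = guided_next_token_probs(base_logits, q_values, beta=1.0)
```

The base logits are never modified in place, which mirrors the point that the LLM itself stays frozen and only the sampling distribution changes.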

We test PITA on math reasoning, TL;DR summarization, and sentiment control. The base LLM stays frozen the whole time — you can swap it out, swap the guidance policy in, and keep going.

ICLR · 2025

DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback

G. Xiong, U. Dinesha, D. Mukherjee, J. Li, S. Shakkottai

TL;DR. Restless multi-armed bandits (RMABs) usually assume the decision-maker observes a clean scalar reward when they pull an arm. In the real world (healthcare, online ads, app engagement) you almost never get that. You only get which option the user preferred. DOPL is the first online algorithm for RMABs in this setting with a provable regret bound.

We introduce Pref-RMAB, where the planner sees only pairwise preference feedback between activated arms. DOPL interleaves exploration of the unknown transition kernels with preference-data collection, then plans directly against preferences instead of trying to reconstruct rewards. The analysis lands at Õ(√(T · ln T)) regret.
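
To make "pairwise preference feedback" concrete, here is a small sketch of how such feedback could be simulated. It assumes a Bradley-Terry preference model, which is a common choice but is an assumption here, not something stated above. The latent rewards and function names are hypothetical.

```python
import math
import random


def preference_prob(reward_a, reward_b):
    """Bradley-Terry probability that arm A is preferred over arm B.

    Assumed model: P(A > B) = sigmoid(reward_a - reward_b). The planner
    never sees the rewards themselves, only samples drawn from this.
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def sample_preference(reward_a, reward_b, rng):
    """Return True if A wins one simulated pairwise comparison."""
    return rng.random() < preference_prob(reward_a, reward_b)


# Estimate the win rate of arm A from comparisons alone.
rng = random.Random(0)
num_comparisons = 1000
wins = sum(sample_preference(2.0, 0.0, rng) for _ in range(num_comparisons))
win_rate = wins / num_comparisons
```

The empirical win rate is the kind of quantity a preference-based planner can work with directly, without first reconstructing a scalar reward.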

Practically, we show it works on three benchmark domains: a CPAP adherence model, the ARMMAN maternal-health setup, and an app-marketing engagement environment. The codebase is built around a config-driven runner, so it’s easy to plug in your own RMAB environment.

NeurIPS · 2024

Risk-Averse Finetuning of Large Language Models

S. Chaudhary, U. Dinesha, D. Kalathil, S. Shakkottai

TL;DR. Standard RLHF maximizes expected reward, so it’s perfectly happy with a model that’s great on average but occasionally spits out something toxic. RA-RLHF swaps the objective for Conditional Value at Risk — the average reward of the worst-α tail — which directly punishes those rare bad generations.

The fix is surprisingly clean: at each PPO step, we sort the batch by reward and weight the tail more aggressively. No new architecture, no new reward model, drop-in over TRL.
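
For intuition, the CVaR objective can be estimated on a batch as the mean of the lowest-reward fraction. This is a minimal sketch of that empirical estimate only. It is not the RA-RLHF training loop and does not touch TRL; the function name and the toy rewards are made up for illustration.

```python
import math


def empirical_cvar(rewards, alpha):
    """Mean reward of the worst alpha-fraction of a batch.

    alpha = 1.0 recovers the ordinary mean. Smaller alpha focuses on the
    rare bad generations. At least one sample is always kept.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    k = max(1, math.ceil(alpha * len(rewards)))
    tail = sorted(rewards)[:k]  # ascending order, so these are the worst k
    return sum(tail) / k


# Toy batch: mostly good generations with two bad outliers.
batch_rewards = [0.9, 0.8, -2.0, 0.7, 0.95, -0.5, 0.85, 0.9, 0.6, 0.8]
mean_reward = sum(batch_rewards) / len(batch_rewards)
cvar_20 = empirical_cvar(batch_rewards, alpha=0.2)
```

On this toy batch the mean looks healthy while the worst-20% average is strongly negative, which is exactly the gap an expected-reward objective ignores.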

On IMDB sentiment-control and Jigsaw toxicity-mitigation benchmarks, RA-RLHF cuts the rate of toxic / off-target outputs without giving up on the generation quality you’d get from vanilla RLHF.

INFOCOM · 2024

A Multi-Agent View of Wireless Video Streaming with Delayed Client-Feedback

N. Khan, U. Dinesha, S. Arunachalam, D. Narasimha, V. Subramanian, S. Shakkottai

TL;DR. A base station is streaming video to N devices over a shared wireless link with a tight energy budget. Each device’s buffer / channel state arrives back at the BS with a deterministic delay. That delay is the whole problem — the controller is acting on stale information.

We cast this as a cooperative multi-agent constrained POMDP and use strong duality to split it into N independent transmitter-receiver pairs that all share a single Lagrange multiplier (the energy price). Each pair is then solved using the common-information approach with approximate information states.
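
The role of the shared multiplier can be illustrated with a toy dual-ascent loop. This is a sketch under strong simplifying assumptions (static one-shot choices, no partial observability, no delay), not the method in the paper. Each pair picks its own best option given the current energy price, and the price rises while total energy exceeds the budget. All names and numbers are hypothetical.

```python
def best_response(options, price):
    """Pick the (utility, energy) option maximizing utility - price * energy."""
    return max(options, key=lambda opt: opt[0] - price * opt[1])


def dual_ascent(pairs, budget, step=0.1, iters=200):
    """Projected subgradient ascent on a single shared energy price.

    Given the price, every pair solves its own small problem independently.
    With discrete options this can oscillate in general; the toy instance
    below is chosen so that it settles.
    """
    price = 0.0
    choices = []
    for _ in range(iters):
        choices = [best_response(options, price) for options in pairs]
        total_energy = sum(energy for _, energy in choices)
        # Raise the price when over budget, lower it (down to 0) when under.
        price = max(0.0, price + step * (total_energy - budget))
    return price, choices


# Two transmitter-receiver pairs, each with (utility, energy) options.
pairs = [
    [(0.0, 0.0), (1.0, 1.0), (1.5, 2.0)],
    [(0.0, 0.0), (2.0, 1.0), (2.5, 2.0)],
]
budget = 2.0
price, choices = dual_ascent(pairs, budget)
```

The only coupling between the pairs is the scalar `price`, which is what makes the decomposed problem tractable.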

The neural architecture that drops out of this analysis is delay-aware by construction and beats transmitter-only baselines in simulation. It’s a nice example of how a careful POMDP decomposition can give you tractable learning where the naive joint problem would be hopeless.

NSDI · 2024

EdgeRIC: Empowering Real-time Intelligent Optimization and Control in NextG Cellular Networks

W. H. Ko, U. Ghosh, U. Dinesha, R. Wu, S. Shakkottai, D. Bharadia

TL;DR. Open RAN’s near-real-time RIC sits in the cloud and takes > 15 ms to decide anything. That’s far too slow for per-TTI scheduling in a real cell. EdgeRIC pushes the RIC down to the edge, co-located with the DU, and gets the decision loop under 1 ms.

The system exposes a real-time E2 interface (we call it RT-E2) built on ZMQ + protobuf, with TTI-level synchronization and per-UE KPI reports. Above that we run μApps — small policies, including a PPO-trained MAC scheduler — that can react fast enough to actually matter.

End-to-end, the AI-driven scheduler beats classical heuristics (max-CQI, proportional fairness, round-robin) by 5–25% on throughput and downstream application metrics, all on real srsRAN hardware.
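
For readers unfamiliar with the classical baselines, here is a simplified sketch of the three heuristics. It assumes a single UE is scheduled per TTI, whereas real MAC schedulers split resource blocks across UEs. This is not EdgeRIC or srsRAN code, and the function names are illustrative.

```python
def max_cqi(cqi):
    """Schedule the UE reporting the best channel quality this TTI."""
    return max(range(len(cqi)), key=lambda ue: cqi[ue])


def proportional_fair(inst_rate, avg_throughput, eps=1e-9):
    """Schedule the UE with the largest achievable-rate to served-rate ratio.

    `eps` avoids division by zero for UEs that have not been served yet.
    """
    return max(
        range(len(inst_rate)),
        key=lambda ue: inst_rate[ue] / (avg_throughput[ue] + eps),
    )


def round_robin(tti, num_ues):
    """Cycle through UEs in order, ignoring channel state entirely."""
    return tti % num_ues


# Toy example with three UEs.
chosen_by_cqi = max_cqi([7, 12, 9])
chosen_by_pf = proportional_fair([10.0, 8.0, 6.0], [10.0, 2.0, 6.0])
chosen_by_rr = round_robin(tti=5, num_ues=3)
```

Max-CQI chases throughput, round-robin ignores the channel, and proportional fairness sits between the two, which is why a learned policy has room to improve on all three.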

Patents

Variable-Length CSI Feedback via Encoder Compression and Output Masks · WO2025014678A1
Intelligent Control for Cellular Radio Access Networks · US20240267794A1
Reinforcement Learning-based Rate Control for End-to-End Neural Network based Video Compression · WO2024064329A1
Video Compression for Both Machine and Human Consumption Using a Hybrid Framework · WO2024049627A1
Temporal Attention-Based Neural Networks for Video Compression · WO2023122077A1

Experience

Applied Scientist Intern · Typeface AI

Palo Alto, CA · Jun 2025 – Aug 2025
- Adapted a latent-preference inference approach from academic research to Typeface’s marketing-content use cases — personalizing future LLM generations from users’ past edits, without retraining the base model.
- Built synthetic datasets and ran offline experiments to validate the approach and pin down parameter choices.
- Shipped the personalization logic into the production codebase and deployed it to beta behind a feature flag, awaiting A/B testing.

Research Intern, AI Lab · InterDigital

Los Altos, CA · Mar–Aug 2021, May–Aug 2022
- Worked on deep-learning-based video compression for live streaming.
- Improved inter-frame prediction with convolutional-recurrent blocks and attention mechanisms.
- Built an RL rate-controller for end-to-end neural video compression (now a patent).

Deep Learning Intern · Roche

Little Falls, NJ · May 2020 – Aug 2020
- Applied weakly supervised deep learning to gigapixel histopathology images for cancer detection (ResNet34 encoder + RNN aggregator).
- Hit 99.3% AUC on an in-house kidney-cancer dataset; presented to the pRED Innovation Center data-science team in New York.

Contact

The fastest way to reach me is email. I’m always happy to chat about RL, LLM post-training, bandits, or robotics.