TL;DR. Most alignment pipelines stack two fragile pieces: fitting a reward model on noisy preference data, then fine-tuning the LLM against it. PITA drops both. We learn a small guidance policy that nudges the base model’s next-token distribution at inference time, directly from preferences, with zero LLM fine-tuning.
The trick is to frame alignment as identifying a latent preference distribution and to solve that problem with stochastic search. The guidance model produces exponentially weighted Q-values that reshape the LLM’s logits on the fly. The result: a cheaper alignment recipe, since no reward model is trained and the LLM is never updated, and one that sidesteps reward-model instability.
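To make the logit reshaping concrete, here is a minimal sketch. It assumes the guidance model returns one Q-value per candidate token and that the guided distribution is proportional to the base probability times `exp(Q / beta)`, which is the same as adding `Q / beta` to the base logits. The function names, the `beta` temperature, and the toy numbers are all hypothetical, chosen for illustration rather than taken from PITA's actual implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_distribution(base_logits, q_values, beta=1.0):
    """Reweight the frozen model's next-token distribution.

    Assumption (not from the source): p_guided(t) is proportional to
    p_base(t) * exp(q(t) / beta), i.e. Q / beta is added to each logit.
    """
    if len(base_logits) != len(q_values):
        raise ValueError("base_logits and q_values must have the same length")
    if beta <= 0:
        raise ValueError("beta must be positive")
    shifted = [l + q / beta for l, q in zip(base_logits, q_values)]
    return softmax(shifted)


# Toy vocabulary of three tokens; the base model prefers token 0,
# while the (hypothetical) guidance model scores token 2 highest.
base_logits = [2.0, 1.0, 0.0]
q_values = [0.0, 0.0, 3.0]

base_probs = softmax(base_logits)
guided_probs = guided_distribution(base_logits, q_values, beta=1.0)
unguided_probs = guided_distribution(base_logits, [0.0, 0.0, 0.0])
```

In this sketch the base model's weights are never touched: only the per-step logits change. A larger `beta` keeps the guided distribution closer to the base model, and all-zero Q-values recover the base distribution exactly.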
We test PITA on math reasoning, TL;DR summarization, and sentiment control. The base LLM stays frozen the whole time; you can swap it out, swap the guidance policy in, and keep going.