Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow

KAIST

TL;DR - Drifting Field Policy (DFP) is a novel one-step generative policy that avoids the trajectory-level credit assignment of diffusion policies in RL fine-tuning.

Drifting Field Policy teaser: particle x is transported across a Q-value landscape (blue low, red high) toward the top-K high-value actions sampled from the old policy.

DFP transports the current policy samples $\pi_\theta$ across the $Q$-value landscape (blue: low, red: high) toward the top-$K$ high-value actions sampled from $\pi_{\mathrm{old}}$, following the drifting field $\mathbf{V} = \mathbf{V}_p^{+} - \mathbf{V}_q^{-}$ — attraction toward high-$Q$ actions, repulsion from current samples.

Output-to-Trajectory Credit Assignment in Diffusion/Flow Policies

A generative action policy must capture (1) multimodal actions, (2) run in real time, and (3) improve through RL. Few-step diffusion/flow policies largely solve the first two; the remaining tension is RL post-training. Given a current state $s$, they generate an action $a$ by integrating a time-indexed velocity field along an ODE trajectory:

$$ a = x_1 = x_0 + \int_0^1 v_\theta(x_t, t, s)\,dt, \qquad x_0 \sim \mathcal{N}(0, I). $$

In RL fine-tuning, the reward signal $Q_\phi(s, a)$ is defined on the final executed action $a = x_1$, yet the policy is parameterized by the intermediate velocities $v_\theta(x_t, t, s)$ along the whole trajectory. This raises a hard output-to-trajectory credit assignment problem:

"Which intermediate velocity should receive how much credit?"

Crucially, this burden persists even for one-step diffusion variants (MeanFlow, Consistency Models): sampling is one-step, but training remains trajectory-level, since all intermediate shortcuts must align with the same endpoint.

Why a Drifting Policy for RL Fine-Tuning?

DFP and few-step diffusion/flow policies both map noise to an action in a single pass, but parameterize that map differently. Two structural consequences make the non-ODE drifting policy especially suited to RL fine-tuning:

  • Direct action updates vs. velocity re-fitting. DFP supervises the action output directly, so a reward signal shifts the policy in one step. ODE policies — even one-step variants — still learn a time-indexed velocity field, so improving the policy means globally re-fitting the fields across the whole ODE trajectory, under a self consistency constraint and trajectory-level credit assignment.
  • Built-in repulsion from current samples. Each update of DFP not only pulls actions toward high-value targets, but also pushes them away from the current policy's own (typically lower-value) samples — a mechanism absent in diffusion/flow policies.

Drifting Field Policy

Built on drifting models (Deng et al., 2026), Drifting Field Policy (DFP) directly parameterizes the action distribution with a single-pass pushforward map $f_\theta$, with no time variable and no ODE trajectory.

A. One-Step Pushforward Policy

$$ a = f_\theta(\epsilon, s), \quad \epsilon \sim p_\epsilon, \qquad \pi_\theta(\cdot \mid s) = [f_\theta(\cdot, s)]_\# \, p_\epsilon. $$

A single forward pass maps a noise sample directly to an action. Compared to ODE-based generation, this gives (i) one-step action generation and (ii) no ODE trajectory → no output-to-trajectory credit assignment.

B. Drifting Field for Distribution Matching

Let $p$ be a target distribution and $q := [f_\theta]_\# p_\epsilon$ the pushforward distribution. The drifting field moves generated particles $x$ toward positives $p$ and away from negatives $q$, built from kernel mean shifts:

$$ \mathbf{V}_{p,q}(x) = \mathbf{V}_p^{+}(x) - \mathbf{V}_q^{-}(x) = \frac{\mathbb{E}_{y^{+}\sim p}\!\left[k(x,y^{+})(y^{+}-x)\right]} {\mathbb{E}_{y^{+}\sim p}\!\left[k(x,y^{+})\right]} - \frac{\mathbb{E}_{y^{-}\sim q}\!\left[k(x,y^{-})(y^{-}-x)\right]} {\mathbb{E}_{y^{-}\sim q}\!\left[k(x,y^{-})\right]}. $$

Here $k(\cdot,\cdot)$ is a similarity kernel, e.g., the Gaussian kernel $k(x,y) = \exp\!\left(-\|x-y\|^2 / 2h^2\right)$.

$f_\theta$ is trained by fixed-point regression that drives the drifting field to zero (with a stop-gradient on the drifted target), so that $q \to p$. This gives the general drifting training objective, defined for any target/source pair $(p, q)$:

$$ \mathcal{L}_{\mathrm{drift}}(\theta; p, q) = \mathbb{E}_{\epsilon\sim p_\epsilon} \big\| x - \mathrm{sg}\!\left(x + \mathbf{V}_{p,q}(x)\right) \big\|^2, \quad x = f_\theta(\epsilon). $$

Since $\mathcal{L}_{\mathrm{drift}}$ is parameterized by the positive (target) $p$ and the negative (source) $q$, the same objective applies to any choice of $(p, q)$ — a flexibility we exploit next by plugging in different targets for policy improvement and behavior cloning.

C. RL Fine-Tuning as Distribution Matching

RL fine-tuning targets the reward-tilted policy

$$ \pi^{+}(\cdot \mid s) = \arg\max_{\pi}\; \mathbb{E}_{a\sim\pi(\cdot\mid s)}\!\left[Q_\phi(s,a)\right] - \alpha\, D_{\mathrm{KL}}\!\left(\pi(\cdot\mid s)\,\|\,\pi_{\mathrm{old}}(\cdot\mid s)\right) = \frac{\pi_{\mathrm{old}}(a\mid s)\exp\!\left(Q_\phi(s,a)/\alpha\right)}{Z(s)}. $$

Policy improvement is then just the general drift objective $\mathcal{L}_{\mathrm{drift}}$ from (B), instantiated with the negative source $q = \pi_\theta$ and the positive target $p = \pi^{+}$ (whereas $p = p_{\mathrm{data}}$ recovers behavior cloning):

$$ \mathcal{L}_{\mathrm{PI}}(\theta) = \mathcal{L}_{\mathrm{drift}}(\theta; \pi^{+}, \pi_\theta) = \mathbb{E}_{s,\epsilon} \big\| \hat a - \mathrm{sg}\!\left(\hat a + \mathbf{V}_{\pi^{+},\pi_\theta}(\hat a)\right) \big\|^2, \quad \hat a = f_\theta(\epsilon, s). $$
DFP directly updates the generated actions toward high-value regions, without output-to-trajectory credit assignment.

D. Interpretation of the DFP Update

$$ \mathbf{V}_{\pi^{+},\pi_\theta}(a\mid s) = h^2\!\left(\nabla_a \log \pi^{+}_{\mathrm{kde}}(a\mid s) - \nabla_a \log \pi_{\theta,\mathrm{kde}}(a\mid s)\right) \simeq \underbrace{\tfrac{h^2}{\alpha}\nabla_a Q_\phi(s,a)}_{\nabla_a Q\ \text{ascent}} + \underbrace{h^2\left(\nabla_a \log \pi_{\mathrm{old}} - \nabla_a \log \pi_\theta\right)}_{\text{trust region around } \pi_{\mathrm{old}}}. $$

This decomposition follows in two steps:

  1. Drifting field = KDE-approximated score matching. The kernel mean shift equals the difference of KDE-smoothed scores, $\mathbf{V}_{p,q} \simeq h^2(\nabla_a\log p_{\mathrm{kde}} - \nabla_a\log q_{\mathrm{kde}})$ (Cao et al., 2026; Cheng, 1995).
  2. Substitute the score of $\pi^{+}$. Plugging $\nabla_a\log\pi^{+} = \nabla_a\log\pi_{\mathrm{old}} + \tfrac{1}{\alpha}\nabla_a Q_\phi$ yields (1) a $\nabla_a Q$ ascent and (2) a trust region around $\pi_{\mathrm{old}}$ via score matching.

E. Tractable Top-$K$ Surrogate

The ideal target $\pi^{+}$ is intractable due to the normalizing constant $Z(s)$. DFP approximates it with an empirical top-$K$ set of high-value actions sampled from $\pi_{\mathrm{old}}$:

$$ P_K(s) := \mathrm{TopK}_{j}\, Q_\phi\!\left(s, a^{(j)}\right), \qquad a^{(1)}, \dots, a^{(N)} \sim \pi_{\mathrm{old}}(\cdot \mid s), $$ $$ \mathcal{L}_{\text{top-}K}(\theta; P_K, \pi_\theta) = \mathbb{E}_{s,\epsilon} \big\| \hat a - \mathrm{sg}\!\left(\hat a + \mathbf{V}_{P_K,\pi_\theta}(\hat a)\right) \big\|^2. $$

Practically, each update proceeds in four steps:

  1. Sample $N$ candidates from $\pi_{\mathrm{old}}(a\mid s)$
  2. Score them with the critic $Q_\phi$
  3. Select the top-$K$ high-value actions
  4. Update $\pi_\theta$ toward the top-$K$ region

In practice, this top-$K$ loss is what we optimize — effectively behavior cloning on the top-$K$ critic-selected actions, with a bounded approximation error to the ideal update $\mathcal{L}_{\mathrm{PI}}$. (See the paper for more details.)

Results

Across 12 Robomimic and OGBench manipulation tasks (offline-to-online RL), DFP outperforms both multi-step and one-step diffusion/flow policies, reaching the best average success rate with single-pass inference (+15.5 pp over its closest baseline, MVP).

Main Results

Success rate (%) on Robomimic and OGBench tasks under the offline-to-online RL setting. Each cell reports the mean over 5 seeds. Best results are shown in bold; second-best results are underlined.

Method Robomimic Cube-double Cube-triple Cube-quadruple-100m Avg.
lift square can task2 task3 task4 task2 task3 task4 task2 task3 task4
BFN 97.6±2 32.8±8 82.0±2 86.0±5 88.8±5 27.2±8 7.6±9 6.8±3 0.0±0 32.4±21 0.0±0 0.0±0 38.4
QC-BFN 99.6±1 88.4±4 90.6±3 99.8±0 99.8±0 92.6±6 87.4±10 80.8±4 33.4±9 95.8±2 63.2±10 74.2±11 83.8
FQL 96.8±2 10.8±7 58.4±8 93.2±8 91.2±5 6.0±6 0.4±1 6.4±8 0.0±0 0.0±0 0.0±0 0.0±0 30.3
QC-FQL 100.0±0 72.0±9 94.4±2 100.0±0 99.8±0 99.8±0 88.2±2 60.4±12 51.4±24 98.0±2 85.0±7 92.2±7 86.8
MVP 99.8±0 79.4±4 83.6±5 98.4±1 98.6±1 94.8±4 86.2±4 57.2±10 31.0±20 96.6±2 47.2±30 91.2±2 80.3
DFP (Ours) 100.0±0 93.2±2 90.6±3 100.0±0 99.6±1 99.6±1 98.4±1 91.6±2 81.2±6 99.6±1 96.6±2 99.0±2 95.8

Training Curves

Gray and white background indicate offline and online phases, respectively. DFP not only attains a higher final success rate but also converges faster.

Full offline and online training curves

Top-$K$-of-$N$ Ablation

With $N = 16$, DFP is robust across $K$, with the best performance at $K = 4$. All DFP variants outperform MVP (80.3 avg.).

$K$ (of $N=16$) Robo. Cube-2. Cube-3. Cube-4. Avg.
$K = 1$ 95.099.754.594.2 85.8
$K = 2$ 94.6100.076.296.8 91.9
$K = 4$ 93.999.790.498.4 95.6
$K = 8$ 88.699.885.596.2 92.5

Qualitative Rollouts

Representative DFP rollouts on Robomimic and OGBench manipulation tasks.

Robomimic Square Pick-and-place manipulation
Cube-double Two-cube rearrangement
Cube-triple Three-cube rearrangement
Cube-quadruple Long-horizon four-cube rearrangement

BibTeX

@article{koo2026driftingfieldpolicy,
  title={Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow},
  author={Koo, Juil and Park, Mingue and Choi, Jiwon and Min, Yunhong and Sung, Minhyuk},
  journal={arXiv preprint arXiv:2605.07727},
  year={2026}
}