Ambient Diffusion Policy

Ambient Diffusion Policy:
Imitation Learning from Suboptimal Data in Robotics

Adam Wei, Nicholas Pfaff, Thomas Cohn, Arif Kerem Dayı, Constantinos Daskalakis, Giannis Daras, Russ Tedrake

TL;DR

Suboptimal data in robotics still contains useful learning signal. Ambient Diffusion Policy is a principled method for extracting meaningful features from these data sources while ignoring harmful features. The algorithm leverages noise-dependent data usage: suboptimal samples can only contribute to training at high and low diffusion times. Its implementation is remarkably simple, requiring only a single change to the Diffusion Policy dataloader.

Detailed walkthrough

1. Start With Robot Data

High-quality, task-specific robot data for imitation learning is scarce. In contrast, suboptimal or out-of-distribution (OOD) action data is abundant: for example, noisy or non-expert teleop, simulation, cross-embodied data, and large collections such as Open X-Embodiment. We first show that Diffusion Policy learns different features of the data at different noise levels. Ambient Diffusion Policy therefore restricts the use of suboptimal data to the noise levels where its features align with those of the high-quality data.

Robot action diffusion exhibits a global-to-local hierarchy.

At high noise, the optimal denoiser learns to recover global task-level structure. At low noise, the optimal denoiser learns to refine local motion-level primitives.

  • (a) Image analogy. This hierarchy is analogous to the coarse-to-fine hierarchy in image diffusion. Both hierarchies are the result of a spectral power law in the data.
  • (b) Maze actions. The Diffusion Policy first plans a path through the maze. Then, it refines local features to ensure the actions are collision-free and smooth.
  • (c) Block sorting. The Diffusion Policy first plans which block to grasp and where to place it. Then, it refines the grasping motion and smoothness.

Thus, Diffusion Policies learn different features of the data at different noise levels. Ambient leverages this to learn only the useful features from suboptimal data.

Global-to-local hierarchy in action diffusion across maze and bin sorting examples.
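
The link between a spectral power law and a global-to-local hierarchy can be illustrated with a small numerical sketch. This is not taken from the paper; it assumes an idealized signal whose power at frequency k is k^(-alpha), plus white Gaussian noise of standard deviation sigma that adds equal power at every frequency. The exponent `alpha`, the frequency range, and the function name are illustrative assumptions.

```python
import numpy as np

def recoverable_frequencies(sigma, alpha=2.0, num_freqs=64):
    """Return the frequency indices whose signal power exceeds the noise power.

    Assumed toy model: power at frequency k is k**(-alpha), and white noise
    contributes sigma**2 at every frequency.
    """
    k = np.arange(1, num_freqs + 1)
    signal_power = k.astype(float) ** (-alpha)
    noise_power = sigma ** 2
    return k[signal_power > noise_power]

# Sweep from high noise to low noise and count the frequencies above the noise floor.
for sigma in (0.3, 0.07, 0.03):
    kept = recoverable_frequencies(sigma)
    print(f"sigma={sigma}: {kept.size} frequencies above the noise floor")
```

Under this toy model, only the lowest frequencies (global structure) stand above the noise floor at high noise, and higher frequencies (local detail) become recoverable as the noise shrinks.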

2. Use the Noise Level as a Filter

Ambient Diffusion Policy lets suboptimal data contribute to training only within two intervals of diffusion time: a low-noise interval [0, tmax) and a high-noise interval (tmin, T]. The boundaries tmax and tmin are hyperparameters.

Ambient data usage rule across diffusion time: target data is always used, while suboptimal data is used at low-noise locality windows and high-noise contraction windows.
Ambient Diffusion Policy trains with high-quality data (𝒟p) everywhere; suboptimal data (𝒟q) is restricted to low or high noise levels, where distributional differences are either ignored due to locality or masked due to noise.

Locality · t ∈ [0, tmax)

At low noise, the optimal denoiser attends only to a local temporal neighborhood around each action and ignores the global task structure. Thus, it is safe to learn motion primitives from suboptimal data at low noise.

The low noise interval is best for suboptimal data with high-quality motions but mismatched task descriptions.

Contraction · t ∈ (tmin, T]

At high noise, high-quality and suboptimal noisy action chunks become indistinguishable. Thus, it is safe to use suboptimal data at high noise.

The high noise interval is best for suboptimal data with low-level motion suboptimality (jitter, non-expert manipulation, etc.).
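
The contraction effect can be made concrete with a toy calculation. This sketch is not from the paper; it assumes the high-quality and suboptimal actions are one-dimensional Gaussians with the same standard deviation and different means, and that the forward process adds independent Gaussian noise (a variance-exploding parameterization). For two equal-variance Gaussians, the KL divergence is the squared mean gap divided by twice the variance, so adding noise shrinks it. The function name and the numbers are illustrative.

```python
def noised_gaussian_kl(mean_gap, data_std, noise_std):
    """KL divergence between two equal-variance 1D Gaussians after both are
    convolved with N(0, noise_std**2).

    Noising leaves the mean gap unchanged and raises the shared variance to
    data_std**2 + noise_std**2.
    """
    shared_var = data_std ** 2 + noise_std ** 2
    return mean_gap ** 2 / (2.0 * shared_var)

# The divergence between the two noised distributions shrinks as noise grows.
for noise_std in (0.0, 1.0, 10.0):
    kl = noised_gaussian_kl(mean_gap=1.0, data_std=0.5, noise_std=noise_std)
    print(f"noise_std={noise_std}: KL={kl:.4f}")
```

In this simplified setting, the two noised distributions become nearly indistinguishable once the noise scale dominates the mean gap, which is the regime where the high-noise interval admits suboptimal data.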

3. Changes to Diffusion Policy

Ambient is a drop-in change to Diffusion Policy. Everything stays standard — only the data sampler changes.

  • Training loop: no changes
  • Architecture: no changes
  • Inference: no changes
  • Data sampler: simple modification

Phase 1 (offline)

Dataset Annotation

Annotate tmin and tmax for each datapoint. This can be done with a sweep, or with a classifier that predicts whether a noised action came from the high-quality data: tmin is the minimum amount of noise required to confuse the classifier. An analogous method gives tmax.
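
The classifier-based rule for tmin can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' annotation code: it assumes the classifier has already been evaluated on a grid of diffusion times, and it treats a sample as "confused" once the classifier's probability for the correct label ("suboptimal") drops to a threshold near chance. The input array, the threshold value, and the function name are hypothetical.

```python
import numpy as np

def annotate_t_min(times, p_suboptimal, threshold=0.55):
    """Per-sample t_min: the smallest grid time at which the classifier's
    confidence in the correct label ("suboptimal") is `threshold` or lower.

    times:         (J,) increasing grid of diffusion times.
    p_suboptimal:  (N, J) classifier probability that sample i, noised to
                   times[j], is suboptimal (hypothetical precomputed input).
    Samples that never confuse the classifier get t_min = times[-1].
    """
    times = np.asarray(times, dtype=float)
    confused = np.asarray(p_suboptimal) <= threshold   # (N, J) booleans
    first_idx = confused.argmax(axis=1)                # index of first True per row
    never = ~confused.any(axis=1)                      # rows with no True at all
    t_min = times[first_idx]                           # fancy indexing returns a copy
    t_min[never] = times[-1]
    return t_min

times = [0.0, 0.25, 0.5, 0.75, 1.0]
p_suboptimal = np.array([
    [0.99, 0.90, 0.60, 0.52, 0.50],   # first confused at t = 0.75
    [0.80, 0.54, 0.51, 0.50, 0.50],   # first confused at t = 0.25
    [0.99, 0.98, 0.97, 0.95, 0.90],   # never confused on this grid
])
print(annotate_t_min(times, p_suboptimal))
```

A sample whose tmin equals the final grid time is never admitted through the high-noise interval, since (tmin, T] is then empty.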

Phase 2 (training-time)

Data Sampling

  1. Sample a batch of diffusion times t(i).
  2. For each diffusion time t(i), draw a training sample that is admissible at that time from the high-quality and suboptimal data. High-quality samples are always admissible; suboptimal samples are admissible only if t(i) ∈ [0, tmax) ∪ (tmin, T].
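
The two sampling steps above can be sketched as follows. This is a minimal illustration, not the authors' dataloader: it assumes continuous diffusion times drawn uniformly from [0, T], per-sample tmax and tmin annotations stored as arrays, and a hypothetical mixing weight `p_clean` between the two data sources. All names are illustrative.

```python
import numpy as np

def sample_ambient_batch(rng, batch_size, T, n_clean, t_max, t_min, p_clean=0.5):
    """Sample (diffusion time, source, index) triples for one training batch.

    n_clean:       number of high-quality samples (always admissible).
    t_max, t_min:  (n_sub,) per-sample annotations for the suboptimal data;
                   sample j is admissible at time t if t < t_max[j] or t > t_min[j].
    p_clean:       hypothetical probability of drawing from the high-quality data.
    """
    t_max, t_min = np.asarray(t_max), np.asarray(t_min)
    batch = []
    for t in rng.uniform(0.0, T, size=batch_size):      # step 1: diffusion times
        # Step 2: find the suboptimal samples admissible at this time.
        admissible = np.flatnonzero((t < t_max) | (t > t_min))
        use_clean = admissible.size == 0 or rng.random() < p_clean
        if use_clean:
            batch.append((float(t), "clean", int(rng.integers(n_clean))))
        else:
            batch.append((float(t), "suboptimal", int(rng.choice(admissible))))
    return batch

rng = np.random.default_rng(0)
t_max_demo = [0.1, 0.0, 0.2]   # illustrative annotations for three suboptimal samples
t_min_demo = [0.9, 0.6, 1.0]
batch = sample_ambient_batch(rng, batch_size=8, T=1.0, n_clean=50,
                             t_max=t_max_demo, t_min=t_min_demo)
for item in batch:
    print(item)
```

Falling back to the high-quality data when no suboptimal sample is admissible is one possible design choice; it keeps the batch size fixed without ever using a suboptimal sample outside its annotated intervals.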

4. Experiments & Results

Generality. Ambient Diffusion Policy outperforms baselines when training on three common types of action suboptimality: noisy demonstrations, sim-to-real gap, and task mismatch (shown in green).

Scale. When trained on Open X-Embodiment—a large dataset with mixed data quality and unstructured distribution shifts—Ambient outperforms the co-training baseline by up to 33% on two real-world tasks (shown in purple). Additionally, Ambient continues to improve as we scale the amount of suboptimal data in the training mixture, whereas co-training plateaus.

Overview of controlled experiments and hardware tasks.
We evaluate Ambient Diffusion Policy on four types of action suboptimality and six different tasks, and scale it to Open X-Embodiment (OXE).

Controlled experiments

Each experiment demonstrates Ambient Diffusion Policy's effectiveness on an isolated distribution shift.

Noisy trajectories

Ambient can learn smooth trajectories from noisy trajectories.

We train a policy using 50 smooth trajectories and 5,000 jittery trajectories from RRT. Co-training yields a policy with a 99.4% success rate but low smoothness. Ambient yields a policy with a 99.5% success rate that is also smooth.

Co-trained and Ambient maze rollouts.

Sim-and-Real Co-training

Ambient learns manipulation strategies from simulation without learning the incorrect contact dynamics.

We train on 50 demos in a target environment and 2,000 demos in a simulated environment. The best co-trained policy achieves a success rate of 84.5%; the best Ambient Diffusion Policy achieves a success rate of 93.5%.

Planar pushing results comparing co-training and Ambient variants.

Task Mismatch

Ambient learns useful motion primitives from demos of the wrong task.

In a block sorting task, we collect 50 demos with the correct sorting logic and 200 demos with incorrect sorting logic. Ambient learns the grasping primitives from the incorrect data without learning the incorrect logic; co-training does not.

Block sorting metrics across Ambient thresholds.

Scaling and Real-World Experiments

Can Ambient use a broader OXE mixture instead of filtering back to the curated one?

Open X-Embodiment scaling

Broader OXE data helps under Ambient, not plain co-training.

On Kuka hardware, Ambient improves table cleaning by up to 15% and builds towers up to 33% taller than co-training.

Magic Soup++ is the curated 27-dataset OXE mix used by OpenVLA. Our Custom OXE keeps that mix and adds 21 compatible datasets, so it is broader and less filtered. We exclude only datasets that were unavailable, bimanual, missing frequency labels, or blocked by loader issues.

Co-training learns both the good and bad parts of suboptimal data, so it plateaus as more suboptimal data is added. Ambient Diffusion Policy keeps improving as more unfiltered data is added to the suboptimal dataset.

Table cleaning and tower building results with Open X-Embodiment data.

Key Takeaways: Stop choosing between filtering and blanket co-training.

High-quality robot data is scarce, and suboptimal data keeps growing alongside it — so learning from suboptimal sources is a fundamental, persistent challenge, not a temporary artifact of today's data scarcity. Yet suboptimal robot data is rarely all good or all bad:

  • Filtering throws away data that may still contain useful motions, strategies, or priors.
  • Co-training uses every source at every noise level, so the policy learns both the meaningful and the harmful parts of suboptimal distributions.
  • Ambient brings noise-dependent data usage to robotics: each source contributes only at the diffusion noise levels where it matches the target task.

With Ambient Diffusion Policy, the question shifts from which data sources to use, to when each should be used in the diffusion process.

Example Hardware Rollouts

Example real-world rollouts from the hardware experiments.

Table cleaning: Ambient

Table cleaning: co-training

Tower building: Ambient

Tower building: co-training

Ablations

Is Ambient sensitive to re-weighting?

Ambient Diffusion Policy is significantly less sensitive to dataset re-weighting than co-training. Without re-weighting, the performance of Ambient decreases by 9%, whereas the co-trained policies were unsafe to evaluate on hardware.

Ambient vs Finetuning?

Finetuning and Ambient can be used together. Finetuning an Ambient Diffusion Policy outperforms finetuning a co-trained base policy. When data is scarce, the Ambient Diffusion Policy without finetuning even outperforms the co-trained policy with finetuning.

Can Ambient Diffusion Policy handle distribution shifts in the observations?

Not explicitly, although the method still works remarkably well. We present ablations in the paper.

See the paper for the full set of ablations.

Abstract

We propose Ambient Diffusion Policy, a simple and principled method for imitation learning from suboptimal data in robotics. High-quality, task-specific robot data is expensive and time-consuming to collect, while suboptimal datasets with lower-quality or out-of-distribution demonstrations are abundant. Existing methods that co-train on both data sources in robotics often fail to separate the meaningful and the harmful features in the suboptimal samples. In contrast, our method extracts only the useful features by introducing a new axis to co-training in robotics: noise-dependent data usage.

Ambient Diffusion Policy restricts the contribution of suboptimal data during training to only the high and low diffusion times. To rigorously justify our approach, we first observe that robot action data exhibits a spectral power law. This induces two important properties on the optimal Diffusion Policy that we exploit: a global-to-local hierarchy and locality. We theoretically formalize this discussion using a simplified model.

Our experiments validate Ambient Diffusion Policy on four types of suboptimal action data (noisy trajectories, sim-to-real gap, task mismatch, and large-scale data mixtures) across six tasks. The results show that it effectively learns from arbitrary sources of suboptimal data. Notably, it outperforms existing co-training baselines by up to 33% when scaled to Open X-Embodiment—a large dataset with heterogeneous data quality and unstructured distribution shifts. Overall, Ambient Diffusion Policy increases the utility of suboptimal demonstrations and expands the set of usable data sources in robotics.

Citation

@misc{wei2026ambientdiffusionpolicyimitation,
      title={Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics},
      author={Adam Wei and Nicholas Pfaff and Thomas Cohn and Arif Kerem Dayı and Constantinos Daskalakis and Giannis Daras and Russ Tedrake},
      year={2026},
      eprint={2606.12365},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.12365},
}