Danny Driess (@DannyDriess) / X

169 posts

@DannyDriess

Research Scientist @physical_int. Formerly Google DeepMind

dannydriess.github.io

Joined August 2021

Pinned
Danny Driess
@DannyDriess
May 28, 2025
How to build vision-language-action models that train fast, run fast & generalize? In our new paper, we formalize & analyze the approach of our π-0.5 model & further improve it with a single-stage recipe. Blog: pi.website/research/knowl… Paper: pi.website/download/pi05_…
Danny Driess
@DannyDriess
Mar 7, 2023
What happens when we train the largest vision-language model and add in robot experiences? The result is PaLM-E 🌴🤖, a 562-billion parameter, general-purpose, embodied visual-language generalist - across robotics, vision, and language. Website: palm-e.github.io
Danny Driess
@DannyDriess
Oct 31, 2024
Look at what our novel Vision-Language-Action Flow Model can do! We @physical_int developed an architecture that bridges the gap between internet-scale pre-training and robot dexterity
Danny Driess
@DannyDriess
Mar 7, 2023
Replying to @DannyDriess
PaLM-E enables robot planning directly from pixels – all in a single model, trained end-to-end. Here the model is guiding a robot to get a chip bag from a kitchen. Being integrated into the control loop, PaLM-E is robust to disturbances happening during the robot’s journey.
Danny Driess
@DannyDriess
Mar 7, 2023
Replying to @DannyDriess
Perhaps most exciting about PaLM-E is **positive transfer**: simultaneously training PaLM-E across several domains, including internet-scale general vision-language tasks, leads to significantly higher performance compared to single-task robot models.
Danny Driess
@DannyDriess
Jan 16, 2025
Most large models are built around transformers and tokenizers. We @physical_int came up with a robot action tokenizer that makes it possible to train autoregressive models on dexterous robot tasks and is 5x more efficient! pi.website/research/fast
Danny Driess
@DannyDriess
Mar 7, 2023
Replying to @DannyDriess
We observe a notable trend with model scale: the larger the language model, the more it maintains its language capabilities when training on visual-language and robotics tasks – quantitatively, the 562B PaLM-E model retains nearly all of its language capabilities.
Danny Driess
@DannyDriess
Mar 7, 2023
Replying to @DannyDriess
PaLM-E is the largest VLM reported to date. We observe emergent capabilities like multimodal chain-of-thought reasoning and multi-image inference, despite being trained on only single-image prompts. Though not the focus of our work, PaLM-E sets a new SOTA on the OK-VQA benchmark.
Danny Driess
@DannyDriess
Mar 7, 2023
Replying to @DannyDriess
In a different domain, here the **same** exact PaLM-E model is controlling a robot in real-time. This robot recently required human assistance to guide it through very long-horizon tasks (interactive-language.github.io), but now PaLM-E can learn these tasks autonomously.
Danny Driess
@DannyDriess
Jul 28, 2023
Vision-Language-Action models are out! We introduce RT-2, where we train (large) vision-language models on both robot actions and internet-scale general vision-language tasks simultaneously! deepmind.com/blog/rt-2-new-…
Danny Driess
@DannyDriess
Mar 7, 2023
Replying to @DannyDriess
And that’s PaLM-E: a one-model generalist across robotics, language, and vision-language.
Danny Driess
@DannyDriess
Mar 9, 2022
Excited to share our preprint "Learning Multi-Object Dynamics with Compositional Neural Radiance Fields" Paper: arxiv.org/abs/2202.11855 Videos: dannydriess.github.io/compnerfdyn/ Amazing collaboration between @DannyDriess, @YunzhuLiYZ, @huang_zhiao, Russ Tedrake, Marc Toussaint. (1/7)
Danny Driess
@DannyDriess
Mar 7, 2023
Replying to @DannyDriess
The inputs to PaLM-E are multimodal sentences that interleave text, images, states, or other continuous encodings. These multimodal sentences are passed as inputs to an LLM for next token prediction, trained end-to-end. PaLM-E-562B combines PaLM-540B and ViT-22B.
Danny Driess
@DannyDriess
Sep 12, 2022
The recordings of our RSS 2022 Workshop on Implicit Representations for Robotic Manipulation are now online! Playlist: youtube.com/playlist?list=… Web: imrss2022.github.io Thanks again to our speakers @DanicaKragic @vincesitzmann @andyzengtweets @IMordatch @animesh_garg