Danny Driess
169 posts
Danny Driess
@DannyDriess
Research Scientist @physical_int. Formerly Google DeepMind
dannydriess.github.io
Joined August 2021
337 Following
4,025 Followers
  • Pinned
    Danny Driess
    @DannyDriess
    May 28, 2025
    How to build vision-language-action models that train fast, run fast & generalize? In our new paper, we formalize & analyze the approach of our π-0.5 model & further improve it with a single-stage recipe. Blog: pi.website/research/knowl… Paper: pi.website/download/pi05_…
    20K
  • Danny Driess
    @DannyDriess
    Mar 7, 2023
    What happens when we train the largest vision-language model and add in robot experiences? The result is PaLM-E 🌴🤖, a 562-billion parameter, general-purpose, embodied visual-language generalist - across robotics, vision, and language. Website: palm-e.github.io
    1.3M
  • Danny Driess
    @DannyDriess
    Oct 31, 2024
    Look at what our novel Vision-Language-Action Flow Model can do! We @physical_int developed an architecture that bridges the gap between internet-scale pre-training and robot dexterity
    191K
  • Danny Driess
    @DannyDriess
    Mar 7, 2023
    Replying to @DannyDriess
    PaLM-E enables robot planning directly from pixels – all in a single model, trained end-to-end. Here the model is guiding a robot to get a chip bag from a kitchen. Being integrated into the control loop, PaLM-E is robust to disturbances happening during the robot’s journey.
    31K
  • Danny Driess
    @DannyDriess
    Mar 7, 2023
    Replying to @DannyDriess
    Perhaps most exciting about PaLM-E is **positive transfer**: simultaneously training PaLM-E across several domains, including internet-scale general vision-language tasks, leads to significantly higher performance compared to single-task robot models.
    138K
  • Danny Driess
    @DannyDriess
    Jan 16, 2025
    Most large models are built around transformers and tokenizers. We @physical_int figured out how to build a robot action tokenizer that enables training autoregressive models on dexterous robot tasks and is 5x more efficient! pi.website/research/fast
    11K
  • Danny Driess
    @DannyDriess
    Mar 7, 2023
    Replying to @DannyDriess
    We observe a notable trend with model scale: the larger the language model, the more it maintains its language capabilities when training on visual-language and robotics tasks – quantitatively, the 562B PaLM-E model retains nearly all of its language capabilities.
    28K
  • Danny Driess
    @DannyDriess
    Mar 7, 2023
    Replying to @DannyDriess
    PaLM-E is the largest VLM reported to date. We observe emergent capabilities like multimodal chain-of-thought reasoning and multi-image inference, despite the model being trained on only single-image prompts. Though not the focus of our work, PaLM-E sets a new SOTA on the OK-VQA benchmark.
    12K
  • Danny Driess
    @DannyDriess
    Mar 7, 2023
    Replying to @DannyDriess
    In a different domain, here the **same** exact PaLM-E model is controlling a robot in real-time. This robot recently required human assistance to guide it through very long-horizon tasks (interactive-language.github.io), but now PaLM-E can learn these tasks autonomously.
    13K
  • Danny Driess
    @DannyDriess
    Jul 28, 2023
    Vision-Language-Action models are out! We introduce RT-2, where we train (large) vision-language models on both robot actions and internet-scale general vision-language tasks simultaneously! deepmind.com/blog/rt-2-new-…
    9.8K
  • Danny Driess
    @DannyDriess
    Mar 7, 2023
    Replying to @DannyDriess
    And that’s PaLM-E: a one-model generalist across robotics, language, and vision-language.
    11K
  • Danny Driess
    @DannyDriess
    Mar 9, 2022
    Excited to share our preprint "Learning Multi-Object Dynamics with Compositional Neural Radiance Fields" Paper: arxiv.org/abs/2202.11855 Videos: dannydriess.github.io/compnerfdyn/ Amazing collaboration between @DannyDriess, @YunzhuLiYZ, @huang_zhiao, Russ Tedrake, Marc Toussaint. (1/7)
  • Danny Driess
    @DannyDriess
    Mar 7, 2023
    Replying to @DannyDriess
    The inputs to PaLM-E are multimodal sentences that interleave text, images, states, or other continuous encodings. These multimodal sentences are passed as inputs to an LLM for next-token prediction, trained end-to-end. PaLM-E-562B combines PaLM-540B and ViT-22B.
    11K
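    The idea of a "multimodal sentence" in the post above can be illustrated with a toy sketch. This is not PaLM-E's implementation: the names `TextSegment`, `ContinuousSegment`, `project`, and `build_multimodal_sentence`, the tiny dimensions, and the single linear projection are all assumptions made for illustration. It only shows text-token embeddings and projected continuous features being interleaved into one sequence that a language model could consume.

```python
from dataclasses import dataclass
from typing import List, Union

Vector = List[float]

EMBED_DIM = 4    # hypothetical LLM embedding width
FEATURE_DIM = 3  # hypothetical encoder output width (image patch, robot state, ...)


@dataclass
class TextSegment:
    token_ids: List[int]


@dataclass
class ContinuousSegment:
    features: List[Vector]  # one FEATURE_DIM vector per image patch or state


def project(feature: Vector, weight: List[Vector]) -> Vector:
    """Linear map from encoder space into the LLM embedding space (assumed)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]


def build_multimodal_sentence(
    segments: List[Union[TextSegment, ContinuousSegment]],
    table: List[Vector],
    weight: List[Vector],
) -> List[Vector]:
    """Interleave token embeddings and projected continuous features, in order."""
    sequence: List[Vector] = []
    for seg in segments:
        if isinstance(seg, TextSegment):
            sequence.extend(table[t] for t in seg.token_ids)
        else:
            sequence.extend(project(f, weight) for f in seg.features)
    return sequence


# Toy embedding table (5 tokens) and projection matrix (EMBED_DIM x FEATURE_DIM).
table = [[float(i + j) for j in range(EMBED_DIM)] for i in range(5)]
weight = [[0.1 * (r + 1)] * FEATURE_DIM for r in range(EMBED_DIM)]

segments = [
    TextSegment([0, 1]),                                     # text before the image
    ContinuousSegment([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]),   # two "image" vectors
    TextSegment([4]),                                        # text after the image
]
sentence = build_multimodal_sentence(segments, table, weight)
print(len(sentence), len(sentence[0]))
```

    In a real system the resulting sequence would be fed to the language model's transformer layers for next-token prediction; that part is omitted here.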
  • Danny Driess
    @DannyDriess
    Sep 12, 2022
    The recordings of our RSS 2022 Workshop on Implicit Representations for Robotic Manipulation are now online! Playlist: youtube.com/playlist?list=… Web: imrss2022.github.io Thanks again to our speakers @DanicaKragic @vincesitzmann @andyzengtweets @IMordatch @animesh_garg
    youtube.com
    IMRSS 2022 Talks
