Beidi Chen (@BeidiChen) / X

Beidi Chen

@BeidiChen

Asst. Prof @CarnegieMellon, @amazon Scholar, Prev: Visiting Researcher @Meta, Postdoc @Stanford, Ph.D. @RiceUniversity, Large-Scale ML, a fan of Dota2.

infini-ai-lab.cmu.edu

Joined November 2011

Beidi Chen
@BeidiChen
Jul 17, 2022
Excited to share some life updates 🥳📢: I'll be starting as an Assistant Professor @CarnegieMellon @CMU_ECE in Fall 2023. Until then, I'll be a visiting researcher at @Meta @metaai. I'm heading to #ICML2022 tmr!!! DM if you want to catch up 😃☕️🍱...
Beidi Chen
@BeidiChen
Nov 7, 2024
🥳We're recruiting PhD students at CMU for Fall 2025! If you are interested in machine-learning algorithms and systems (🔑Keywords: new model arch, LLM reasoning, long-context modeling, efficiency, etc.), please mention my name in your application~ 👇 Application links: (Dec 15)
ece.cmu.edu
Admissions - Electrical and Computer Engineering - College of Engineering - Carnegie Mellon...
Review the requirements for admission into Carnegie Mellon’s Department of Electrical and Computer Engineering. We encourage prospective students to visit campus and tour the department.
Beidi Chen
@BeidiChen
Nov 27, 2024
Replying to @bubbleboi and @beidi
Not Llama, 30K commits for developing MagicPig for Llama to counter NVIDIA's monopoly 😉 @chenzhuoming911 : github.com/Infini-AI-Lab/…
Beidi Chen
@BeidiChen
Nov 27, 2024
🫢 oops someone discovered our secret summer proj to counter NVIDIA's monopoly @chenzhuoming911 github.com/Infini-AI-Lab/…
bubble boi
@bubbleboi
Nov 27, 2024
Most insane github I've ever seen in my life lol.
Beidi Chen
@BeidiChen
Mar 13, 2024
📢 Announcing our new speculative decoding framework Sequoia ❗️❗️❗️ It can now serve Llama2-70B on one RTX4090 with half-second/token latency (exact❗️no approximation) 🤔Sounds slow as a sloth 🦥🦥🦥??? Fun fact😛: DeepSpeed -> 5.3s / token; 8 x A100: 25ms / token (costs 8 x
Beidi Chen
@BeidiChen
Dec 10, 2021
Can sparse training achieve wall-clock time speed up on GPU? Yes! Simple and static #sparsity -> 2.5x faster🚀 training MLP-Mixer, ViT, and GPT-2 medium from scratch with NO drop in accuracy. arxiv.org/abs/2110.15343 (#NeurIPS2021) arxiv.org/abs/2112.00029 [1/6]
Beidi Chen
@BeidiChen
Feb 14, 2025
⏰📢After years of working on long-context efficiency, I’ve started to doubt if it’s truly necessary (Many of you have probably noticed the decline of interest in long llms). Despite strong models like Gemini, short-context + retrieval often do the trick—faster, cheaper, and
Infini-AI-Lab
@InfiniAILab
Feb 14, 2025
🚀 RAG vs. Long-Context LLMs: The Real Battle ⚔️ 🤯Turns out, simple-to-build RAG can match million-dollar long-context LLMs (LC LLMs) on most existing benchmarks. 🤡So, do we even need long-context models? YES. Because today’s benchmarks are flawed: ⛳ Too Simple –
Beidi Chen
@BeidiChen
Apr 24, 2024
❓Wanna host a Llama2-7B-128K (14GB weight + 64GB KV cache) at home🤔 📢 Introducing TriForce! 🚀Lossless Ultra-Fast Long Seq Generation — training-free Spec Dec! 🌟 🔥 TriForce serves with 0.1s/token on 2 RTX4090s + CPU – only 2x slower than on an A100 (~55ms on chip), 8x faster
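The "14GB weight + 64GB KV cache" figures in the post above can be reproduced with a little arithmetic. The sketch below is only an illustration: the model shape (7e9 parameters, 32 layers, 32 KV heads, head dimension 128), fp16 storage, a batch of one, and 24 GB per RTX 4090 are assumptions that do not appear in the post, and the function name is hypothetical.

```python
GIB = 2**30

def memory_footprint(n_params, n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Return (weight_bytes, kv_cache_bytes) for a single sequence."""
    weights = n_params * bytes_per_elem
    # Each token stores one key and one value vector per layer and per KV head.
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return weights, kv_per_token * seq_len

# Assumed Llama2-7B shape in fp16, with a 128K-token (131072) context.
weights, kv = memory_footprint(
    n_params=7_000_000_000, n_layers=32, n_kv_heads=32, head_dim=128, seq_len=128 * 1024
)
gpu_memory = 2 * 24 * GIB                      # two RTX 4090s, assumed 24 GiB each
needs_offload = weights + kv > gpu_memory      # total exceeds what the two GPUs hold

print(f"weights: {weights / 1e9:.0f} GB")      # prints "weights: 14 GB"
print(f"KV cache: {kv / GIB:.0f} GiB")         # prints "KV cache: 64 GiB"
print(f"needs offload: {needs_offload}")       # prints "needs offload: True"
```

Under these assumptions the KV cache alone is larger than the combined memory of both GPUs, which is consistent with the post's mention of "2 RTX4090s + CPU".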
Beidi Chen
@BeidiChen
Dec 10, 2022
📢My group at @CMU_ECE is looking for Ph.D. students in #Algorithms #MLSys (ddl Dec 15)! Let’s shed new light on classical algorithms, make ML more accessible to the general community, and advance interdisciplinary research (science?!) together! 🙏Plz help spread the word.
Beidi Chen
@BeidiChen
Jul 10, 2025
I was asked many times lately what repo to use by students who’re working on test-time scaling with slightly modified attention or generation workflow (customized reward model / search). HF is a bit too time-consuming, esp. with tons of token generation, and Sglang/vllm is a bit hard
Infini-AI-Lab
@InfiniAILab
Jul 10, 2025
🧵 Glad to introduce LiteSys the inference framework we used in📄 Kinetics: Rethinking Test-Time Scaling Laws (arxiv.org/abs/2506.05333) to evaluate test-time scaling (32K+ generated tokens) at scale. If you are: ✅ Looking for an inference framework that's easy to extend. 🐢
Beidi Chen
@BeidiChen
Oct 7, 2025
📢🔥 New off-policy RL for LLMs — now training 32B model with 200+ stale steps for the first time, while still matching on-policy accuracy 💪 A big step toward scalable & decentralized agent training 😉
Infini-AI-Lab
@InfiniAILab
Oct 7, 2025
🤔Can we train RL on LLMs with extremely stale data? 🚀Our latest study says YES! Stale data can be as informative as on-policy data, unlocking more scalable, efficient asynchronous RL for LLMs. We introduce M2PO, an off-policy RL algorithm that keeps training stable and
Beidi Chen
@BeidiChen
Jul 29, 2023
Did you know the KV cache can easily take 160GB on Llama2-70B (e.g., 8K seqlen + batch size 64), even though it has multi-group Attn? Come and see our preliminary work on how to use a super simple cache eviction policy to reduce this bottleneck! There’re huge opportunities in this space 🫵🏻
Zhenyu (Allen) Zhang
@KyriectionZhang
Jul 29, 2023
We will present H2O tomorrow in the poster session of ES-FoMo Workshop #ICML2023 at 1:00 p.m. - 2:00 p.m. (Sat. 29 July). Please join us and chat!
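The 160GB figure quoted for Llama2-70B can be checked with a short calculation. This is a rough sketch rather than anything from the posts: the model shape (80 layers, 8 KV heads, head dimension 128) and fp16 storage are assumptions, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer, KV head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = one K and one V
    return per_token * seq_len * batch_size

# Assumed Llama2-70B shape in fp16; 8K seqlen and batch size 64 come from the post.
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=8192, batch_size=64)
print(f"{size / 2**30:.0f} GiB")  # prints "160 GiB"
```

With these numbers the result is exactly 160 GiB. If the model cached all 64 query heads instead of 8 grouped KV heads, the same formula would give eight times as much, which is why the post notes that the cache is large even with grouped attention.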
Beidi Chen
@BeidiChen
May 3, 2024
📢 Our new work LESS leverages the observation that attention in pretrained LLMs has an intrinsically sparse + low-rank structure. ☝️So at inference time, we can decompose KV Cache into constant sparse and RNN states (because low-rank attention is an RNN). This also explains why the recent
Harry Dong
@Real_HDong
May 3, 2024
Upgrade your LLM KV cache eviction policy with LESS, our method to retain local and global information during generation with pretrained LLMs! Excited to share this at ICML! Paper: arxiv.org/abs/2402.09398 w/ @Xinyu2ML, @KyriectionZhang , Zhangyang Wang, Yuejie Chi, @BeidiChen
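The remark that low-rank attention is an RNN can be illustrated with generic kernel (linear) attention. The sketch below is not the LESS method itself: the feature map `phi` (elu + 1), the toy inputs, and all function names are assumptions chosen for illustration. It shows that attention scores of the form phi(q)·phi(k) give the same outputs whether every past key and value is re-read or only a fixed-size running state is kept.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise (an illustrative choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_over_history(qs, ks, vs):
    """Causal kernel attention that re-reads all cached keys and values at each step."""
    outs = []
    for t, q in enumerate(qs):
        w = [dot(phi(q), phi(ks[s])) for s in range(t + 1)]
        total = sum(w)
        outs.append([sum(w[s] * vs[s][j] for s in range(t + 1)) / total
                     for j in range(len(vs[0]))])
    return outs

def attention_as_rnn(qs, ks, vs):
    """Same outputs from a fixed-size state: S (d x d_v) and z (d), updated per token."""
    d, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d)]
    z = [0.0] * d
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        for i in range(d):
            z[i] += fk[i]                    # running sum of phi(k)
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]      # running sum of phi(k) v^T
        fq = phi(q)
        denom = dot(fq, z)
        outs.append([dot(fq, [S[i][j] for i in range(d)]) / denom for j in range(d_v)])
    return outs

qs = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
ks = [[1.0, 0.0], [-0.5, 2.0], [0.3, -0.7]]
vs = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
full = attention_over_history(qs, ks, vs)
recurrent = attention_as_rnn(qs, ks, vs)
max_diff = max(abs(x - y) for a, b in zip(full, recurrent) for x, y in zip(a, b))
```

The recurrent version stores only `S` and `z`, whose sizes do not grow with the number of generated tokens, whereas the first version needs the whole key and value history.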
Beidi Chen
@BeidiChen
Aug 21, 2024
🤯This study explains my year-long confusion about why the #GPT4 leak says OpenAI deployed speculative decoding in their serving last June by @dylan522p @SemiAnalysis_ because I thought SD was only useful for small batches... Surprisingly speculative decoding can bring more benefits when
AK
@_akhaliq
Aug 21, 2024
MagicDec Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding discuss: huggingface.co/papers/2408.11… Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and