Excited to share some life updates 🥳📢:
I'll be starting as an Assistant Professor @CarnegieMellon @CMU_ECE in Fall 2023. Until then, I'll be a visiting researcher at @Meta @metaai.
I'm heading to #ICML2022 tmr!!! DM if you want to catch up 😃☕️🍱...
Beidi Chen
609 posts
Asst. Prof @CarnegieMellon, @amazon Scholar, Prev: Visiting Researcher @Meta, Postdoc @Stanford, Ph.D. @RiceUniversity, Large-Scale ML, a fan of Dota2.
Joined November 2011
- 🥳We're recruiting PhD students at CMU for Fall 2025! If you are interested in machine-learning algorithms and systems (🔑Keywords: new model arch, LLM reasoning, long-context modeling, efficiency, etc.), please mention my name in your application~ 👇 Application links: (Dec 15)
- Replying to @bubbleboi and @beidi: Not Llama, 30K commits for developing MagicPig for Llama to counter NVIDIA's monopoly 😉 @chenzhuoming911 : github.com/Infini-AI-Lab/…
- 🫢 oops someone discovered our secret summer proj to counter NVIDIA's monopoly @chenzhuoming911 github.com/Infini-AI-Lab/… Quoted post: Most insane github I've ever seen in my life lol.
- 📢 Announcing our new speculative decoding framework Sequoia ❗️❗️❗️ It can now serve Llama2-70B on one RTX4090 with half-second/token latency (exact❗️no approximation) 🤔Sounds slow as a sloth 🦥🦥🦥??? Fun fact😛: DeepSpeed -> 5.3s / token; 8 x A100: 25ms / token (costs 8 x
- Can sparse training achieve wall-clock time speed up on GPU? Yes! Simple and static #sparsity -> 2.5x faster🚀 training MLP-Mixer, ViT, and GPT-2 medium from scratch with NO drop in accuracy. arxiv.org/abs/2110.15343 (#NeurIPS2021) arxiv.org/abs/2112.00029 [1/6]
- ⏰📢After years of working on long-context efficiency, I’ve started to doubt if it’s truly necessary (Many of you have probably noticed the decline of interest in long-context LLMs). Despite strong models like Gemini, short-context + retrieval often does the trick: faster, cheaper, and… Quoted post: 🚀 RAG vs. Long-Context LLMs: The Real Battle ⚔️ 🤯Turns out, simple-to-build RAG can match million-dollar long-context LLMs (LC LLMs) on most existing benchmarks. 🤡So, do we even need long-context models? YES. Because today’s benchmarks are flawed: ⛳ Too Simple –
- ❓Wanna host a Llama2-7B-128K (14GB weight + 64GB KV cache) at home🤔 📢 Introducing TriForce! 🚀Lossless Ultra-Fast Long Seq Generation, training-free Spec Dec! 🌟 🔥 TriForce serves with 0.1s/token on 2 RTX4090s + CPU – only 2x slower than on an A100 (~55ms on chip), 8x faster
- 📢My group at @CMU_ECE is looking for Ph.D. students in #Algorithms #MLSys (ddl Dec 15)! Let’s shed new light on classical algorithms, make ML more accessible to the general community, and advance interdisciplinary research (science?!) together! 🙏Plz help spread the word.
- I was asked many times lately what repo to use by students who’re working on test-time scaling with slightly modified attention or generation workflows (customized reward model / search). HF is a bit too time-consuming, esp. with tons of token generation, and SGLang/vLLM are a bit hard… Quoted post: 🧵 Glad to introduce LiteSys, the inference framework we used in 📄 Kinetics: Rethinking Test-Time Scaling Laws (arxiv.org/abs/2506.05333) to evaluate test-time scaling (32K+ generated tokens) at scale. If you are: ✅ Looking for an inference framework that's easy to extend. 🐢
- 📢🔥 New off-policy RL for LLMs: now training a 32B model with 200+ stale steps for the first time, while still matching on-policy accuracy 💪 A big step toward scalable & decentralized agent training 😉 Quoted post: 🤔Can we train RL on LLMs with extremely stale data? 🚀Our latest study says YES! Stale data can be as informative as on-policy data, unlocking more scalable, efficient asynchronous RL for LLMs. We introduce M2PO, an off-policy RL algorithm that keeps training stable and
- Did you know the KV cache can easily take 160GB on Llama2-70B (e.g., 8K seqlen + batch size 64), even though it has multi-group attention? Come and see our preliminary work on how to use a super simple cache eviction policy to reduce this bottleneck! There’re huge opportunities in this space 🫵🏻 Quoted post: We will present H2O tomorrow in the poster session of the ES-FoMo Workshop #ICML2023 at 1:00 p.m. - 2:00 p.m. (Sat. 29 July). Please join us and chat!
- 📢 Our new work LESS leverages the observation that attention in pretrained LLMs has an intrinsically sparse + low-rank structure. ☝️So at inference time, we can decompose the KV cache into constant-size sparse and RNN states (because low-rank attention is an RNN). This also explains why the recent… Quoted post: Upgrade your LLM KV cache eviction policy with LESS, our method to retain local and global information during generation with pretrained LLMs! Excited to share this at ICML! Paper: arxiv.org/abs/2402.09398 w/ @Xinyu2ML, @KyriectionZhang, Zhangyang Wang, Yuejie Chi, @BeidiChen
- 🤯This study explains my year-long confusion on why the #GPT4 leak by @dylan522p @SemiAnalysis_ says OpenAI deployed speculative decoding in their serving last June, because I thought SD is only useful for small batches... Surprisingly, speculative decoding can bring more benefits when… Quoted post: MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding discuss: huggingface.co/papers/2408.11… Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and
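
The KV-cache sizes quoted in the posts above (64GB for Llama2-7B-128K, 160GB for Llama2-70B at 8K seqlen and batch size 64) can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not anything from the posts themselves: it assumes fp16 storage (2 bytes per element), the publicly released Llama 2 configurations (32 layers and 32 KV heads for 7B, 80 layers and 8 grouped KV heads for 70B, head dimension 128), that "128K" means 131,072 tokens, and that "GB" is read as GiB. The function name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Approximate KV-cache size in bytes.

    The leading 2 counts one key tensor plus one value tensor per layer.
    bytes_per_elem=2 assumes fp16 storage (an assumption, not from the posts).
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 1024 ** 3

# Llama2-7B (assumed config: 32 layers, 32 KV heads, head_dim 128),
# one sequence of 128K tokens.
llama2_7b_128k = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                                seq_len=128 * 1024, batch=1)

# Llama2-70B (assumed config: 80 layers, 8 grouped KV heads, head_dim 128),
# 8K tokens per sequence, batch size 64.
llama2_70b_8k_b64 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                                   seq_len=8 * 1024, batch=64)

print(f"Llama2-7B-128K, batch 1: {llama2_7b_128k / GIB:.0f} GiB")
print(f"Llama2-70B, 8K x 64:     {llama2_70b_8k_b64 / GIB:.0f} GiB")
```

Under these assumptions the two numbers come out to 64 GiB and 160 GiB, matching the posts. The cache grows linearly with both sequence length and batch size, while grouped KV heads only cut the per-token cost by a constant factor.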


















