Jeremy Bernstein (@jxbz) / X

1,050 posts

🧪 @thinkymachines ✍️ anon feedback @ admonymous.co/jxbz

🌉

Joined January 2010

Pinned
Jeremy Bernstein
@jxbz
Mar 7, 2025
I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning (1/11)
Jeremy Bernstein
@jxbz
Oct 24, 2022
I'm so excited to share my PhD thesis publicly. It's 88 pages long, I wrote it from scratch, and I tried to create a useful document for anyone wanting to gain a rigorous but unfussy understanding of the mathematics of deep learning. It's here: arxiv.org/abs/2210.10101 (1/7)
Jeremy Bernstein
@jxbz
Sep 29, 2024
It's really fun to think about gradient descent on an inner product space! But I actually don't think an inner product is a rich enough structure to deal with deep learning optimization
Simo Ryu
@cloneofsimo
Sep 27, 2024
99% of the ML engineers I've talked to don't realize that the gradient is actually defined via an inner product. If you have a different definition of the inner product (metric), you have a different definition of the gradient. This will free up your mind.. now preconditioning and layerwise lr tuning makes lot
Jeremy Bernstein
@jxbz
Apr 14, 2023
We are thrilled to announce "automatic gradient descent"---a neural network optimiser without hyperparameters. AGD trains out-of-the-box and at ImageNet scale. paper: arxiv.org/abs/2304.05187 PyTorch: github.com/jxbz/agd 1/5
arxiv.org
Automatic Gradient Descent: Deep Learning without Hyperparameters
The architecture of a deep neural network is defined explicitly in terms of the number of layers, the width of each layer and the general network topology. Existing optimisation frameworks neglect...
Jeremy Bernstein
@jxbz
May 13, 2025
I was really grateful to have the chance to speak at @Cohere_Labs and @ml_collective last week. My goal was to make the most helpful talk that I could have seen as a first-year grad student interested in neural network optimization. Sharing some info about the talk here... (1/6)
Jeremy Bernstein
@jxbz
Oct 12, 2024
This is a beautiful illustration of an apparent paradox in deep learning that ~the weights don't move~. I think we resolved this paradox in prior work, so I just want to share our perspective. And before you ask: yes, it's a question of norms 😅 (1/6)
Nora Belrose
@norabelrose
Oct 11, 2024
If you make a drawing in the weight matrices of your neural network at initialization, it will likely still be visible at the end of training arxiv.org/abs/2012.02550
Jeremy Bernstein
@jxbz
Oct 29, 2024
Over the past month, methods developed by myself and my collaborators were used to set new speed records for training LLMs up to 1.5B scale. I also want to help the science go faster, so now get ready for: ~The General Theory of Modular Duality~ arxiv.org/abs/2410.21265 (1/9)
arxiv.org
Modular Duality in Deep Learning
An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside....
Jeremy Bernstein
@jxbz
Sep 26, 2025
I wrote this blog post that tries to go further toward design principles for neural nets and optimizers. The post presents a visual intro to optimization on normed manifolds and a Muon variant for the manifold of matrices with unit condition number.
Thinking Machines
@thinkymachines
Sep 26, 2025
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
Jeremy Bernstein
@jxbz
Nov 5, 2024
I made some visual aids about modular duality for a workshop presentation and I wanted to share them here too. First, vanilla gradient descent does not type check! Gradients have the right units to dot product with weights but not to subtract from weights. (1/4)
Jeremy Bernstein
@jxbz
Mar 1, 2025
It's been wild to see our work on Muon and the anthology start to get scaled up by the big labs. After @Kimi_Moonshot released Moonlight, people have asked whether Muon is compatible with muP. I wanted to write up an explainer, as there is something deeper going on here! (1/8)
Jeremy Bernstein
@jxbz
Aug 3, 2024
So Shampoo has been getting some renewed attention for winning one of the inaugural AlgoPerf challenges. I wanted to understand what the method is doing, so I employed my favourite trick of just ~directly interpreting the pseudocode~ (1/8)
Jeremy Bernstein
@jxbz
Jun 15, 2020
We characterised the optimisation landscape of neural networks, and the result is you don't have to tune the learning rate any more. Joint with @ArashVahdat, @yisongyue, and @liu_mingyu. paper: arxiv.org/abs/2002.03432 blog: jeremybernste.in/blog/getting-t… code: github.com/jxbz/fromage
Jeremy Bernstein
@jxbz
Aug 3, 2025
I had wondered why there was no official Dion implementation by the authors... I guess now we know. This repository looks dope: FSDP Muon and Dion implementations, triton kernels for Newton-Schulz, and lots of practical advice (1/2)
Jeremy Bernstein
@jxbz
Jul 12, 2020
Madam (multiplicative adam) needs little to no learning rate tuning and brings the numerical representation of the synapse closer to neuroscience. w/ @jiawzhao, @mameister4, @liu_mingyu, @AnimaAnandkumar & @yisongyue paper: arxiv.org/abs/2006.14560 code: github.com/jxbz/madam