I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning
(1/11)
I'm so excited to share my PhD thesis publicly. It's 88 pages long, I wrote it from scratch, and I tried to create a useful document for anyone wanting to gain a rigorous but unfussy understanding of the mathematics of deep learning.
It's here: arxiv.org/abs/2210.10101
(1/7)
It's really fun to think about gradient descent on an inner product space! But I actually don't think an inner product is a rich enough structure to deal with deep learning optimization
99% of the ML engineers I've talked to don't realize the gradient is actually defined via an inner product. If you have a different definition of the inner product (metric), you have a different definition of the gradient. This will free up your mind... now preconditioning and layerwise lr tuning make a lot…
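A minimal sketch of the inner-product point, in plain Python. Everything here is an illustrative assumption: the toy quadratic loss, the diagonal metric, and the helper names are hypothetical. Under the inner product <u, v>_M = sum_i m_i u_i v_i, the gradient is whatever vector g satisfies <g, v>_M = df(w)[v] for every direction v, which gives g_i = (partial_i f) / m_i. With a diagonal metric this is just a per-coordinate learning rate.

```python
# Sketch: the gradient depends on the choice of inner product.
# Assumption: a diagonal metric <u, v>_M = sum_i m_i * u_i * v_i (hypothetical example).

def loss_partials(w):
    """Partial derivatives of the toy loss f(w) = 0.5 * (w[0]**2 + 100 * w[1]**2)."""
    return [w[0], 100.0 * w[1]]

def inner(u, v, metric_diag):
    """Inner product induced by a diagonal metric."""
    return sum(m * ui * vi for m, ui, vi in zip(metric_diag, u, v))

def metric_grad(w, metric_diag):
    # The gradient g satisfies <g, v>_M = df(w)[v] for all v, so g_i = partial_i / m_i.
    return [p / m for p, m in zip(loss_partials(w), metric_diag)]

def descend(w, metric_diag, lr, steps):
    """Gradient descent where the gradient is taken with respect to the given metric."""
    for _ in range(steps):
        g = metric_grad(w, metric_diag)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

euclidean = [1.0, 1.0]    # standard dot product
curvature = [1.0, 100.0]  # metric matched to the curvature of the toy loss
w0 = [1.0, 1.0]

plain = descend(w0, euclidean, lr=0.01, steps=50)
precond = descend(w0, curvature, lr=0.5, steps=50)
```

With the Euclidean metric the step size is capped by the stiff coordinate (it must stay below 2/100 here), so the flat coordinate crawls. With the curvature-matched metric both coordinates shrink at the same rate, which is the sense in which preconditioning and per-layer learning rates amount to a change of metric.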
We are thrilled to announce "automatic gradient descent"---a neural network optimiser without hyperparameters. AGD trains out-of-the-box and at ImageNet scale.
paper: arxiv.org/abs/2304.05187
PyTorch: github.com/jxbz/agd
1/5
I was really grateful to have the chance to speak at @Cohere_Labs and @ml_collective last week. My goal was to make the most helpful talk that I could have seen as a first-year grad student interested in neural network optimization. Sharing some info about the talk here...
(1/6)
This is a beautiful illustration of an apparent paradox in deep learning that ~the weights don't move~
I think we resolved this paradox in prior work, so I just want to share our perspective. And before you ask: yes, it's a question of norms 😅
(1/6)
If you make a drawing in the weight matrices of your neural network at initialization, it will likely still be visible at the end of training arxiv.org/abs/2012.02550
Over the past month, methods developed by my collaborators and me were used to set new speed records for training LLMs up to 1.5B scale. I also want to help the science go faster, so now get ready for:
~The General Theory of Modular Duality~
arxiv.org/abs/2410.21265
(1/9)
I wrote this blog post that tries to go further toward design principles for neural nets and optimizers
The post presents a visual intro to optimization on normed manifolds and a Muon variant for the manifold of matrices with unit condition number
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
I made some visual aids about modular duality for a workshop presentation and I wanted to share them here too.
First, vanilla gradient descent does not type check! Gradients have the right units to dot product with weights but not to subtract from weights.
(1/4)
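One way to write an update that does type check, as a sketch in notation assumed here rather than taken from the slides: G is the gradient of the loss with respect to a weight matrix W, η is a step size, and ‖·‖ is whichever norm is chosen for the weight space. The gradient is first mapped back into weight space by a duality map, and only then subtracted.

```latex
\[
  W \leftarrow W - \eta \,\operatorname{dualize}(G),
  \qquad
  \operatorname{dualize}(G) = \operatorname*{arg\,max}_{\|T\| \le 1} \langle G, T \rangle .
\]
% Example: for the spectral norm, with reduced SVD G = U \Sigma V^\top,
\[
  \operatorname{dualize}(G) = U V^\top .
\]
```

For the Frobenius norm the duality map only rescales G, which is why vanilla gradient descent looks like it type checks even though the conversion step is hidden.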
It's been wild to see our work on Muon and the anthology start to get scaled up by the big labs.
After @Kimi_Moonshot released Moonlight, people have asked whether Muon is compatible with muP. I wanted to write up an explainer, as there is something deeper going on here!
(1/8)
So Shampoo has been getting some renewed attention for winning one of the inaugural AlgoPerf challenges. I wanted to understand what the method is doing, so I employed my favourite trick of just ~directly interpreting the pseudocode~
(1/8)
I had wondered why there was no official Dion implementation by the authors... I guess now we know. This repository looks dope: FSDP Muon and Dion implementations, Triton kernels for Newton-Schulz, and lots of practical advice
(1/2)
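For readers who have not met Newton-Schulz before, here is a minimal sketch in plain Python. It uses the classical cubic iteration X ← 1.5·X − 0.5·X·Xᵀ·X, not the tuned variants or Triton kernels in the repository mentioned above, and the helper names and the 2×2 example matrix are hypothetical. After scaling so that all singular values are at most 1, the iteration pushes every nonzero singular value toward 1, so for a full-rank input it converges to the orthogonal factor U·Vᵀ of the SVD.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz(g, steps=30):
    """Approximately orthogonalize g with the classical cubic Newton-Schulz iteration."""
    # Dividing by the Frobenius norm guarantees all singular values are at most 1,
    # which keeps the iteration inside its region of convergence.
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, whose stable fixed point is 1.
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

g = [[2.0, 1.0], [0.0, 1.0]]    # hypothetical full-rank "gradient"
q = newton_schulz(g)
qqt = matmul(q, transpose(q))   # close to the identity if q is orthogonal
```

The appeal for optimizers is that the loop uses only matrix multiplications, so it maps well onto accelerators, whereas an explicit SVD does not.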