Jeremy Bernstein
1,050 posts
Jeremy Bernstein
@jxbz
🧪 @thinkymachines ✍️ anon feedback @ admonymous.co/jxbz
🌉
jeremybernste.in
Joined January 2010
692 Following · 7,859 Followers
  • Pinned
    Jeremy Bernstein
    @jxbz
    Mar 7, 2025
    I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning (1/11)
    126K
  • Jeremy Bernstein
    @jxbz
    Oct 24, 2022
    I'm so excited to share my PhD thesis publicly. It's 88 pages long, I wrote it from scratch, and I tried to create a useful document for anyone wanting to gain a rigorous but unfussy understanding of the mathematics of deep learning. It's here: arxiv.org/abs/2210.10101 (1/7)
  • Jeremy Bernstein
    @jxbz
    Sep 29, 2024
    It's really fun to think about gradient descent on an inner product space! But I actually don't think an inner product is a rich enough structure to deal with deep learning optimization
    Simo Ryu
    @cloneofsimo
    Sep 27, 2024
    99% ML engineer i talked to don't realize gradient is actually defined via inner product. If you have a different definition of inner product (metric), you have different definition of gradient. This will free up your mind.. now preconditioning and layerwise lr tuning makes lot
    170K
  • Jeremy Bernstein
    @jxbz
    Apr 14, 2023
    We are thrilled to announce "automatic gradient descent"---a neural network optimiser without hyperparameters. AGD trains out-of-the-box and at ImageNet scale. paper: arxiv.org/abs/2304.05187 PyTorch: github.com/jxbz/agd 1/5
    Automatic Gradient Descent: Deep Learning without Hyperparameters
    The architecture of a deep neural network is defined explicitly in terms of the number of layers, the width of each layer and the general network topology. Existing optimisation frameworks neglect...
    146K
  • Jeremy Bernstein
    @jxbz
    May 13, 2025
    I was really grateful to have the chance to speak at @Cohere_Labs and @ml_collective last week. My goal was to make the most helpful talk that I could have seen as a first-year grad student interested in neural network optimization. Sharing some info about the talk here... (1/6)
    55K
  • Jeremy Bernstein
    @jxbz
    Oct 12, 2024
    This is a beautiful illustration of an apparent paradox in deep learning that ~the weights don't move~ I think we resolved this paradox in prior work, so I just want to share our perspective. And before you ask: yes, it's a question of norms 😅 (1/6)
    Nora Belrose
    @norabelrose
    Oct 11, 2024
    If you make a drawing in the weight matrices of your neural network at initialization, it will likely still be visible at the end of training arxiv.org/abs/2012.02550
    100K
  • Jeremy Bernstein
    @jxbz
    Oct 29, 2024
    Over the past month, methods developed by myself and my collaborators were used to set new speed records for training LLMs up to 1.5B scale. I also want to help the science go faster, so now get ready for: ~The General Theory of Modular Duality~ arxiv.org/abs/2410.21265 (1/9)
    Modular Duality in Deep Learning
    An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside....
    60K
  • Jeremy Bernstein
    @jxbz
    Sep 26, 2025
    I wrote this blog post that tries to go further toward design principles for neural nets and optimizers. The post presents a visual intro to optimization on normed manifolds and a Muon variant for the manifold of matrices with unit condition number
    Thinking Machines
    @thinkymachines
    Sep 26, 2025
    Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
    65K
  • Jeremy Bernstein
    @jxbz
    Nov 5, 2024
    I made some visual aids about modular duality for a workshop presentation and I wanted to share them here too. First, vanilla gradient descent does not type check! Gradients have the right units to dot product with weights but not to subtract from weights. (1/4)
    52K
  • Jeremy Bernstein
    @jxbz
    Mar 1, 2025
    It's been wild to see our work on Muon and the anthology start to get scaled up by the big labs. After @Kimi_Moonshot released Moonlight, people have asked whether Muon is compatible with muP. I wanted to write up an explainer, as there is something deeper going on here! (1/8)
    69K
  • Jeremy Bernstein
    @jxbz
    Aug 3, 2024
    So Shampoo has been getting some renewed attention for winning one of the inaugural AlgoPerf challenges. I wanted to understand what the method is doing, so I employed my favourite trick of just ~directly interpreting the pseudocode~ (1/8)
    97K
  • Jeremy Bernstein
    @jxbz
    Jun 15, 2020
    We characterised the optimisation landscape of neural networks, and the result is you don't have to tune the learning rate any more. Joint with @ArashVahdat, @yisongyue, and @liu_mingyu. paper: arxiv.org/abs/2002.03432 blog: jeremybernste.in/blog/getting-t… code: github.com/jxbz/fromage
    How does local knowledge about a function break down? In optimisation theory it is standard to assume that local knowledge breaks down as a quadratic function of parameter distance. We visualise this assumption on the left. Since neural networks are compositional, parameters in different layers interact with each other. This means that a quadratic breakdown is unrealistic. Our proposed trust region---deep relative trust---is visualised on the right.
  • Jeremy Bernstein
    @jxbz
    Aug 3, 2025
    I had wondered why there was no official Dion implementation by the authors... I guess now we know. This repository looks dope: FSDP Muon and Dion implementations, triton kernels for Newton-Schulz, and lots of practical advice (1/2)
    75K
  • Jeremy Bernstein
    @jxbz
    Jul 12, 2020
    Madam (multiplicative adam) needs little to no learning rate tuning and brings the numerical representation of the synapse closer to neuroscience. w/ @jiawzhao, @mameister4, @liu_mingyu, @AnimaAnandkumar & @yisongyue paper: arxiv.org/abs/2006.14560 code: github.com/jxbz/madam
    Visualising proposed number systems in neuroscience and engineering.
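One of the quoted posts in the timeline above notes that the gradient is defined via an inner product, so a different inner product (metric) gives a different gradient. The following is a minimal sketch of that statement, not code from any of the linked papers or repositories. The test function `f`, the metric `rescaled`, and the helper names are all hypothetical, chosen only for illustration. Under the inner product <u, v>_M = uᵀ M v, the gradient is the vector g satisfying <g, v>_M = df(v) for every direction v, which works out to M⁻¹ applied to the vector of partial derivatives.

```python
# Sketch: the gradient of f depends on the choice of inner product.
# Assumption: <u, v>_M = u^T M v with M a symmetric positive definite 2x2 matrix.
# The gradient g_M is defined by <g_M, v>_M = df(v), hence g_M = M^{-1} * partials.

def partials(w):
    """Partial derivatives of the hypothetical test function f(w) = w0^2 + 10 * w1^2."""
    return [2.0 * w[0], 20.0 * w[1]]

def solve_2x2(M, b):
    """Solve M x = b for a 2x2 matrix M using Cramer's rule."""
    (a, c), (d, e) = M
    det = a * e - c * d
    return [(b[0] * e - c * b[1]) / det, (a * b[1] - d * b[0]) / det]

def inner(u, v, M):
    """Inner product <u, v>_M = u^T M v."""
    Mv = [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]
    return u[0] * Mv[0] + u[1] * Mv[1]

def gradient(w, M):
    """Gradient of f at w with respect to the inner product induced by M."""
    return solve_2x2(M, partials(w))

w = [1.0, 1.0]
euclidean = [[1.0, 0.0], [0.0, 1.0]]   # standard dot product
rescaled = [[2.0, 0.0], [0.0, 20.0]]   # hypothetical metric that rescales each axis

g_euclid = gradient(w, euclidean)   # equals the vector of partial derivatives
g_metric = gradient(w, rescaled)    # a different vector for the same function and point
```

With the identity metric the gradient coincides with the partial derivatives, while the rescaled metric yields a differently scaled direction at the same point. In this toy setting, choosing a metric plays the role of a diagonal preconditioner.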
