Damien Ferbach

@damien_ferbach

PhD at @Mila_Quebec. Opinions my own and do not represent any affiliated organizations.

Montréal, Québec

damienferbach.github.io

Joined February 2022

Pinned
Damien Ferbach
@damien_ferbach
May 26, 2025
It's very difficult to improve the *exponent* in scaling laws for loss vs compute, especially by changing the optimizer! Our new paper shows that scaling momentum correctly can *provably* improve the scaling exponent on a theoretical model. Empirically, it works on LSTMs too!
Damien Ferbach
@damien_ferbach
Nov 9, 2023
Recent work by @SamuelAinsworth provides empirical evidence of low-loss linear paths between networks modulo permutations. In our work with amazing collaborators @baptistegoujaud, @gauthier_gidel, @AymericD4, we give theoretical insights on this phenomenon! arxiv.org/pdf/2310.19103…
Damien Ferbach
@damien_ferbach
Sep 25, 2024
I am delighted to share that our paper has been accepted at #NeurIPS as a spotlight! 🚀 A huge thanks to my amazing collaborators @Qu3ntinB, @bose_joey and my supervisor @gauthier_gidel!!
Damien Ferbach
@damien_ferbach
Jul 24, 2024
Retraining generative models solely on their own synthetic data leads to model collapse. But what if the data was curated? With @Qu3ntinB, @bose_joey, @gauthier_gidel we show that retraining on curated data implicitly optimizes for a reward model! 🚀 arxiv.org/pdf/2407.09499
Damien Ferbach
@damien_ferbach
Dec 11, 2024
Come check out our poster now!!! Poster 2510, East Ballroom. @bose_joey @gauthier_gidel @Qu3ntinB @Mila_Quebec
Damien Ferbach
@damien_ferbach
Jul 24, 2024
Retraining generative models solely on their own synthetic data leads to model collapse. But what if the data was curated? With @Qu3ntinB, @bose_joey, @gauthier_gidel we show that retraining on curated data implicitly optimizes for a reward model! 🚀 arxiv.org/pdf/2407.09499
Damien Ferbach
@damien_ferbach
May 26, 2025
Replying to @damien_ferbach
Title: Dimension-adapted Momentum Outscales SGD
Link: arxiv.org/pdf/2505.16098
Work done with amazing collaborators @_katieeverett @gauthier_gidel @poseypaquet @cypaquette
Related 🧵: x.com/_katieeverett/… x.com/_katieeverett/…
Katie Everett
@_katieeverett
May 25, 2025
There were so many great replies to this thread, let's do a Part 2! For scaling laws between loss and compute, where loss = a * flops ^ b + c, which factors change primarily the constant (a) and which factors can actually change the exponent (b)? x.com/_katieeverett/…
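To make the distinction between the constant (a) and the exponent (b) concrete, here is a minimal Python sketch of the form loss = a * flops ^ b + c quoted above. The function names and the numeric values in the usage note are illustrative assumptions, not taken from the thread or the paper. On a log-log plot of reducible loss (loss minus c) against compute, changing a only shifts the line, while changing b changes its slope.

```python
import math


def scaling_law_loss(flops, a, b, c):
    """Loss predicted by loss = a * flops**b + c (b < 0 means loss falls with compute)."""
    return a * flops ** b + c


def loglog_slope(flops_lo, flops_hi, a, b, c):
    """Slope of log(loss - c) against log(flops) between two compute budgets.

    For an exact power law this recovers the exponent b, whatever the value of a.
    """
    y_lo = math.log(scaling_law_loss(flops_lo, a, b, c) - c)
    y_hi = math.log(scaling_law_loss(flops_hi, a, b, c) - c)
    return (y_hi - y_lo) / (math.log(flops_hi) - math.log(flops_lo))
```

For example, with hypothetical values a = 2.0, b = -0.5, c = 0.1, doubling a to 4.0 leaves `loglog_slope(1e3, 1e6, ...)` at about -0.5, whereas moving b to -0.7 moves the slope to about -0.7.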
Damien Ferbach
@damien_ferbach
May 26, 2025
Replying to @damien_ferbach
Our paper derives momentum schedules that are functions of both the model dimension and data distribution.
* On our theoretical model, this provably improves the scaling law exponents in many regimes!
* And, this exponent improvement holds on LSTM experiments on C4.
Damien Ferbach
@damien_ferbach
Dec 9, 2024
I will be at NeurIPS in Vancouver this week to present our work on self-consuming generative models. arxiv.org/abs/2407.09499 Please reach out to talk about high-dimensional optimization and generative models!
arxiv.org
Self-Consuming Generative Models with Curated Data Provably...
The rapid progress in generative models has resulted in impressive leaps in generation quality, blurring the lines between synthetic and real data. Web-scale datasets are now prone to the...
Damien Ferbach
@damien_ferbach
May 26, 2025
Replying to @damien_ferbach
🚨Takeaway: Depending on the data complexity, the compute-optimal training regime of our proposed Dimension-adapted Momentum method requires undertraining or overtraining relative to the Chinchilla law!🚨
Damien Ferbach
@damien_ferbach
May 26, 2025
Replying to @damien_ferbach
For optimal performance, optimizer hyperparameters should be scaled quantities, including the learning rate, batch size, and even epsilon. But there has been very little study of *how to scale momentum*. We often treat momentum-related hparams (e.g. beta1, beta2) as constants.
Damien Ferbach
@damien_ferbach
May 26, 2025
Replying to @damien_ferbach
Our framework can be generalized beyond DANA-constant and DANA-decay to a whole space called DANA defined by the LR scaling gamma_3(t) = d^{-kappa_2}(1+t)^{-kappa_3}. DANA-constant and DANA-decay are extremal points of the stability boundary. DANA-decay is optimal in this class.
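As a rough illustration of the scaling gamma_3(t) = d^{-kappa_2}(1+t)^{-kappa_3}, the Python sketch below simply evaluates the formula. The function name `dana_gamma3` is hypothetical, and the thread does not say which (kappa_2, kappa_3) values correspond to DANA-constant or DANA-decay, so the values in the usage note are assumptions chosen only to show the two kinds of dependence.

```python
def dana_gamma3(t, d, kappa_2, kappa_3):
    """Evaluate gamma_3(t) = d**(-kappa_2) * (1 + t)**(-kappa_3).

    t       : iteration count (t >= 0)
    d       : model dimension
    kappa_2 : exponent controlling the dependence on dimension
    kappa_3 : exponent controlling the power-law decay in time
    """
    return d ** (-kappa_2) * (1 + t) ** (-kappa_3)
```

With kappa_3 = 0 the value depends only on the dimension and stays constant over training; with kappa_2 = 0 it is dimension-independent and decays as a power law in t. For instance, `dana_gamma3(3, 100, 0.0, 0.5)` evaluates (1 + 3)^(-0.5), which is 0.5.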
Damien Ferbach
@damien_ferbach
May 26, 2025
Replying to @damien_ferbach
None of the usual suspects improve the scaling exponent on this problem: classic momentum, Nesterov, Schedule-Free SGD, and Adam do not outscale SGD. (Note especially that adaptive optimizers aren't useful on PLRF: the problem is in some sense "too simple to need Adam".)
Damien Ferbach
@damien_ferbach
May 26, 2025
Replying to @damien_ferbach
Wrap-up: we proved on the PLRF model that outscaling is possible with correct HPs:
❌ SGD-M scales like SGD
🟧 DANA-constant outscales but needs very small LR
✅ DANA-decay scales best with the proper power-law LR schedule.
DANA-decay is a very promising optimizer at scale!!
Damien Ferbach
@damien_ferbach
Oct 15, 2024
Dreaming of pushing the boundaries in AI? 🌟 Apply for MSc or PhD at Mila, a world-leading research hub, for Fall 2025! 🚀 #AI #Research #Mila #PhD #MSc
Mila - Institut québécois d'IA
@Mila_Quebec
Oct 14, 2024
Mila's annual supervision request process opens on October 15 to receive MSc and PhD applications for Fall 2025 admission! Join our community! More information here mila.quebec/en/prospective…