Katie Everett (@_katieeverett) / X

89 posts

Katie Everett

@_katieeverett

Machine learning researcher @GoogleDeepMind + PhD student @MIT. Opinions are my own.

Joined August 2013

Katie Everett
@_katieeverett
May 22, 2025
1. We often observe power laws between loss and compute: loss = a * flops ^ b + c
2. Models are rapidly becoming more efficient, i.e. use less compute to reach the same loss
But: which innovations actually change the exponent in the power law (b) vs change only the constant (a)?
246K
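To make the distinction in the post above concrete, here is a minimal sketch that inverts loss = a * flops ^ b + c for the compute needed to reach a target loss, then compares a smaller constant (a) against a steeper exponent (b). The coefficients and the helper names `compute_to_reach` and `savings` are hypothetical, chosen only for illustration; they do not come from any paper in the thread.

```python
def compute_to_reach(target_loss, a, b, c):
    """Invert loss = a * flops**b + c for flops (needs target_loss > c and b < 0)."""
    if target_loss <= c:
        raise ValueError("target loss must exceed the irreducible loss c")
    return ((target_loss - c) / a) ** (1.0 / b)


# Hypothetical baseline coefficients, for illustration only.
A, B, C = 10.0, -0.08, 1.5


def savings(target_loss, a_new=A, b_new=B):
    """Factor by which required compute drops when (A, B) becomes (a_new, b_new)."""
    baseline = compute_to_reach(target_loss, A, B, C)
    improved = compute_to_reach(target_loss, a_new, b_new, C)
    return baseline / improved


for target in (3.0, 2.5, 2.0):
    const_gain = savings(target, a_new=0.9 * A)  # 10% smaller constant
    exp_gain = savings(target, b_new=-0.09)      # slightly steeper exponent
    print(f"target={target}: smaller a -> {const_gain:.2f}x, steeper b -> {exp_gain:.2f}x")
```

Under these assumptions, shrinking a buys the same compute multiplier at every target loss, while a steeper b buys a multiplier that keeps growing as the target approaches c.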
Katie Everett
@_katieeverett
May 25, 2025
There were so many great replies to this thread, let's do a Part 2! For scaling laws between loss and compute, where loss = a * flops ^ b + c, which factors change primarily the constant (a) and which factors can actually change the exponent (b)?
89K
Katie Everett
@_katieeverett
May 22, 2025
Replying to @_katieeverett
In summary, while data can change the power law exponent significantly, it's quite challenging to find architectures and optimizers that change the exponent! If you want to see something that *does* change the power law exponent, stay tuned and follow @damien_ferbach for more 😁
11K
Katie Everett
@_katieeverett
Jul 18, 2024
We've gotten some great questions about the notion of alignment in our width-scaling parameterization paper! arxiv.org/abs/2407.05872 A deep dive into the alignment metric and intuition 🧵 [1/16]
15K
Katie Everett
@_katieeverett
May 22, 2025
Replying to @_katieeverett
Sorscher et al 2022 shows that, in some settings, data pruning can not just improve the power law exponent but actually do better than power law scaling.
4.7K
Katie Everett
@_katieeverett
May 22, 2025
Replying to @_katieeverett
Finally, the data does seem to affect the exponent. Bahri et al 2021 shows theoretically that the exponent can depend on the data distribution (first image), and empirically adding different amounts of Gaussian noise to CIFAR-10 significantly changes the exponent (second image).
4.9K
Katie Everett
@_katieeverett
May 22, 2025
Replying to @_katieeverett
Hestness et al 2017 asked this question and concluded "model improvements only shift the error but do not appear to affect the power-law exponent". Are we doomed to chase constant factor improvements? Is there anything we can do to improve the exponent?
9.7K
Katie Everett
@_katieeverett
May 22, 2025
Replying to @_katieeverett
For architecture, Kaplan et al 2020 show LSTMs and Transformers have similar exponents on short contexts. Transformers handle long contexts better though.
7.1K
Katie Everett
@_katieeverett
May 22, 2025
Replying to @_katieeverett
For optimizers, Hestness et al 2017 compare SGD and Adam: "These power-law exponents are very similar despite the significant optimizer differences -- Adam appears to just shift the learning curve down by ~5% relative." (Let me know if you have other refs about optimizers!)
11K
Katie Everett
@_katieeverett
Aug 7, 2024
This post from the team at Graphcore has an excellent summary of our Scaling Exponents paper! arxiv.org/abs/2407.05872
Graphcore Research
@GCResearchTeam
Aug 6, 2024
Our latest edition of Papers of the Month is now available! This month we give our take on: Scaling Exponents, Million Expert MOE, Vocabulary Scaling Laws and RAG vs Long Contexts 🧵 graphcore-research.github.io/papers-of-the-…
6.8K
Katie Everett
@_katieeverett
May 22, 2025
Replying to @_katieeverett
Well...that was eight years ago! Let's look at some more recent papers and see how three factors affect the exponent vs the constant:
* the architecture
* the optimizer
* the data
6.7K
Katie Everett
@_katieeverett
May 22, 2025
Replying to @_katieeverett
Shen et al 2024 compare three linear complexity architectures against Llama as a baseline. Table 3 shows all architectures have very similar exponents:
Llama = -0.0798
TNL = -0.0768
HGRN2 = -0.0753
cosFormer2 = -0.0756
5.8K
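A rough way to see how close the exponents quoted in the post above are: compute what a 10x increase in compute does to the power-law term a * flops ^ b. This sketch assumes the quoted values play the role of b in the thread's loss = a * flops ^ b + c, and it ignores a and c, which the post does not give. The helper name `reducible_loss_ratio` is made up for this example.

```python
# Exponents as quoted in the post above (Shen et al 2024, Table 3).
exponents = {
    "Llama": -0.0798,
    "TNL": -0.0768,
    "HGRN2": -0.0753,
    "cosFormer2": -0.0756,
}


def reducible_loss_ratio(b, compute_multiplier=10.0):
    """Factor applied to a * flops**b when flops grows by compute_multiplier."""
    return compute_multiplier ** b


ratios = {name: reducible_loss_ratio(b) for name, b in exponents.items()}
for name, ratio in ratios.items():
    print(f"{name:>10}: 10x compute scales the power-law term by {ratio:.3f}")
```

Under that assumption, all four architectures shrink the power-law term by nearly the same factor per decade of compute, which is the sense in which the exponents are "very similar".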
Katie Everett
@_katieeverett
May 22, 2025
Replying to @_katieeverett
Brandfonbrener et al 2024 (train-to-train setting) compares training loss paired by compute for models trained on six different datasets. When kappa = 1, the two datasets have the same exponent. Their kappas are close to 1, but range between 0.88 and 1.13.
4.5K
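One way to read kappa in the post above, as a sketch rather than the paper's procedure: assume each dataset's training loss follows the thread's form loss = a * flops ^ b + c, and that the two reducible losses at matched compute are related by a power kappa. Then kappa is the slope of one log reducible loss against the other, which equals the ratio of the two exponents. All coefficients below are hypothetical, and the irreducible losses are treated as known.

```python
import math


def power_law(flops, a, b, c):
    """Loss under the thread's form: a * flops**b + c."""
    return a * flops ** b + c


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Hypothetical coefficients for two datasets, for illustration only.
A0, B0, C0 = 8.0, -0.080, 1.8
A1, B1, C1 = 12.0, -0.088, 2.1

# Pair the two losses at the same compute budgets.
flops = [10.0 ** k for k in range(17, 23)]
log_red0 = [math.log(power_law(f, A0, B0, C0) - C0) for f in flops]
log_red1 = [math.log(power_law(f, A1, B1, C1) - C1) for f in flops]

kappa = fit_slope(log_red0, log_red1)
print(f"kappa = {kappa:.3f}, b1 / b0 = {B1 / B0:.3f}")
```

With this convention, kappa = 1 means the two exponents match, and a kappa of about 1.1 corresponds to one exponent being about 10% larger in magnitude than the other.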
Katie Everett
@_katieeverett
Jul 23, 2024
Come chat with me and @Locchiu at our ICML poster session 1:30-3pm CEST (Vienna time) today at Hall C 4-9 #2500 and see how our theory lets all parameterizations perform hyperparameter transfer! arxiv.org/abs/2407.05872
66K