Katie Everett
89 posts
@_katieeverett
Machine learning researcher @GoogleDeepMind + PhD student @MIT. Opinions are my own.
Joined August 2013
631 Following · 2,630 Followers
  • Katie Everett
    @_katieeverett
    May 22, 2025
    1. We often observe power laws between loss and compute: loss = a * flops ^ b + c
    2. Models are rapidly becoming more efficient, i.e. use less compute to reach the same loss
    But: which innovations actually change the exponent in the power law (b) vs change only the constant (a)?
    246K
  • Katie Everett
    @_katieeverett
    May 25, 2025
    There were so many great replies to this thread, let's do a Part 2! For scaling laws between loss and compute, where loss = a * flops ^ b + c, which factors change primarily the constant (a) and which factors can actually change the exponent (b)?
    89K
  • Katie Everett
    @_katieeverett
    May 22, 2025
    Replying to @_katieeverett
    In summary, while data can change the power law exponent significantly, it's quite challenging to find architectures and optimizers that change the exponent! If you want to see something that *does* change the power law exponent, stay tuned and follow @damien_ferbach for more 😁
    11K
  • Katie Everett
    @_katieeverett
    Jul 18, 2024
    We've gotten some great questions about the notion of alignment in our width-scaling parameterization paper! arxiv.org/abs/2407.05872 A deep dive into the alignment metric and intuition 🧵 [1/16]
    Image
    Image
    15K
  • Katie Everett
    @_katieeverett
    May 22, 2025
    Replying to @_katieeverett
    Sorscher et al 2022 show that, in some settings, data pruning can not only improve the power law exponent but actually beat power law scaling.
    Image
    4.7K
  • Katie Everett
    @_katieeverett
    May 22, 2025
    Replying to @_katieeverett
    Finally, the data does seem to affect the exponent. Bahri et al 2021 shows theoretically that the exponent can depend on the data distribution (first image), and empirically adding different amounts of Gaussian noise to CIFAR-10 significantly changes the exponent (second image).
    Image
    Image
    4.9K
  • Katie Everett
    @_katieeverett
    May 22, 2025
    Replying to @_katieeverett
    Hestness et al 2017 asked this question and concluded "model improvements only shift the error but do not appear to affect the power-law exponent". Are we doomed to chase constant factor improvements? Is there anything we can do to improve the exponent?
    Image
    9.7K
  • Katie Everett
    @_katieeverett
    May 22, 2025
    Replying to @_katieeverett
    For architecture, Kaplan et al 2020 show LSTMs and Transformers have similar exponents on short contexts. Transformers handle long contexts better though.
    Image
    7.1K
  • Katie Everett
    @_katieeverett
    May 22, 2025
    Replying to @_katieeverett
    For optimizers, Hestness et al 2017 compare SGD and Adam: "These power-law exponents are very similar despite the significant optimizer differences -- Adam appears to just shift the learning curve down by ~5% relative." (Let me know if you have other refs about optimizers!)
    Image
    11K
  • Katie Everett
    @_katieeverett
    Aug 7, 2024
    This post from the team at Graphcore has an excellent summary of our Scaling Exponents paper! arxiv.org/abs/2407.05872
    Image
    Graphcore Research
    @GCResearchTeam
    Aug 6, 2024
    Our latest edition of Papers of the Month is now available! This month we give our take on: Scaling Exponents, Million Expert MOE, Vocabulary Scaling Laws and RAG vs Long Contexts 🧵 graphcore-research.github.io/papers-of-the-…
    6.8K
  • Katie Everett
    @_katieeverett
    May 22, 2025
    Replying to @_katieeverett
    Well...that was eight years ago! Let's look at some more recent papers and see how three factors affect the exponent vs the constant:
    * the architecture
    * the optimizer
    * the data
    6.7K
  • Katie Everett
    @_katieeverett
    May 22, 2025
    Replying to @_katieeverett
    Shen et al 2024 compare three linear complexity architectures against Llama as a baseline. Table 3 shows all architectures have very similar exponents:
    Llama = -0.0798
    TNL = -0.0768
    HGRN2 = -0.0753
    cosFormer2 = -0.0756
    Image
    Image
    5.8K
  • Katie Everett
    @_katieeverett
    May 22, 2025
    Replying to @_katieeverett
    Brandfonbrener et al 2024 (train-to-train setting) compares training loss paired by compute for models trained on six different datasets. When kappa = 1, the two datasets have the same exponent. Their kappas are close to 1, but range between 0.88 and 1.13.
    Image
    Image
    4.5K
  • Katie Everett
    @_katieeverett
    Jul 23, 2024
    Come chat with me and @Locchiu at our ICML poster session 1:30-3pm CEST (Vienna time) today at Hall C 4-9 #2500 and see how our theory lets all parameterizations perform hyperparameter transfer! arxiv.org/abs/2407.05872
    Image
    66K
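
The distinction the threads above keep returning to, changing the constant (a) versus changing the exponent (b) in loss = a * flops ^ b + c, can be made concrete with a small sketch. The code below is only an illustration: the coefficient values and the names `flops_to_reach`, `compute_savings`, `BASELINE`, `SMALLER_A` and `STEEPER_B` are hypothetical and do not come from any of the cited papers. It inverts the power law to ask how much compute is needed to reach a given loss, then compares the savings from a better constant with the savings from a better exponent.

```python
def loss(flops, a, b, c):
    """Power law from the thread: loss = a * flops ** b + c, with b negative."""
    return a * flops ** b + c


def flops_to_reach(target_loss, a, b, c):
    """Invert the power law to get the compute needed to reach target_loss."""
    if target_loss <= c:
        raise ValueError("target_loss must be above the irreducible loss c")
    return ((target_loss - c) / a) ** (1.0 / b)


def compute_savings(target_loss, baseline, improved):
    """Ratio of baseline compute to improved compute at the same target loss."""
    return flops_to_reach(target_loss, *baseline) / flops_to_reach(target_loss, *improved)


# Hypothetical (a, b, c) coefficients, chosen only for illustration.
BASELINE = (10.0, -0.08, 1.5)
SMALLER_A = (9.0, -0.08, 1.5)   # better constant, same exponent
STEEPER_B = (10.0, -0.10, 1.5)  # better exponent, same constant

if __name__ == "__main__":
    for target in (3.0, 2.5, 2.0):
        from_a = compute_savings(target, BASELINE, SMALLER_A)
        from_b = compute_savings(target, BASELINE, STEEPER_B)
        print(f"target loss {target}: smaller a saves {from_a:.1f}x, steeper b saves {from_b:.1f}x")
```

Under these assumptions, a smaller a buys the same compute multiplier at every target loss, namely (a_old / a_new) ** (-1 / b), while a steeper b buys a multiplier that keeps growing as the target loss approaches c. That is one way to read why exponent changes matter more than constant-factor changes at scale.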