1. We often observe power laws between loss and compute: loss = a * flops ^ b + c
2. Models are rapidly becoming more efficient, i.e. they use less compute to reach the same loss
But: which innovations actually change the exponent of the power law (b), and which change only the constant (a)?
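
To make the distinction concrete, here is a minimal sketch that inverts loss = a * flops ^ b + c and compares two hypothetical improvements: one that only shrinks a and one that only steepens b. The parameter values and the names `flops_to_reach` and `efficiency_gain` are illustrative assumptions, not fitted to any real model. It assumes a > 0, b < 0, and a target loss above c.

```python
def flops_to_reach(loss, a, b, c):
    """Invert loss = a * flops**b + c for flops (assumes a > 0, b < 0, loss > c)."""
    return ((loss - c) / a) ** (1.0 / b)


def efficiency_gain(loss, base, variant):
    """How many times less compute the variant needs to reach the same loss."""
    return flops_to_reach(loss, *base) / flops_to_reach(loss, *variant)


# Hypothetical (a, b, c) triples, chosen only for illustration.
base = (10.0, -0.05, 1.5)
smaller_a = (9.0, -0.05, 1.5)   # innovation that changes only the constant a
steeper_b = (10.0, -0.06, 1.5)  # innovation that changes the exponent b

for target in (3.0, 2.5, 2.0):
    gain_a = efficiency_gain(target, base, smaller_a)
    gain_b = efficiency_gain(target, base, steeper_b)
    print(f"loss={target}: gain from a = {gain_a:.1f}x, gain from b = {gain_b:.1f}x")
```

Under these assumptions, a change in a gives the same compute multiplier, (a'/a) ^ (1/b), at every target loss, while a change in b gives a multiplier that keeps growing as the target loss approaches c.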