Zyphra is proud to release Tree Attention, a fast inference method for extremely large sequence lengths
• 8x faster inference speed vs. Ring Attention
• 2x less peak memory
• low data communication volumes
Paper: arxiv.org/abs/2408.04093
Code: github.com/Zyphra/tree_at…
A 🧵
Then, did you happen to recall that, thanks to automatic differentiation (timvieira.github.io/blog/post/2016…), computing the gradient of a function costs roughly the same (up to a small constant factor) as computing the function itself?
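A small illustration of that cost claim, using logsumexp as the example function (my choice for the sketch; the tweet does not name one). The gradient of logsumexp is softmax, and the hand-written `grad_logsumexp` below stands in for what reverse-mode autodiff would produce: one extra O(n) pass. The finite-difference version, by contrast, needs 2n function evaluations. All function names here are hypothetical.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) in one O(n) pass."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def grad_logsumexp(xs):
    """Gradient of logsumexp, which is softmax(xs). Also O(n),
    the same order of work as evaluating the function itself."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def finite_diff_grad(f, xs, eps=1e-6):
    """Central finite differences: 2n evaluations of f, so O(n^2) here.
    Only used as an independent check on the analytic gradient."""
    grads = []
    for i in range(len(xs)):
        up = list(xs)
        dn = list(xs)
        up[i] += eps
        dn[i] -= eps
        grads.append((f(up) - f(dn)) / (2 * eps))
    return grads

xs = [0.3, -1.2, 2.0, 0.7]
analytic = grad_logsumexp(xs)
numeric = finite_diff_grad(logsumexp, xs)
```

The two gradients agree to within finite-difference error, but only the first one scales like the function evaluation.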
Finally managed to retire early from my professional physics research career (I know, I know, in time it would have retired me). Eagerly looking forward to going full crackpot as an amateur.
Consequently, did you realize that the efficient tree reductions here can be done on a DGX cluster via NCCL Allreduce in a topology-aware, efficiently overlapped manner?
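A rough sketch of why a tree reduction works here, simulated in plain Python for a single query rather than on real GPUs. It is not the code from the repo: the function names are hypothetical, and the list of shards stands in for devices. Each shard summarizes its slice of keys and values as (max score, sum of exponentials, exp-weighted value sum). Combining two summaries is associative, which is the property an Allreduce-style tree reduction relies on.

```python
import math

def local_partial(q, keys, values):
    """Per-shard summary for one query:
    (max score, sum of exp(score - max), exp-weighted sum of values)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    numer = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return m, denom, numer

def combine(a, b):
    """Associative merge of two shard summaries, rescaled to a shared max."""
    m_a, d_a, n_a = a
    m_b, d_b, n_b = b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, d_a * s_a + d_b * s_b, [x * s_a + y * s_b for x, y in zip(n_a, n_b)]

def tree_reduce(parts):
    """Pairwise reduction: about log2(len(parts)) rounds instead of a linear chain."""
    while len(parts) > 1:
        merged = [combine(parts[i], parts[i + 1]) for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:
            merged.append(parts[-1])  # odd one out carries over to the next round
        parts = merged
    return parts[0]

def tree_attention(q, keys, values, num_shards):
    """Split keys/values into contiguous shards, summarize each, tree-reduce."""
    size = math.ceil(len(keys) / num_shards)
    parts = [local_partial(q, keys[i:i + size], values[i:i + size])
             for i in range(0, len(keys), size)]
    _, denom, numer = tree_reduce(parts)
    return [x / denom for x in numer]

def reference_attention(q, keys, values):
    """Plain single-device softmax attention for one query, for comparison."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]
```

On a real cluster the `combine` step would be carried out by the communication library across devices; here it only demonstrates that the sharded result matches ordinary softmax attention regardless of how the sequence is split.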
OK that was a fun 15 minutes of twitter fame. Now for the downfall - the tweet at the top's got an embarrassing typo, bonus points for whoever catches it
Thanks for sharing! Another little trick that might amuse you is that we identified a function which upon minimization produces the forward pass of the attention block: