Problem: We wanted to find the maximum number of tokens we can serve per megawatt.
Solution: Implement energy-efficient LLM inference kernels.
Our Solution:
- Create an eval environment for LLM inference kernel research. It works with any LLM and hardware (we used Qwen 3 on a B300).
- Create an autoresearch harness that optimizes our research environment.
- Train our own RL models on our autoeval solution.
Findings: We (or our agents) found kernels that are 7% more energy efficient for full LLM inference and 30% more efficient on KernelBench.
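
For readers who want to reproduce this kind of comparison, here is a minimal sketch of how an energy-efficiency metric and a relative gain could be computed from power samples. It is not our actual harness. The names `integrate_power`, `EnergyReport`, and `relative_gain` are hypothetical, and the sketch assumes that "tokens per megawatt" means throughput per unit of power (tokens per second per MW), which is the same quantity as tokens per joule scaled by 1e6. The numbers in the usage note are illustrative, not measured results.

```python
from dataclasses import dataclass
from typing import Sequence

WATTS_PER_MEGAWATT = 1e6


def integrate_power(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Trapezoidal integration of power samples (watts) over time (seconds).

    Returns the energy in joules. The samples are assumed to come from some
    power meter or device counter; the source is not specified here.
    """
    if len(timestamps_s) != len(power_w) or len(timestamps_s) < 2:
        raise ValueError("need at least two matching (time, power) samples")
    energy_j = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        energy_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy_j


@dataclass
class EnergyReport:
    """Tokens generated and energy consumed during one benchmark run."""

    tokens: int
    energy_joules: float

    @property
    def tokens_per_joule(self) -> float:
        return self.tokens / self.energy_joules

    @property
    def tokens_per_second_per_mw(self) -> float:
        # tokens/s per MW == tokens per (MW * s) == tokens per megajoule
        return self.tokens_per_joule * WATTS_PER_MEGAWATT


def relative_gain(baseline: EnergyReport, candidate: EnergyReport) -> float:
    """Fractional improvement in tokens per joule of candidate over baseline."""
    return candidate.tokens_per_joule / baseline.tokens_per_joule - 1.0
```

As an illustration with made-up numbers: a baseline run that draws a constant 100 W for 2 s uses 200 J. If it produces 1000 tokens and a candidate kernel produces 1070 tokens for the same energy, `relative_gain` returns roughly 0.07, which is how a figure like "7% more energy efficient" would be expressed under this definition.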