Replying to @vikhyatk
- Tbh I never really got 10+ year timelines. To me they just mean that we need 1 or more breakthroughs and we just assume a decade is enough to find them.
  Quoted post: The @karpathy interview 0:00:00 – AGI is still a decade away 0:30:33 – LLM cognitive deficits 0:40:53 – RL is terrible 0:50:26 – How do humans learn? 1:07:13 – AGI will blend into 2% GDP growth 1:18:24 – ASI 1:33:38 – Evolution of intelligence & culture 1:43:43 - Why self
- FP16 can have a smaller training-inference gap than BFloat16, so it fits RL better. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!
- I don’t think ppl praise OpenAI enough for their openness with o1. Of course not very open, but key details like confirming it’s just one autoregressive model generating a CoT trained with rl were really enough to understand closely how to make an o1 model, and for DeepSeek to go
- Seems like no one saw this either, scraping arxiv manually seems to be the way. Pretty cool paper on rl for creative writing on Qwen3 32B base, and most interestingly it's one author from the Star Writing Team (haven't heard of them). They seem to have access to the 32B base tho
- 43% of the speedup in the new NanoGPT record is due to a variant of value residual learning that I developed. Value residual learning (recently proposed by arxiv.org/abs/2410.17897) allows all blocks in the transformer to access the values computed by the first block. The paper
- Replying to @nrehiew_: Originally a while back I got some intuition with it from the query key value perspective which might help (there's also the gradient descent perspective which is good too). Scraped this from a chat with @stochasticchasm a year ago so might be a bit dodgy. Imagine u want to store
- Am I tripping or is the GRPO shown in the r1 paper using sequence likelihoods like GSPO (without length normalisation, which seems very important so I’m guessing they meant normal GRPO?)
- Is this the first time some of a frontier model's data is released, as well as how it's curated/generated? Really great work and hopefully we see more of this.
  Quoted post: DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL. Training web agents with data constructed using knowledge graphs (arxiv.org/abs/2507.02592).
- Very interesting how R1-Zero is still far ahead of the final r1 in certain benchmarks like GPQA-Diamond and CNMO. Also a GRPO clip ratio of 10 seems to pretty much confirm that they use a sequence level importance ratio as their formula shows, different to the original GRPO and
- Seen many people mention how Kimi K2 for example has no CoT or thinking, which isn’t true, more of an issue with terminology. The main difference with reasoning models (in terms of actual functionality) is the thinking is hidden during general non-verifiable RL, so the model can
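
For the GRPO/GSPO posts above, a rough Python sketch of the two kinds of importance ratio might help. It assumes per-token log-probabilities under the current and old policy are already available. The function names and the toy numbers are made up for illustration and are not taken from any of the papers mentioned.

```python
import math

def token_ratios(new_logps, old_logps):
    """Token-level importance ratios (original GRPO style): one ratio per token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_ratio(new_logps, old_logps, length_normalize):
    """Sequence-level importance ratio: a single ratio for the whole response.

    With length_normalize=True this is the geometric mean of the token ratios
    (GSPO style). With False it is the raw ratio of sequence likelihoods.
    """
    total = sum(n - o for n, o in zip(new_logps, old_logps))
    if length_normalize:
        total /= len(new_logps)
    return math.exp(total)

# Toy response (hypothetical numbers): 200 tokens, each slightly more
# likely under the new policy than under the old one.
old = [-1.00] * 200
new = [-0.99] * 200

tok = token_ratios(new, old)
seq_raw = sequence_ratio(new, old, length_normalize=False)
seq_norm = sequence_ratio(new, old, length_normalize=True)

print(f"token-level ratio (each token): {tok[0]:.4f}")
print(f"sequence ratio, no length norm: {seq_raw:.4f}")
print(f"sequence ratio, length norm:    {seq_norm:.4f}")
```

With 200 tokens each shifted by 0.01 in log-probability, the token-level and length-normalised ratios stay near 1.01 while the unnormalised sequence ratio is about 7.39, which is one way to see why a clip range tuned for one form would not transfer to the other.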




















