Grad (@Grad62304977) / X

Grad

3,814 posts

@Grad62304977

Joined October 2020

Grad
@Grad62304977
Sep 15, 2025
Replying to @vikhyatk
For well executed reasoning RL I would say:
arxiv.org/abs/2505.22312
arxiv.org/abs/2506.13284
arxiv.org/abs/2508.06471
arxiv.org/abs/2504.13914
arxiv.org/abs/2508.08221
arxiv.org/abs/2505.08311
arxiv.org/abs/2506.13585
github.com/Tencent-Hunyua…
honorable-payment-890.notion.site/POLARIS-A-POst…
arxiv.org
Skywork Open Reasoner 1 Technical Report
The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present...
367K
Grad
@Grad62304977
Oct 21, 2025
Tbh I never really got 10+ year timelines. To me they just mean that we need 1 or more breakthroughs and we just assume a decade is enough to find them
Dwarkesh Patel
@dwarkesh_sp
Oct 17, 2025
The @karpathy interview
0:00:00 – AGI is still a decade away
0:30:33 – LLM cognitive deficits
0:40:53 – RL is terrible
0:50:26 – How do humans learn?
1:07:13 – AGI will blend into 2% GDP growth
1:18:24 – ASI
1:33:38 – Evolution of intelligence & culture
1:43:43 - Why self
271K
Grad
@Grad62304977
Oct 31, 2025
Rosinality
@rosinality
Oct 31, 2025
FP16 can have a smaller training-inference gap compared to BFloat16, thus fits better for RL. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!
128K
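A rough illustration of the precision difference behind the quoted post, assuming the gap comes mainly from rounding error: FP16 keeps 10 explicit mantissa bits and BFloat16 keeps 7. The post itself does not give this explanation, so treat it as one plausible reading. The helper names `round_fp16` and `round_bf16` are hypothetical, and the BFloat16 rounding is emulated by hand (round to nearest even on a float32 bit pattern, ignoring NaN and infinity).

```python
import struct


def round_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_bf16(x):
    """Emulate BFloat16 rounding: keep the top 16 bits of the float32 pattern.

    Assumption: finite inputs only; NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 bits that are about to be dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


p = 0.3  # e.g. a token probability
err_fp16 = abs(round_fp16(p) - p)
err_bf16 = abs(round_bf16(p) - p)
print(f"fp16 error: {err_fp16:.2e}, bf16 error: {err_bf16:.2e}")
```

For values in this range the FP16 rounding error is smaller, which fits the claim that the same policy evaluated by the trainer and by the inference engine disagrees less in FP16. BFloat16 keeps the wider exponent range, which this sketch does not exercise.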
Grad
@Grad62304977
Jun 23, 2025
I don’t think ppl praise OpenAI enough for their openness with o1. Of course not very open, but key details like confirming it’s just one autoregressive model generating a CoT trained with rl were really enough to understand closely how to make an o1 model, and for DeepSeek to go
110K
Grad
@Grad62304977
Jan 20, 2025
No MCTS, no PRM, emergent behaviour, simple rl
55K
Grad
@Grad62304977
Jun 3, 2025
Seems like no one saw this either, scraping arxiv manually seems to be the way. Pretty cool paper on rl for creative writing on Qwen3 32B base, and most interestingly it's one author from the Star Writing Team (haven't heard of them). They seem to have access to the 32B base tho
96K
Grad
@Grad62304977
Nov 6, 2024
43% of the speedup in the new NanoGPT record is due to a variant of value residual learning that I developed. Value residual learning (recently proposed by arxiv.org/abs/2410.17897) allows all blocks in the transformer to access the values computed by the first block. The paper
96K
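A minimal sketch of the idea described in the post above: every block after the first mixes its own attention values with the values computed by the first block. The post is truncated before it describes the exact variant, so the mixing rule here is an assumption: a single scalar weight `lam` (hypothetical) applied as `(1 - lam) * v + lam * v_first`. It uses plain Python lists and one attention head so it runs without dependencies; it is not the NanoGPT record code.

```python
import math


def project(X, W):
    """Multiply each row vector of X (n x d) by the matrix W (d x m)."""
    return [[sum(x[k] * W[k][j] for k in range(len(W)))
             for j in range(len(W[0]))] for x in X]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head causal attention over lists of row vectors."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        w = softmax(scores)
        out.append([sum(w[j] * V[j][c] for j in range(i + 1))
                    for c in range(len(V[0]))])
    return out


def forward(X, blocks, lam=0.5):
    """Run attention-only blocks with a value residual from the first block.

    blocks: list of (Wq, Wk, Wv) weight matrices.
    lam: hypothetical scalar mixing weight (an assumption, not from the post).
    """
    v_first = None
    for Wq, Wk, Wv in blocks:
        Q, K, V = project(X, Wq), project(X, Wk), project(X, Wv)
        if v_first is None:
            v_first = V  # values of the first block, reused by later blocks
        else:
            V = [[(1 - lam) * v + lam * v1 for v, v1 in zip(row, row1)]
                 for row, row1 in zip(V, v_first)]
        A = causal_attention(Q, K, V)
        X = [[x + a for x, a in zip(rx, ra)] for rx, ra in zip(X, A)]
    return X
```

With `lam=0.0` this reduces to an ordinary stack of attention blocks, and with a single block `lam` has no effect, since there are no earlier values to mix in.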
Grad
@Grad62304977
Oct 23, 2025
Jumpscare for any llm RL folk
45K
Grad
@Grad62304977
Oct 30, 2025
Replying to @nrehiew_
Originally a while back I got some intuition with it from the query key value perspective which might help (there's also the gradient descent perspective which is good too). Scraped this from a chat with @stochasticchasm a year ago so might be a bit dodgy. Imagine u want to store
74K
Grad
@Grad62304977
Jul 31, 2025
Am I tripping or is the GRPO shown in the r1 paper using sequence likelihoods like GSPO (without length normalisation, which seems very important so I’m guessing they meant normal GRPO?)
35K
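To make the distinction in the post above concrete, here is a small sketch of the three importance ratios being compared: per-token ratios (the usual GRPO form), a sequence-level ratio without length normalisation, and a length-normalised sequence-level ratio (the GSPO form the post refers to). The function names and the inputs `new_lp` and `old_lp` (per-token log-probabilities under the current and old policy) are hypothetical, chosen only for illustration.

```python
import math


def token_ratios(new_lp, old_lp):
    """Per-token importance ratios pi_new(y_t) / pi_old(y_t)."""
    return [math.exp(n - o) for n, o in zip(new_lp, old_lp)]


def sequence_ratio(new_lp, old_lp, length_normalize=True):
    """Sequence-level importance ratio pi_new(y) / pi_old(y).

    With length_normalize=True the log-ratio is averaged over tokens,
    i.e. the ratio is raised to the power 1 / len(y).
    """
    log_ratio = sum(n - o for n, o in zip(new_lp, old_lp))
    if length_normalize:
        log_ratio /= len(new_lp)
    return math.exp(log_ratio)


new_lp = [-1.0, -2.0, -0.5]
old_lp = [-1.1, -2.1, -0.6]
print(token_ratios(new_lp, old_lp))
print(sequence_ratio(new_lp, old_lp, length_normalize=False))
print(sequence_ratio(new_lp, old_lp, length_normalize=True))
```

The unnormalised sequence ratio is the product of the token ratios, so its deviation from 1 grows with response length, while the normalised version stays on the scale of a single token ratio. That is why the choice of length normalisation matters when picking a clip range.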
Grad
@Grad62304977
Oct 24, 2025
😱😱😱
24K
Grad
@Grad62304977
Sep 15, 2025
Is this the first time some of a frontier model's data is released, as well as how it's curated/generated? Really great work and hopefully we see more of this
Rosinality
@rosinality
Sep 15, 2025
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
Training web agents with data constructed using knowledge graphs (arxiv.org/abs/2507.02592).
38K
Grad
@Grad62304977
Sep 27, 2025
Very interesting how R1-Zero is still far ahead of the final r1 in certain benchmarks like GPQA-Diamond and CNMO. Also a GRPO clip ratio of 10 seems to pretty much confirm that they use a sequence level importance ratio as their formula shows, different to the original GRPO and
29K
Grad
@Grad62304977
Jul 12, 2025
Seen many people mention how kimi K2 for example has no CoT or thinking which isn’t true, more of an issue with terminology
Main difference with reasoning models (in terms of actual functionality) is the thinking is hidden during general non-verifiable rl, so the model can
23K