Grad
Prime Intellect
3,814 posts
@Grad62304977
Joined October 2020
2,761
Following
9,206
Followers
  • Grad
    Prime Intellect
    @Grad62304977
    Sep 15, 2025
    Replying to @vikhyatk
    For well executed reasoning RL I would say: arxiv.org/abs/2505.22312 arxiv.org/abs/2506.13284 arxiv.org/abs/2508.06471 arxiv.org/abs/2504.13914 arxiv.org/abs/2508.08221 arxiv.org/abs/2505.08311 arxiv.org/abs/2506.13585 github.com/Tencent-Hunyua… honorable-payment-890.notion.site/POLARIS-A-POst…
    arxiv.org
    Skywork Open Reasoner 1 Technical Report
    The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present...
    367K
  • Grad
    Prime Intellect
    @Grad62304977
    Oct 21, 2025
    Tbh I never really got 10+ year timelines. To me they just mean that we need 1 or more breakthroughs and we just assume a decade is enough to find them
    Dwarkesh Patel
    @dwarkesh_sp
    Oct 17, 2025
    The @karpathy interview
    0:00:00 – AGI is still a decade away
    0:30:33 – LLM cognitive deficits
    0:40:53 – RL is terrible
    0:50:26 – How do humans learn?
    1:07:13 – AGI will blend into 2% GDP growth
    1:18:24 – ASI
    1:33:38 – Evolution of intelligence & culture
    1:43:43 – Why self
    Image
    271K
  • Grad
    Prime Intellect
    @Grad62304977
    Oct 31, 2025
    Image
    Image
    Rosinality
    @rosinality
    Oct 31, 2025
    FP16 can have a smaller training-inference gap compared to BFloat16, thus fits better for RL. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!
    128K
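A minimal sketch of the precision difference behind the quoted claim, not the experiment it refers to. It assumes only the standard format definitions (FP16 keeps 10 explicit mantissa bits, BFloat16 keeps 7); the helper names `round_to_mantissa` and `rounding_error` are hypothetical, and exponent range, subnormals, and overflow are ignored.

```python
import math

FP16_BITS = 10  # explicit mantissa bits in IEEE half precision
BF16_BITS = 7   # explicit mantissa bits in bfloat16


def round_to_mantissa(x, mantissa_bits):
    """Round x to the nearest value with the given number of explicit mantissa bits.

    Simplification: exponent range, subnormals, and overflow are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)  # +1 accounts for the implicit leading bit
    return math.ldexp(round(m * scale) / scale, e)


def rounding_error(x, mantissa_bits):
    """Absolute error introduced by rounding x to the given mantissa width."""
    return abs(x - round_to_mantissa(x, mantissa_bits))
```

For a value near 1 such as `1.001`, `rounding_error(1.001, FP16_BITS)` is far smaller than `rounding_error(1.001, BF16_BITS)`, which is the kind of gap that can compound when a trainer and an inference engine compute the same token probabilities slightly differently.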
  • Grad
    Prime Intellect
    @Grad62304977
    Jun 23, 2025
    I don’t think ppl praise OpenAI enough for their openness with o1. Of course not very open, but key details like confirming it’s just one autoregressive model generating a CoT trained with rl were really enough to understand closely how to make an o1 model, and for DeepSeek to go
    110K
  • Grad
    Prime Intellect
    @Grad62304977
    Jan 20, 2025
    No MCTS, no PRM, emergent behaviour, simple rl
    Image
    55K
  • Grad
    Prime Intellect
    @Grad62304977
    Jun 3, 2025
    Seems like no one saw this either, scraping arxiv manually seems to be the way. Pretty cool paper on rl for creative writing on Qwen3 32B base, and most interestingly it's one author from the Star Writing Team (haven't heard of them). They seem to have access to the 32B base tho
    Image
    Image
    96K
  • Grad
    Prime Intellect
    @Grad62304977
    Nov 6, 2024
    43% of the speedup in the new NanoGPT record is due to a variant of value residual learning that I developed. Value residual learning (recently proposed by arxiv.org/abs/2410.17897) allows all blocks in the transformer to access the values computed by the first block. The paper
    Image
    Image
    96K
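A rough sketch of the idea described in the post above: later blocks can access the values computed by the first block. The post is truncated, so the specific variant it mentions is not shown here. The fixed blend weight `lam` and the function names `mix_values` and `run_blocks` are assumptions made for illustration only.

```python
def mix_values(v_block, v_first, lam=0.5):
    """Blend one block's value vectors with the first block's.

    `lam` is a hypothetical fixed weight; a real model would likely learn it.
    """
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_block, row_first)]
        for row_block, row_first in zip(v_block, v_first)
    ]


def run_blocks(per_block_values, lam=0.5):
    """Return the value vectors each block's attention would use."""
    v_first = per_block_values[0]
    used = [v_first]  # the first block uses its own values unchanged
    for v_block in per_block_values[1:]:
        used.append(mix_values(v_block, v_first, lam))
    return used
```

Here `per_block_values` is a list with one entry per block, each holding that block's value vectors as nested lists. Every block after the first sees a mixture that includes the first block's values.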
  • Grad
    Prime Intellect
    @Grad62304977
    Oct 23, 2025
    Jumpscare for any llm RL folk
    Image
    45K
  • Grad
    Prime Intellect
    @Grad62304977
    Oct 30, 2025
    Replying to @nrehiew_
    Originally a while back i got some intuition with it from the query key value perspective which might help (theres also the gradient descent perspective which is good too). Scraped this from a chat with @stochasticchasm a year ago so might be a bit dodgy. Imagine u want to store
    74K
  • Grad
    Prime Intellect
    @Grad62304977
    Jul 31, 2025
    Am I tripping or is the GRPO shown in the r1 paper using sequence likelihoods like GSPO (without length normalisation, which seems very important so I’m guessing they meant normal GRPO?)
    Image
    35K
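To make the distinction in the post above concrete, here is a small sketch of the three importance ratios it contrasts: per-token ratios (the usual GRPO form), a raw sequence-likelihood ratio, and a length-normalised sequence ratio in the style of GSPO. The function names and the log-probability inputs are illustrative assumptions, not code from either paper.

```python
import math


def token_ratios(new_logps, old_logps):
    """Per-token importance ratios: one ratio for each generated token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_ratio(new_logps, old_logps, length_normalize=False):
    """Single ratio for the whole sequence.

    With length_normalize=True the summed log-ratio is divided by the
    sequence length before exponentiating (geometric mean of token ratios).
    """
    diff = sum(n - o for n, o in zip(new_logps, old_logps))
    if length_normalize:
        diff /= len(new_logps)
    return math.exp(diff)
```

Without normalisation the sequence ratio is the product of the token ratios, so it drifts further from 1 as sequences get longer. That is one reason the length-normalised form is considered important.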
  • Grad
    Prime Intellect
    @Grad62304977
    Oct 24, 2025
    😱😱😱
    Image
    24K
  • Grad
    Prime Intellect
    @Grad62304977
    Sep 15, 2025
    Is this the first time some of a frontier model's data is released, as well as how it's curated/generated? Really great work and hopefully we see more of this
    Image
    Image
    Image
    Image
    Rosinality
    @rosinality
    Sep 15, 2025
    DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
    Training web agents with data constructed using knowledge graphs (arxiv.org/abs/2507.02592).
    38K
  • Grad
    Prime Intellect
    @Grad62304977
    Sep 27, 2025
    Very interesting how R1-Zero is still far ahead of the final r1 in certain benchmarks like GPQA-Diamond and CNMO. Also a GRPO clip ratio of 10 seems to pretty much confirm that they use a sequence level importance ratio as their formula shows, different to the original GRPO and
    Image
    Image
    29K
  • Grad
    Prime Intellect
    @Grad62304977
    Jul 12, 2025
    Seen many people mention how kimi K2 for example has no CoT or thinking which isn’t true, more of an issue with terminology. Main difference with reasoning models (in terms of actual functionality) is the thinking is hidden during general non-verifiable rl, so the model can
    23K
