Sainbayar Sukhbaatar
1,446 posts
Sainbayar Sukhbaatar
@tesatory
Memory Networks, Asymmetric Self-Play, CommNet, Adaptive-Span, System2Attention, Feedback Transformer, Multi-Token Attention
Joined May 2010
341 Following
3,233 Followers
  • Sainbayar Sukhbaatar
    @tesatory
    Apr 12, 2025
    Ten years ago in 2015 we published a paper called End-to-End Memory Networks (arxiv.org/abs/1503.08895). Looking back, this paper had many of the ingredients of current LLMs. Our model was the first language model that completely replaced RNN with attention. It had dot-product
    Andrej Karpathy
    @karpathy
    Dec 3, 2024
    The (true) story of development and inspiration behind the "attention" operator, the one in "Attention is All you Need" that introduced the Transformer. From personal email correspondence with the author @DBahdanau ~2 years ago, published here and now (with permission) following
  • Sainbayar Sukhbaatar
    @tesatory
    Jan 26, 2021
    We updated our Feedback Transformer paper with new experiments. Transformers fail on very simple algorithmic tasks because they are feedforward models. A simple fix is to attend to higher-level representations (it's like remembering our past thoughts) arxiv.org/abs/2002.09402
  • Sainbayar Sukhbaatar
    @tesatory
    Apr 3, 2019
    My doctoral dissertation won an award 😊
  • Sainbayar Sukhbaatar
    @tesatory
    Jul 10, 2019
    We released our code for adaptive-span! It can train a Transformer with a context size of 8k tokens github.com/facebookresear… #ACL2019
    GitHub - facebookresearch/adaptive-span: Transformer training code for sequential tasks
    From github.com
  • Sainbayar Sukhbaatar
    @tesatory
    May 17, 2021
    Our new work on "forgetting" got into ICML (long talk)! TLDR: compute a "date" for each memory, and gradually forget it when it's that date. We can see it learns to remember names (unlike me) ai.facebook.com/research/publi…
  • Sainbayar Sukhbaatar
    @tesatory
    Jul 20, 2021
    Just gave a talk at ICML from my home country Mongolia. Doing things remotely is amazing!
  • Sainbayar Sukhbaatar
    @tesatory
    May 5, 2025
    3 papers accepted to #ICLR2025 🎉 1. Thinking LLMs, which trains LLMs to think before answering on non-verifiable tasks. It came out before R1 and uses DPO instead of GRPO. It also doesn't use any external CoT data (arxiv.org/abs/2410.10630)
  • Sainbayar Sukhbaatar
    @tesatory
    Jul 2, 2020
    When it comes to bringing home our citizens stranded abroad, it's impossible. Whether they sleep outside or starve to death doesn't matter. But when it comes to Naadam and fountains, it's possible, and billions upon billions get thrown around. Is this humane?
  • Sainbayar Sukhbaatar
    @tesatory
    Oct 15, 2024
    Thinking is an integral part of general intelligence, not just for solving math problems. We show that you can train your very own Thinking LLM easily, without human data.
    Jason Weston
    @jaseweston
    Oct 15, 2024
    🚨 New work: Thinking LLMs! 🚨 - Introduces Thought Preference Optimization (TPO) - Trains LLMs to think & respond for *all* instruction following tasks, not just math - Gives gains on AlpacaEval (beats GPT-4 & Llama3-70b) & ArenaHard with an 8B model arxiv.org/abs/2410.10630 🧵1/4
  • Sainbayar Sukhbaatar
    @tesatory
    Sep 17, 2024
    We have released our code for Contextual Position Encoding (CoPE) so you can try it out. Thanks @lanjanice @OlgaNLP for making it happen! github.com/facebookresear…
    Jason Weston
    @jaseweston
    May 30, 2024
    🚨 Contextual Position Encoding (CoPE) 🚨 Context matters! CoPE is a new positional encoding method for transformers that takes into account *context*. - Can "count" distances per head dependent on need, e.g. i-th sentence or paragraph, words, verbs, etc. Not just tokens. -
  • Sainbayar Sukhbaatar
    @tesatory
    Feb 23, 2024
    🎉 New paper 🎉 We teach Transformers to do A* search (I had to relearn how A* works). Then we were curious to see if the model could self-improve, and it did surprisingly well. This direction of search, plan, self-improve is very exciting!
    AK
    @_akhaliq
    Feb 23, 2024
    Meta presents "Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping". While Transformers have enabled tremendous progress in various application settings, such architectures still lag behind traditional symbolic planners for solving complex decision making
  • Sainbayar Sukhbaatar
    @tesatory
    Nov 21, 2023
    Attention‼️ Well, actually System 2 Attention. Answers from LLMs tend to be affected by their context, even when it's irrelevant. We propose a more deliberate attention mechanism to solve this issue.
    Jason Weston
    @jaseweston
    Nov 21, 2023
    🚨 New paper! 🚨 We introduce System 2 Attention (S2A). - Soft attention in Transformers is susceptible to irrelevant/biased info - S2A uses LLM reasoning to generate what to attend to. Improves factuality & objectivity, decreases sycophancy. arxiv.org/abs/2311.11829 🧵(1/5)
  • Sainbayar Sukhbaatar
    @tesatory
    Jun 30, 2025
    We find that semi-online DPO works as well as GRPO!
    Jason Weston
    @jaseweston
    Jun 30, 2025
    🌉 Bridging Offline & Online RL for LLMs 🌉 📝: arxiv.org/abs/2506.21495 New paper shows on verifiable & non-verifiable tasks: - Online DPO & GRPO give similar performance. - Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also. - Offline DPO
  • Sainbayar Sukhbaatar
    @tesatory
    Aug 24, 2019
    A blog post about our two recent papers on transformer networks is out! Of course with better graphics.
    AI at Meta
    @AIatMeta
    Aug 23, 2019
    Facebook AI researchers are sharing an all-attention layer to simplify the Transformer model and an adaptive attention span method to make it more efficient. Even with a much simpler architecture, these methods match or improve state-of-the-art results. ai.facebook.com/blog/making-tr…
