Tim Dettmers
3,840 posts
Tim Dettmers
@Tim_Dettmers
Creator of bitsandbytes. Professor @CarnegieMellon and Research Scientist @allen_ai. I blog about deep learning and PhD life at timdettmers.com.
Pittsburgh, PA
timdettmers.com/about
Joined October 2012
904 Following · 45.7K Followers
  • Pinned
    Tim Dettmers
    @Tim_Dettmers
    Jul 30, 2024
    After 7 months on the job market, I am happy to announce:
    - I joined @allen_ai
    - Professor at @CarnegieMellon from Fall 2025
    - New bitsandbytes maintainer @Titus_vK
    My main focus will be to strengthen open-source for real-world problems and bring the best AI to laptops 🧵
    258K
  • Tim Dettmers
    @Tim_Dettmers
    May 24, 2023
    QLoRA: 4-bit finetuning of LLMs is here! With it comes Guanaco, a chatbot on a single GPU, achieving 99% ChatGPT performance on the Vicuna benchmark:
    Paper: arxiv.org/abs/2305.14314
    Code+Demo: github.com/artidoro/qlora
    Samples: colab.research.google.com/drive/1kK6xasH…
    Colab: colab.research.google.com/drive/17XEqL1J…
    1.6M
  • Tim Dettmers
    @Tim_Dettmers
    Nov 12, 2024
    This is the most important paper in a long time. It shows with strong evidence we are reaching the limits of quantization. The paper says this: the more tokens you train on, the more precision you need. This has broad implications for the entire field and the future of GPUs 🧵
    Tanishq Kumar
    @tanishqkumar07
    Nov 11, 2024
    [1/7] New paper alert! Heard about the BitNet hype or that Llama-3 is harder to quantize? Our new work studies both! We formulate scaling laws for precision, across both pre and post-training arxiv.org/pdf/2411.04330. TLDR; - Models become harder to post-train quantize as they
    697K
  • Tim Dettmers
    @Tim_Dettmers
    Oct 8, 2021
    I am excited to share my latest work: 8-bit optimizers – a replacement for regular optimizers. Faster 🚀, 75% less memory 🪶, same performance 📈, no hyperparam tuning needed 🔢. 🧵/n
    Paper: arxiv.org/abs/2110.02861
    Library: github.com/facebookresear…
    Video: youtube.com/watch?v=IxrlHA…
  • Tim Dettmers
    @Tim_Dettmers
    Sep 8, 2025
    It feels like the coding agent frontier is now open-weights:
    GLM 4.5 costs only $3/month and is on par with Sonnet.
    Kimi K2.1 Turbo is 3x speed, 7x cheaper vs Opus 4.1, but as good.
    Kimi K2.1 feels clean. The best model for me. GPT-5 is only good for complicated specs -- too slow.
    243K
  • Tim Dettmers
    @Tim_Dettmers
    May 6, 2023
    Replying to @karpathy
    Super excited to push this even further:
    - Next week: bitsandbytes 4-bit closed beta that allows you to finetune 30B/65B LLaMA models on a single 24/48 GB GPU (no degradation vs full fine-tuning in 16-bit)
    - Two weeks: Full release of code, paper, and a collection of 65B models
    366K
  • Tim Dettmers
    @Tim_Dettmers
    Jun 6, 2023
    We present SpQR, which allows lossless LLM inference at 4.75 bits with a 15% speedup. You can run a 33B LLM on a single 24GB GPU fully lossless. SpQR works by isolating sensitive weights with higher precision and roughly doubles improvements from GPTQ: arxiv.org/abs/2306.03078 🧵
    246K
  • Tim Dettmers
    @Tim_Dettmers
    Aug 17, 2022
    We release LLM.int8(), the first 8-bit inference method that saves 2x memory and does not degrade performance for 175B models by exploiting emergent properties. Read More:
    Paper: arxiv.org/abs/2208.07339
    Software: huggingface.co/blog/hf-bitsan…
    Emergence: timdettmers.com/2022/08/17/llm…
  • Tim Dettmers
    @Tim_Dettmers
    Apr 8, 2020
    How can you successfully train transformers on small datasets like PTB and WikiText-2? Are LSTMs better on small datasets? I ran 339 experiments worth 568 GPU hours and came up with some answers. I do not have time to write a blog post, so here is a Twitter thread instead. 1/n
  • Tim Dettmers
    @Tim_Dettmers
    Dec 26, 2024
    Reading the report, this is such clean engineering under resource constraints. The DeepSeek team directly engineered solutions to known problems under hardware constraints. All of this looks so elegant -- no fancy "academic" solutions, just pure, solid engineering. Respect 👏
    DeepSeek
    @deepseek_ai
    Dec 26, 2024
    🚀 Introducing DeepSeek-V3! Biggest leap forward yet: ⚡ 60 tokens/second (3x faster than V2!) 💪 Enhanced capabilities 🛠 API compatibility intact 🌍 Fully open-source models & papers 🐋 1/n
    90K
  • Tim Dettmers
    @Tim_Dettmers
    Aug 10, 2022
    We release the public beta for bnb-int8🟪 for all @huggingface 🤗models, which allows for Int8 inference without performance degradation up to scales of 176B params 📈. You can run OPT-175B/BLOOM-176B easily on a single machine 🖥️. You can try it here: docs.google.com/document/d/1Jx… 1/n
  • Tim Dettmers
    @Tim_Dettmers
    Sep 7, 2020
    Updated GPU recommendations for the new Ampere RTX 30 series are live! Performance benchmarks, architecture details, Q&A of frequently asked questions, and detailed explanations of how GPUs and Tensor Cores work for those that want to learn more: timdettmers.com/2020/09/07/whi…
  • Tim Dettmers
    @Tim_Dettmers
    Sep 12, 2025
    I should really write a blog post about how attention sinks relate to outliers and information processing in transformers. Almost all data is out there in papers, and if you pull things together it is easier to understand what is going on
    tensorqt
    @tensorqt
    Aug 24, 2025
    attention sinks may be a bias in causal transformers. as some of you know, i've been writing a long blogpost on attention and its properties as a message-passing operation on graphs. while doing so, i figured i might have found an explanation for which attention sinks may be an
    76K
  • Tim Dettmers
    @Tim_Dettmers
    Jan 16, 2023
    In the RTX 40 post, I introduce a GPU recommendation chart and discuss the new Tensor Memory Accelerator (TMA) and FP8 computation. Overall, RTX 40s are faster for inference and shine through their FP8 performance but are inefficient for 16-bit training. timdettmers.com/2023/01/16/whi…
    222K