vLLM
1,062 posts
vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join slack.vllm.ai to discuss together with the community!
vllm.ai
Joined March 2024
36
Following
41.9K
Followers

  • vLLM
    @vllm_project
    Oct 20, 2025
    🚀 DeepSeek-OCR, the new frontier of OCR from @deepseek_ai, exploring optical context compression for LLMs, is running blazingly fast on vLLM ⚡ (~2500 tokens/s on A100-40G), powered by vllm==0.8.5 for day-0 model support. 🧠 Compresses visual contexts up to 20× while keeping
    1.5M
  • vLLM
    @vllm_project
    Apr 14, 2025
    🙏 @deepseek_ai's highly performant inference engine is built on top of vLLM. Now they are open-sourcing the engine the right way: instead of a separate repo, they are bringing changes to the open source community so everyone can immediately benefit!
    github.com
    open-infra-index/OpenSourcing_DeepSeek_Inference_Engine/README.md at main · deepseek-ai/open-infr...
    Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation - deepseek-ai/open-infra-index
    203K
  • vLLM
    @vllm_project
    Nov 3, 2025
    Wow excited to see PewDiePie using vLLM to serve language models locally 😃 vLLM brings easy, fast, and cheap LLM serving for everyone 🥰
    Yuchen Jin
    @Yuchenj_UW
    Oct 31, 2025
    PewDiePie in 2025: – built a 10×4090 rig – runs Llama 70B, gpt-oss-120B & Qwen 245B locally via vLLM – built a custom web UI (chat, RAG, search, TTS) – ran protein-folding simulations for charity – created an AI “council”, a swarm of 64 models – now fine-tuning his own model
    164K
  • vLLM
    @vllm_project
    Sep 18, 2025
    Congrats to @deepseek_ai! DeepSeek-R1 was published in Nature yesterday as the cover article, and vLLM is proud to have supported its RL training and inference 🥰
    213K
  • vLLM
    @vllm_project
    Aug 17, 2025
    🚀 Amazing community project! vLLM CLI, a command-line tool for serving LLMs with vLLM: ✅ Interactive menu-driven UI & scripting-friendly CLI ✅ Local + HuggingFace Hub model management ✅ Config profiles for perf/memory tuning ✅ Real-time server & GPU monitoring ✅ Error
    71K
  • vLLM
    @vllm_project
    Feb 21, 2025
    We're excited to receive our first #NVIDIADGX B200 system which we'll use for vLLM research and development! Thank you @nvidia!
    119K
  • vLLM
    @vllm_project
    Oct 16, 2025
    Announcing the completely reimagined vLLM TPU! In collaboration with @Google, we've launched a new high-performance TPU backend unifying @PyTorch and JAX under a single lowering path for amazing performance and flexibility. 🚀 What's New? - JAX + PyTorch: Run PyTorch models on
    157K
  • vLLM
    @vllm_project
    Apr 17, 2025
    vLLM 🤝 🤗! You can now deploy any @huggingface language model with vLLM's speed. This integration makes it possible for one consistent implementation of the model in HF for both training and inference. 🧵
    Transformers modeling backend integration in vLLM
    From vllm.ai
    73K
  • vLLM
    @vllm_project
    Feb 1, 2025
    We landed the first batch of enhancements to the @deepseek_ai models, starting with MLA and CUTLASS FP8 kernels. Compared to v0.7.0, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism.
    90K
  • vLLM
    @vllm_project
    Sep 29, 2025
    How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-head Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA). It scores incoming queries and selects the top-2048 tokens to pass to Sparse MLA.
    103K
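The two-stage flow described in the DSA post (a cheap indexer scores cached tokens, then only the best-scoring ones go on to sparse attention) can be sketched in a few lines. This is only an illustration under stated assumptions: the dot-product scoring, the function names `indexer_scores` and `select_top_k`, and the tiny 2-dimensional keys are all hypothetical, and the post does not describe how the Lightning Indexer actually computes its scores.

```python
import heapq

def indexer_scores(index_query, index_keys):
    """Score every cached token against the query.

    Assumption: a plain dot product stands in for the indexer's real
    scoring function, which the post does not specify.
    """
    return [sum(q * k for q, k in zip(index_query, key)) for key in index_keys]

def select_top_k(index_query, index_keys, k):
    """Return the positions of the k highest-scoring cached tokens, in order."""
    scores = indexer_scores(index_query, index_keys)
    top = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return sorted(top)

# Toy indexer key cache: 4 cached tokens with 2-dim keys.
# (The post mentions 128 per token for the indexer and a top-2048 selection.)
index_keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
index_query = [1.0, 0.0]

selected = select_top_k(index_query, index_keys, k=2)
print(selected)  # [0, 2]
```

Only the selected positions would then be attended over by the sparse attention step, so the expensive attention cost depends on k rather than on the full context length.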
  • vLLM
    @vllm_project
    Sep 28, 2025
    🚀 New in vLLM: dots.ocr 🔥 A powerful multilingual OCR model from @xiaohongshu hi lab is now officially supported in vLLM! 📝 Single end-to-end parser for text, tables (HTML), formulas (LaTeX), and layouts (Markdown) 🌍 Supports 100 languages with robust performance on
    merve
    @mervenoyann
    Aug 5, 2025
    we're all sleeping on this OCR model 🔥 dots.ocr is a new 3B model with sota performance, support for 100 languages & allowing commercial use! 🤯 single e2e model to extract image, convert tables, formula, and more into markdown 📝
    69K
  • vLLM
    @vllm_project
    Oct 22, 2025
    it’s tokenization again! 🤯 did you know tokenize(detokenize(token_ids)) ≠ token_ids? RL researchers from Agent Lightning coined the term Retokenization Drift: a subtle mismatch between what your model generated and what your trainer thinks it generated. why? because most
    170K
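The round-trip mismatch in the tokenization post is easy to reproduce with a toy tokenizer. The sketch below is a minimal illustration, not any real tokenizer: the four-entry vocabulary and the greedy longest-match rule are assumptions chosen so that the same string has two valid token sequences. The post is truncated before it gives its own explanation, so this shows only one way such drift can arise.

```python
# Hypothetical toy vocabulary: "ab" exists as a merged token next to "a" and "b".
VOCAB = {0: "a", 1: "b", 2: "ab", 3: "c"}
TOKEN_TO_ID = {text: token_id for token_id, text in VOCAB.items()}
MAX_TOKEN_LEN = max(len(text) for text in TOKEN_TO_ID)

def detokenize(token_ids):
    """Concatenate the text of each token id."""
    return "".join(VOCAB[token_id] for token_id in token_ids)

def tokenize(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    token_ids = []
    pos = 0
    while pos < len(text):
        for length in range(min(MAX_TOKEN_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in TOKEN_TO_ID:
                token_ids.append(TOKEN_TO_ID[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return token_ids

# Suppose the model sampled "a", "b", "c" as three separate tokens.
generated_ids = [0, 1, 3]
retokenized_ids = tokenize(detokenize(generated_ids))

print(generated_ids)    # [0, 1, 3]
print(retokenized_ids)  # [2, 3]
```

Both sequences decode to the same text "abc", yet a trainer that re-tokenizes the text would compute log-probabilities for `[2, 3]` rather than for the `[0, 1, 3]` the model actually sampled. Passing the generated token ids through to the trainer, instead of the decoded text, avoids the round trip entirely.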
  • vLLM
    @vllm_project
    Jan 27, 2025
    🚀 With the v0.7.0 release today, we are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more.
    95K
  • vLLM
    @vllm_project
    Sep 9, 2025
    The amazing blog post from @gordic_aleksa is now live on vLLM's blog at blog.vllm.ai/2025/09/05/ana… (after more proofreading and clarifications)! Looking forward to a future series of technical deep-dive blog posts.
    Aleksa Gordić (水平问题)
    @gordic_aleksa
    Sep 1, 2025
    New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in-depth explanation of how LLM inference engines and vLLM in particular work! Took me a while to get this level of understanding of the codebase and then to write up
    Inside vLLM: Anatomy of a High-Throughput LLM Inference System
    From vllm.ai
    47K