Michael Goin
1,136 posts
Michael Goin
@mgoin_
maintainer @vllm_project, inference perf @RedHat_AI (acq @neuralmagic), friends call me misha
Boston
github.com/mgoin
Joined April 2022
463
Following
1,664
Followers
  • Pinned
    Michael Goin
    @mgoin_
    Oct 10, 2025
    Happy that InferenceMAX is here because it signals a milestone for vLLM's SOTA performance on NVIDIA Blackwell! 🥳 It has been a pleasure to deeply collaborate with @nvidia in @vllm_project, and we have much more to do. Read about the work we did here:
    Dylan Patel
    SemiAnalysis
    @dylan522p
    Oct 9, 2025
    Today we are launching InferenceMAX! We have support from Nvidia, AMD, OpenAI, Microsoft, Pytorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, Dell. It runs every day on the latest software (vLLM, SGLang, etc) across hundreds of GPUs, $10Ms of
    SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference
    From vllm.ai
    29K
  • Michael Goin
    @mgoin_
    Oct 12, 2023
    Exciting news from our latest LLM compression research! 🚀 Together with @ISTAustria and @neuralmagic, we’ve been exploring sparse finetuning for LLMs and achieved 7.7 tokens/second on a single core and 26.7 tokens/second on 4 cores of an AMD Ryzen CPU! (1/n)
    22K
  • Michael Goin
    @mgoin_
    Dec 12, 2023
    I met my match with @abhi_venigalla at #NeurIPS2023. At least we’re both rich! 🤑
    5K
  • Michael Goin
    @mgoin_
    Jun 14, 2024
    Excited that FP8 in vLLM is getting better and better as we spend time on it. Check out our collection of accurate, pre-quantized checkpoints!
    vLLM
    @vllm_project
    Jun 14, 2024
    Replying to @vllm_project
    FP8 inference lowers latency, increases throughput, and reduces memory usage with very little accuracy drop. docs.vllm.ai/en/stable/quan…
    FP8 LLMs for vLLM - a neuralmagic Collection
    From huggingface.co
    7K
  • Michael Goin
    @mgoin_
    Sep 4, 2024
    My first collaboration on diffusion models - VQDM! yandex-research.github.io/vqdm/ With the right kind of quantization, SDXL and SDXL-Turbo compressed to 3 bit weights can offer the same quality as previous 4 bit methods. And, of course, 4 bit quality is better with VQDM🙂
    AK
    @_akhaliq
    Sep 4, 2024
    Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization paper page: huggingface.co/papers/2409.00… Text-to-image diffusion models have emerged as a powerful framework for high-quality image generation given textual prompts. Their success has driven the rapid
    yandex-research.github.io
    Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization
    This work introduces Vector Quantization for text-to-image Diffusion Models (VQDM), that provides highly accurate 3-4 bit compression of diffusion models with 2B+ parameters (SDXL and SDXL-Turbo).
    10K
  • Michael Goin
    @mgoin_
    Jul 18, 2024
    The first fully FP8 checkpoint of Mistral Nemo is here! huggingface.co/neuralmagic/Mi… It has been evaluated by @NeuralMagic on OpenLLM with >99% accuracy preservation and is ready to deploy in vLLM (make sure to build vLLM from main or use the nightly)
    RedHatAI/Mistral-Nemo-Instruct-2407-FP8 · Hugging Face
    From huggingface.co
    6.9K
  • Michael Goin
    @mgoin_
    Jul 24, 2024
    Nemotron-4-340B-Base is alive in vLLM! huggingface.co/mgoin/Nemotron… I converted and quantized the .nemo checkpoint to FP8 weights, then validated a few evaluations (TruthfulQA, Winogrande) match up with the paper results. Working on Instruct now, but we are close to escaping Nemo.
    mgoin/Nemotron-4-340B-Base-hf-FP8 · Hugging Face
    From huggingface.co
    4.6K
  • Michael Goin
    @mgoin_
    Jul 10, 2024
    If you still don't believe Llama 3 405B will be released, read this: dev-discuss.pytorch.org/t/meta-pytorch…
    3.6K
  • Michael Goin
    @mgoin_
    Jan 6, 2023
    #YOLOv5 has been awesome for building impressive object detection demos that scale from laptops to servers! Check out the 20+ sparsified versions @neuralmagic has made, from nano🤏 to xlarge🦣 sparsezoo.neuralmagic.com/?repo=ultralyt…
    Ultralytics
    @ultralytics
    Jan 6, 2023
    Announcing our new partnership 📣 Deploy #YOLOv5 with GPU-class performance on CPUs with @neuralmagic 🌟 Neural Magic is leading a new software-delivered AI movement by bringing hyper-performant and scalable ML inference to commodity CPU infrastructure bit.ly/3IvPNs3
    4K
  • Michael Goin
    @mgoin_
    Sep 5, 2024
    Come learn today at 2pm ET why we use @NVIDIAAI CUTLASS for high-throughput inference kernels in vLLM! If you are interested in peak quantized GEMM performance on GPUs, this is the talk to attend and ask questions.
    Red Hat AI
    @RedHat_AI
    Aug 29, 2024
    🚨 vLLM Office Hours continue on Thursday, September 5th, at 2PM ET / 11AM PT! Tyler Smith (@tms_jr), vLLM Committer & Technical Director at Neural Magic, will dive deep into using NVIDIA CUTLASS for high-performance INT8 & FP8 vLLM inference. Sign up: neuralmagic.com/community-offi…
    2.3K
  • Michael Goin
    @mgoin_
    Dec 4, 2022
    Bye bye @dylan522p
    Mokaya
    @ekmokaya
    Dec 4, 2022
    ChatGPT writes a small deep dive on Nvidia's latest Q3 results. 😲 Do we need analysts anymore? 👀 $NVDA
  • Michael Goin
    @mgoin_
    Jul 15, 2024
    Our team at @neuralmagic collaborated with @anyscalecompute to integrate support for FP8 quantization and inference into vLLM. We achieved >99% accuracy preservation after FP8 quantization across a wide range of models, resulting in LLMs that are up to 2x faster!
    Red Hat AI
    @RedHat_AI
    Jul 15, 2024
    EXCITING NEWS: Neural Magic and @anyscalecompute contributed FP8 quantization support to the @vllm_project, making LLM inference more efficient. FP8 reduces latency on NVIDIA GPUs by 2x with >99% accuracy preservation. Cheers to @NVIDIAAIDev for validating our results. 1/6
    Inter-Token Latency (ITL) benchmarks for Llama 3 70B and Mixtral 8x7B on 2xH100.
    1.9K
  • Michael Goin
    @mgoin_
    Nov 10, 2022
    @AMD #Genoa runs 64 cameras at 30 FPS, each running YOLOv5 detection with @neuralmagic DeepSparse - an over 3.5X ML improvement over Milan! youtube.com/watch?v=FTXqN4…
  • Michael Goin
    @mgoin_
    Nov 30, 2023
    Llama llama, speedy drama 🦙 Zooming past in red pajamas 💨 First the flag, then the cheer 🏁 The fastest llamas have no fear! 🚀
    2.1K
