Michael Goin

1,136 posts

@mgoin_

maintainer @vllm_project, inference perf @RedHat_AI (acq @neuralmagic), friends call me misha

Boston

Joined April 2022

Pinned
Michael Goin
@mgoin_
Oct 10, 2025
Happy that InferenceMAX is here because it signals a milestone for vLLM's SOTA performance on NVIDIA Blackwell! 🥳 It has been a pleasure to deeply collaborate with @nvidia in @vllm_project, and we have much more to do. Read about the work we did here:
Dylan Patel
@dylan522p
Oct 9, 2025
Today we are launching InferenceMAX! We have support from Nvidia, AMD, OpenAI, Microsoft, Pytorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, Dell. It runs every day on the latest software (vLLM, SGLang, etc) across hundreds of GPUs, $10Ms of
SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference
From vllm.ai
29K
Michael Goin
@mgoin_
Oct 12, 2023
Exciting news from our latest LLM compression research! 🚀 Together with @ISTAustria and @neuralmagic, we’ve been exploring sparse finetuning for LLMs and achieved 7.7 tokens/second on a single core and 26.7 tokens/second on 4 cores of an AMD Ryzen CPU! (1/n)
22K
Michael Goin
@mgoin_
Dec 12, 2023
I met my match with @abhi_venigalla at #NeurIPS2023. At least we’re both rich! 🤑
5K
Michael Goin
@mgoin_
Jun 14, 2024
Excited that FP8 in vLLM is getting better and better as we spend time on it. Check out our collection of accurate, pre-quantized checkpoints!
vLLM
@vllm_project
Jun 14, 2024
Replying to @vllm_project
FP8 inference lowers latency, increases throughput, and reduces memory usage with very little accuracy drop. docs.vllm.ai/en/stable/quan…
FP8 LLMs for vLLM - a neuralmagic Collection
From huggingface.co
7K
Michael Goin
@mgoin_
Sep 4, 2024
My first collaboration on diffusion models - VQDM! yandex-research.github.io/vqdm/ With the right kind of quantization, SDXL and SDXL-Turbo compressed to 3 bit weights can offer the same quality as previous 4 bit methods. And, of course, 4 bit quality is better with VQDM🙂
AK
@_akhaliq
Sep 4, 2024
Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization paper page: huggingface.co/papers/2409.00… Text-to-image diffusion models have emerged as a powerful framework for high-quality image generation given textual prompts. Their success has driven the rapid
yandex-research.github.io
Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization
This work introduces Vector Quantization for text-to-image Diffusion Models (VQDM), that provides highly accurate 3-4 bit compression of diffusion models with 2B+ parameters (SDXL and SDXL-Turbo).
10K
Michael Goin
@mgoin_
Jul 18, 2024
The first fully FP8 checkpoint of Mistral Nemo is here! huggingface.co/neuralmagic/Mi… It has been evaluated by @NeuralMagic on OpenLLM with >99% accuracy preservation and is ready to deploy in vLLM (make sure to build vLLM from main or use the nightly)
RedHatAI/Mistral-Nemo-Instruct-2407-FP8 · Hugging Face
From huggingface.co
6.9K
Michael Goin
@mgoin_
Jul 24, 2024
Nemotron-4-340B-Base is alive in vLLM! huggingface.co/mgoin/Nemotron… I converted and quantized the .nemo checkpoint to FP8 weights, then validated that a few evaluations (TruthfulQA, Winogrande) match up with the paper results. Working on Instruct now, but we are close to escaping Nemo.
mgoin/Nemotron-4-340B-Base-hf-FP8 · Hugging Face
From huggingface.co
4.6K
Michael Goin
@mgoin_
Jul 10, 2024
If you still don't believe Llama 3 405B will be released, read this: dev-discuss.pytorch.org/t/meta-pytorch…
3.6K
Michael Goin
@mgoin_
Jan 6, 2023
#YOLOv5 has been awesome for building impressive object detection demos that scale from laptops to servers! Check out the 20+ sparsified versions @neuralmagic has made, from nano🤏 to xlarge🦣 sparsezoo.neuralmagic.com/?repo=ultralyt…
Ultralytics
@ultralytics
Jan 6, 2023
Announcing our new partnership 📣 Deploy #YOLOv5 with GPU-class performance on CPUs with @neuralmagic 🌟 Neural Magic is leading a new software-delivered AI movement by bringing hyper-performant and scalable ML inference to commodity CPU infrastructure bit.ly/3IvPNs3
4K
Michael Goin
@mgoin_
Sep 5, 2024
Come learn today at 2pm ET why we use @NVIDIAAI CUTLASS for high-throughput inference kernels in vLLM! If you are interested in peak quantized GEMM performance on GPUs, this is the talk to attend and ask questions
Red Hat AI
@RedHat_AI
Aug 29, 2024
🚨 vLLM Office Hours continue on Thursday, September 5th, at 2PM ET / 11AM PT! Tyler Smith (@tms_jr), vLLM Committer & Technical Director at Neural Magic, will dive deep into using NVIDIA CUTLASS for high-performance INT8 & FP8 vLLM inference. Sign up: neuralmagic.com/community-offi…
2.3K
Michael Goin
@mgoin_
Dec 4, 2022
Bye bye @dylan522p
Mokaya
@ekmokaya
Dec 4, 2022
ChatGPT writes a small deep dive on Nvidia's latest Q3 results. 😲 Do we need analysts anymore? 👀 $NVDA
Michael Goin
@mgoin_
Jul 15, 2024
Our team at @neuralmagic collaborated with @anyscalecompute to integrate support for FP8 quantization and inference into vLLM. We achieved >99% accuracy preservation after FP8 quantization across a wide range of models, resulting in LLMs that are up to 2x faster!
Red Hat AI
@RedHat_AI
Jul 15, 2024
EXCITING NEWS: Neural Magic and @anyscalecompute contributed FP8 quantization support to the @vllm_project, making LLM inference more efficient. FP8 reduces latency on NVIDIA GPUs by 2x with >99% accuracy preservation. Cheers to @NVIDIAAIDev for validating our results. 1/6
1.9K
Michael Goin
@mgoin_
Nov 10, 2022
@AMD #Genoa runs 64 cameras at 30 FPS, each running YOLOv5 detection with @neuralmagic DeepSparse - a more than 3.5X ML improvement over Milan! youtube.com/watch?v=FTXqN4…
Michael Goin
@mgoin_
Nov 30, 2023
Llama llama, speedy drama 🦙 Zooming past in red pajamas 💨 First the flag, then the cheer 🏁 The fastest llamas have no fear! 🚀
2.1K