Red Hat AI (@RedHat_AI)

2,356 posts

@RedHat_AI

Accelerating AI innovation with open platforms and community. The future of AI is open.

Joined May 2018

Red Hat AI
@RedHat_AI
Dec 24, 2022
Accelerate your @huggingface 🤗 Inference Endpoints with DeepSparse to achieve 43x CPU speedup and 97% cost reduction over @PyTorch. Side note: DeepSparse is even faster than a T4 GPU 🤯 Learn more in our blog: neuralmagic.com/blog/accelerat…
Red Hat AI
@RedHat_AI
May 20, 2025
LLM inference is too slow, too expensive, and too hard to scale. 🚨 Introducing llm-d, a Kubernetes-native distributed inference framework, to change that—using vLLM (@vllm_project), smart scheduling, and disaggregated compute. Here’s how it works—and how you can use it today:
Red Hat AI
@RedHat_AI
Sep 8, 2022
HOT OFF THE PRESS! Neural Magic introduces sparsity and software-only ML execution to @MLPerf, boosting CPU performance 175X! 175X! Yes, you read that right. Read more about this amazing feat and replicate our results: hubs.li/Q01lThFP0
Red Hat AI
@RedHat_AI
Jul 23, 2024
vLLM now supports deploying Llama-3.1-405B on a single 8xH100 or 8xA100 node, making inference much easier and cheaper! This is a huge feat by Neural Magic’s engineers who contributed 3 crucial features to enable immediate, FP8 deployments of the 405B model in vLLM: (1/5)
Red Hat AI
@RedHat_AI
Aug 15, 2024
We're excited to introduce LLM Compressor, a library to compress LLMs for faster inference with vLLM. Our team used it to create fully quantized models like Llama 3.1 405B, recovering full accuracy and cutting costs 4x. Now, we're contributing it to the vLLM community! (1/6)
Red Hat AI
@RedHat_AI
Apr 29, 2022
Transformers are huge. They are not efficient in deployment. But no worries. You can sparsify them with a few lines of code using SparseML: github.com/neuralmagic/sp… Result? More compression and better inference performance at the same accuracy. P.S. Same goes for CV models!
Red Hat AI
@RedHat_AI
May 9, 2023
Did you know that you can use SparseGPT to apply one-shot sparsification to make large language models run faster on CPUs? This 🧵 explores how sparsification works, why it's a game changer for LLMs, and how to apply it to your models today. #Sparsification #LanguageModels
Red Hat AI
@RedHat_AI
Oct 18, 2023
We applied our latest sparse fine-tuning research on the MPT-7b model, resulting in a 75% pruned model that doesn't drop accuracy. 🤯 75% fewer parameters means we can now run LLM inference performantly on commodity CPUs. @_mwitiderrick shares the details:
Sparse LLM Inference on CPU
From huggingface.co
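To make the "75% pruned" figure in the post above concrete, here is a minimal sketch of unstructured pruning at 75% sparsity. It uses plain magnitude pruning, which is a simpler stand-in and not the sparse fine-tuning method the post refers to; the function name `magnitude_prune` and the toy weights are hypothetical, chosen only for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Illustrative stand-in only: real LLM pruning methods choose which weights
    to drop more carefully and then recover accuracy with further training.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical toy weights: 8 values, so 75% sparsity removes 6 of them.
weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.15, 0.6]
pruned = magnitude_prune(weights, 0.75)
kept = sum(1 for w in pruned if w != 0.0)
print(pruned, kept)
```

Only the two largest-magnitude weights survive here. The speedup on CPUs then depends on a runtime that can skip the zeroed weights, which this sketch does not model.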
Red Hat AI
@RedHat_AI
Mar 11, 2024
Neural Magic is expanding to GPUs! Complementing our existing efforts with CPUs and model compression, we just launched nm-vllm, our initial community release to support GPU inference serving for LLMs. github.com/neuralmagic/nm… Details 👇
GitHub - neuralmagic/nm-vllm: A high-throughput and memory-efficient inference and serving engine...
From github.com
Red Hat AI
@RedHat_AI
Jul 30, 2024
We further enhanced Meta's Llama 3.1 405B with full FP8 quantization. In other words, we quantized every linear module, unlike the original which skipped 510. RESULT: 20% less memory (~400GB vs 500GB), 99.74% accuracy recovery, no OOM errors. 😎 🦙
RedHatAI/Meta-Llama-3.1-405B-Instruct-FP8 · Hugging Face
From huggingface.co
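The memory figures in the post above can be checked with simple arithmetic. The sketch below assumes 1 byte per parameter for FP8 weights and 2 bytes for 16-bit weights, and counts weights only (no KV cache or activations); the 400 GB and 500 GB values are the approximate figures quoted in the post, and the helper name is hypothetical.

```python
def reduction_percent(before_gb, after_gb):
    """Percentage reduction going from before_gb to after_gb."""
    return 100 * (before_gb - after_gb) / before_gb


PARAMS_BILLIONS = 405  # Llama 3.1 405B

# Weights-only estimates: 1e9 parameters at 1 byte each is about 1 GB.
fp8_weights_gb = PARAMS_BILLIONS * 1   # assumption: every weight stored in FP8
bf16_weights_gb = PARAMS_BILLIONS * 2  # assumption: every weight stored in 16 bits

# Figures quoted in the post: ~500 GB (original FP8 checkpoint) vs ~400 GB.
saving = reduction_percent(500, 400)
print(fp8_weights_gb, bf16_weights_gb, saving)
```

A fully FP8 model comes out at roughly 405 GB of weights under these assumptions, in line with the "~400GB" in the post, and the step from 500 GB to 400 GB is the quoted 20% reduction.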
Red Hat AI
@RedHat_AI
Nov 22, 2024
LLM Compressor optimizes LLMs for faster inference and lower costs with minimal accuracy trade-offs. GitHub: github.com/vllm-project/l… Here’s @mgoin_ on what’s new in v0.3.0:
Red Hat AI
@RedHat_AI
Oct 14, 2023
Sparsity makes LLMs go 🚀 🚀 🚀 ... on ordinary CPUs. Here’s how: huggingface.co/spaces/neuralm…
Red Hat AI
@RedHat_AI
Apr 19, 2022
DeepSparse Engine runs DL models on everyday CPUs at GPU speeds! For latency-sensitive applications, it makes a 4-core Intel MacBook more performant than a T4 GPU and an 8-core server more performant than a V100 GPU. 🤯 <-- If this is not you right now, read this tweet again!
Red Hat AI
@RedHat_AI
Jan 16, 2024
We've been taking popular fine-tuned LLMs from @huggingface and applying #SparseGPT to compress them 50% with sparsity and quantization to save on memory and compute during inference. Here are two examples: Llama 2 7b chat: huggingface.co/neuralmagic/Ll… Hermes 2 - Solar 10.7B:
RedHatAI/Llama2-7b-chat-pruned50-quant-ds · Hugging Face
From huggingface.co