Accelerate your @huggingface 🤗 Inference Endpoints with DeepSparse to achieve 43x CPU speedup and 97% cost reduction over @PyTorch.
Side note: DeepSparse is even faster than a T4 GPU 🤯
Learn more in our blog: neuralmagic.com/blog/accelerat…
Red Hat AI
Accelerating AI innovation with open platforms and community.
The future of AI is open.
Joined May 2018
- LLM inference is too slow, too expensive, and too hard to scale. 🚨 Introducing llm-d, a Kubernetes-native distributed inference framework, to change that—using vLLM (@vllm_project), smart scheduling, and disaggregated compute. Here’s how it works—and how you can use it today:
- HOT OFF THE PRESS! Neural Magic introduces sparsity and software-only ML execution to @MLPerf, boosting CPU performance 175X! 175X! Yes, you read that right. Read more about this amazing feat and replicate our results: hubs.li/Q01lThFP0
- vLLM now supports deploying Llama-3.1-405B on a single 8xH100 or 8xA100 node, making inference much easier and cheaper! This is a huge feat by Neural Magic’s engineers who contributed 3 crucial features to enable immediate, FP8 deployments of the 405B model in vLLM: (1/5)
- We're excited to introduce LLM Compressor, a library to compress LLMs for faster inference with vLLM. Our team used it to create fully quantized models like Llama 3.1 405B, recovering full accuracy and cutting costs 4x. Now, we're contributing it to the vLLM community! (1/6)
- Transformers are huge. They are not efficient in deployment. But no worries. You can sparsify them with a few lines of code using SparseML: github.com/neuralmagic/sp… Result? More compression and better inference performance at the same accuracy. P.S. Same goes for CV models!
- Did you know that you can use SparseGPT to apply one-shot sparsification to make large language models run faster on CPUs? This 🧵 explores how sparsification works, why it's a game changer for LLMs, and how to apply it to your models today. #Sparsification #LanguageModels
- We applied our latest sparse fine-tuning research on the MPT-7b model, resulting in a 75% pruned model that doesn't drop accuracy. 🤯 75% fewer parameters means we can now run LLM inference performantly on commodity CPUs. @_mwitiderrick shares the details:
- Neural Magic is expanding to GPUs! Complementing our existing efforts with CPUs and model compression, we just launched nm-vllm, our initial community release to support GPU inference serving for LLMs. github.com/neuralmagic/nm… Details 👇
- We further enhanced Meta's Llama 3.1 405B with full FP8 quantization. In other words, we quantized every linear module, unlike the original which skipped 510. RESULT: 20% less memory (~400GB vs 500GB), 99.74% accuracy recovery, no OOM errors. 😎 🦙
- LLM Compressor optimizes LLMs for faster inference and lower costs with minimal accuracy trade-offs. GitHub: github.com/vllm-project/l… Here’s @mgoin_ on what’s new in v0.3.0:
- Sparsity makes LLMs go 🚀 🚀 🚀 ... on ordinary CPUs. Here’s how: huggingface.co/spaces/neuralm…
- DeepSparse Engine runs DL models on everyday CPUs at GPU speeds! For latency-sensitive applications, it makes a 4-core Intel MacBook more performant than a T4 GPU and an 8-core server more performant than a V100 GPU. 🤯 <-- If this is not you right now, read this tweet again!
- We've been taking popular fine-tuned LLMs from @huggingface and applying #SparseGPT to compress them 50% with sparsity and quantization to save on memory and compute during inference. Here are two examples: Llama 2 7b chat: huggingface.co/neuralmagic/Ll… Hermes 2 - Solar 10.7B:
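
The memory and sparsity figures quoted in the posts above can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope check, not anything from the posts themselves: the helper names are hypothetical, it counts weights only (no activations or KV cache), it assumes one byte per parameter for FP8 and two for 16-bit weights, and it takes the "405B" and "7b" model names at face value as parameter counts.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def relative_saving(before: float, after: float) -> float:
    """Fraction saved when going from `before` to `after`."""
    return (before - after) / before


# Llama 3.1 405B: nominal parameter count taken from the model name (assumption).
params_405b = 405e9
fp8_gb = weight_memory_gb(params_405b, 1)   # FP8 stores one byte per weight
bf16_gb = weight_memory_gb(params_405b, 2)  # 16-bit formats store two bytes per weight

# The post quotes roughly 400GB vs 500GB; the relative saving from those two figures:
quoted_saving = relative_saving(500, 400)

# MPT-7b at 75% sparsity: nominal 7e9 parameters (assumption), 25% remain nonzero.
remaining_params = 7e9 * (1 - 0.75)

print(f"FP8 weights:    ~{fp8_gb:.0f} GB")
print(f"16-bit weights: ~{bf16_gb:.0f} GB")
print(f"Quoted saving:  {quoted_saving:.0%}")
print(f"Nonzero params after 75% pruning: {remaining_params:.2e}")
```

Under these assumptions, a fully FP8 405B model needs about one byte per parameter for its weights, which lines up with the roughly 400GB figure in the post, and the quoted 400GB vs 500GB comparison works out to the stated 20%.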








