Red Hat AI
2,356 posts
Red Hat AI
@RedHat_AI
Accelerating AI innovation with open platforms and community. The future of AI is open.
ai.redhat.com
Joined May 2018
2,084 Following
11.1K Followers
  • Red Hat AI
    @RedHat_AI
    Dec 24, 2022
    Accelerate your @huggingface 🤗 Inference Endpoints with DeepSparse to achieve 43x CPU speedup and 97% cost reduction over @PyTorch. Side note: DeepSparse is even faster than a T4 GPU 🤯 Learn more in our blog: neuralmagic.com/blog/accelerat…
    DeepSparse cost efficiency on inference endpoint compared to PyTorch CPU and PyTorch T4 GPU
    86K
  • Red Hat AI
    @RedHat_AI
    May 20, 2025
    LLM inference is too slow, too expensive, and too hard to scale. 🚨 Introducing llm-d, a Kubernetes-native distributed inference framework, to change that—using vLLM (@vllm_project), smart scheduling, and disaggregated compute. Here’s how it works—and how you can use it today:
    70K
  • Red Hat AI
    @RedHat_AI
    Sep 8, 2022
    HOT OFF THE PRESS! Neural Magic introduces sparsity and software-only ML execution to @MLPerf, boosting CPU performance 175X! 175X! Yes, you read that right. Read more about this amazing feat and replicate our results: hubs.li/Q01lThFP0
    MLPerf Inference v2.1 BERT-Large on CPUs
  • Red Hat AI
    @RedHat_AI
    Jul 23, 2024
    vLLM now supports deploying Llama-3.1-405B on a single 8xH100 or 8xA100 node, making inference much easier and cheaper! This is a huge feat by Neural Magic’s engineers who contributed 3 crucial features to enable immediate, FP8 deployments of the 405B model in vLLM: (1/5)
    61K
  • Red Hat AI
    @RedHat_AI
    Aug 15, 2024
    We're excited to introduce LLM Compressor, a library to compress LLMs for faster inference with vLLM. Our team used it to create fully quantized models like Llama 3.1 405B, recovering full accuracy and cutting costs 4x. Now, we're contributing it to the vLLM community! (1/6)
    Applying quantization to Llama 3.1 8B with the LLM Compressor
    45K
  • Red Hat AI
    @RedHat_AI
    Apr 29, 2022
    Transformers are huge. They are not efficient in deployment. But no worries. You can sparsify them with a few lines of code using SparseML: github.com/neuralmagic/sp… Result? More compression and better inference performance at the same accuracy. P.S. Same goes for CV models!
  • Red Hat AI
    @RedHat_AI
    May 9, 2023
    Did you know that you can use SparseGPT to apply one-shot sparsification to make large language models run faster on CPUs? This 🧵 explores how sparsification works, why it's a game changer for LLMs, and how to apply it to your models today. #Sparsification #LanguageModels
    25K
  • Red Hat AI
    @RedHat_AI
    Oct 18, 2023
    We applied our latest sparse fine-tuning research on the MPT-7b model, resulting in a 75% pruned model that doesn't drop accuracy. 🤯 75% fewer parameters means we can now run LLM inference performantly on commodity CPUs. @_mwitiderrick shares the details:
    Sparse LLM Inference on CPU
    From huggingface.co
    33K
  • Red Hat AI
    @RedHat_AI
    Mar 11, 2024
    Neural Magic is expanding to GPUs! Complementing our existing efforts with CPUs and model compression, we just launched nm-vllm, our initial community release to support GPU inference serving for LLMs. github.com/neuralmagic/nm… Details 👇
    GitHub - neuralmagic/nm-vllm: A high-throughput and memory-efficient inference and serving engine...
    From github.com
    43K
  • Red Hat AI
    @RedHat_AI
    Jul 30, 2024
    We further enhanced Meta's Llama 3.1 405B with full FP8 quantization. In other words, we quantized every linear module, unlike the original which skipped 510. RESULT: 20% less memory (~400GB vs 500GB), 99.74% accuracy recovery, no OOM errors. 😎 🦙
    RedHatAI/Meta-Llama-3.1-405B-Instruct-FP8 · Hugging Face
    From huggingface.co
    16K
  • Red Hat AI
    @RedHat_AI
    Nov 22, 2024
    LLM Compressor optimizes LLMs for faster inference and lower costs with minimal accuracy trade-offs. GitHub: github.com/vllm-project/l… Here’s @mgoin_ on what’s new in v0.3.0:
    16K
  • Red Hat AI
    @RedHat_AI
    Oct 14, 2023
    Sparsity makes LLMs go 🚀 🚀 🚀 ….on ordinary CPUs. Here’s how: huggingface.co/spaces/neuralm…
    82K
  • Red Hat AI
    @RedHat_AI
    Apr 19, 2022
    DeepSparse Engine runs DL models on everyday CPUs at GPU speeds! For latency-sensitive applications, it makes a 4-core Intel MacBook more performant than a T4 GPU and an 8-core server more performant than a V100 GPU. 🤯 <-- If this is not you right now, read this tweet again!
  • Red Hat AI
    @RedHat_AI
    Jan 16, 2024
    We've been taking popular fine-tuned LLMs from @huggingface and applying #SparseGPT to compress them 50% with sparsity and quantization to save on memory and compute during inference. Here are two examples: Llama 2 7b chat: huggingface.co/neuralmagic/Ll… Hermes 2 - Solar 10.7B:
    RedHatAI/Llama2-7b-chat-pruned50-quant-ds · Hugging Face
    From huggingface.co
    32K
