Stories by Patric on Medium

AI Agents in Production: The Lifecycle Problem Nobody Talks About

Patric — Sun, 18 Jan 2026 10:11:24 GMT

Bridging the gap between AI agent prototypes and production-ready SaaS systems

The Moment Everything Falls Apart

You’ve built an AI agent. It works beautifully in your notebook. The demo impresses everyone. Then you try to deploy it to production and suddenly you’re dealing with:

Users expecting instant responses (your agent takes 30 seconds)
Prompt changes breaking everything (no rollback strategy)
Costs spiraling (you’re burning $500/day on a feature 3 people use)
Support tickets you can’t debug (“It gave me a weird answer yesterday”)

The problem isn’t your code. It’s that agents have a lifecycle that nobody designed for.

Let me show you what I mean.

The Agent Lifecycle (What Actually Happens)

Here’s what really happens when a user interacts with your agent in production:

Notice what’s missing from most tutorials:

Version routing
Guardrails
Monitoring that actually helps

Let’s fix that.

Problem #1: Versioning Agents (Not Just Prompts)

Here’s what everyone does wrong:

// The naive approach
const agent = {
  prompt: "You are a helpful assistant...",
  model: "gpt-4",
  temperature: 0.7
}

What happens when you change the prompt?

Every user gets the new version immediately
No A/B testing
No rollback if it breaks
No idea which version caused the bug

The Solution: Treat Agents Like Software Releases

Key insight: Users should be pinned to a version until you explicitly migrate them.
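
Here is a minimal sketch of what version pinning could look like, assuming an in-memory registry. The names `agentVersions`, `userPins`, `resolveAgent`, and `migrateUser` are hypothetical illustrations, not code from the companion repo, and a real system would keep the pins in a database.

```javascript
// Hypothetical registry: every released agent config is immutable and versioned.
const agentVersions = {
  "1.0.0": { prompt: "You are a helpful assistant...", model: "gpt-4", temperature: 0.7 },
  "1.1.0": { prompt: "You are a concise, helpful assistant...", model: "gpt-4", temperature: 0.3 },
};

// New users get this version; existing users keep whatever they are pinned to.
const defaultVersion = "1.1.0";

// Hypothetical pin store (a database table in a real system).
const userPins = new Map([["user-42", "1.0.0"]]);

// Look up the config a given user should run against.
function resolveAgent(userId) {
  const version = userPins.get(userId) ?? defaultVersion;
  const config = agentVersions[version];
  if (!config) throw new Error(`Unknown agent version: ${version}`);
  return { version, ...config };
}

// Migration (and rollback) is an explicit, per-user operation.
function migrateUser(userId, version) {
  if (!agentVersions[version]) throw new Error(`Unknown agent version: ${version}`);
  userPins.set(userId, version);
}

console.log(resolveAgent("user-42").version); // "1.0.0" (pinned)
console.log(resolveAgent("user-99").version); // "1.1.0" (default)
```

Because every response can be traced to a `version` string, a rollback is just another `migrateUser` call that points users back at the previous release.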

Problem #2: The Cost-Performance Tradeoff

Every agent call has a hidden decision tree:

Most tutorials skip this entirely. But in production:

60% of requests don’t need GPT-4
30% could be cached
10% need the heavy model

Result: You can cut costs by 70% with smart routing.
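
As a rough sketch of that routing, assume three tiers: a cache, a light model, and a heavy model. The model names and the `needsHeavyModel` heuristic below are placeholders I made up for illustration; in practice you would tune the rules against your own traffic.

```javascript
// Hypothetical model tiers; the names are placeholders, not recommendations.
const LIGHT_MODEL = "small-model";
const HEAVY_MODEL = "large-model";

// Normalized question -> cached answer.
const cache = new Map();

// Collapse case and whitespace so near-identical questions share a cache key.
function normalize(text) {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Assumed heuristic: long or analytical requests go to the heavy model.
function needsHeavyModel(text) {
  return text.length > 500 || /\b(analyze|compare|step by step)\b/i.test(text);
}

// Decide where a request goes: cache first, then the cheapest model that fits.
function route(text) {
  const key = normalize(text);
  if (cache.has(key)) return { target: "cache", answer: cache.get(key) };
  return { target: needsHeavyModel(text) ? HEAVY_MODEL : LIGHT_MODEL };
}

cache.set(normalize("What are your opening hours?"), "9am to 5pm");

console.log(route("what are your  opening hours?").target); // "cache"
console.log(route("Reset my password").target); // "small-model"
console.log(route("Compare these two contracts clause by clause").target); // "large-model"
```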

Problem #3: Monitoring That Actually Helps

When something goes wrong, you need to know:

Which version of the agent was used?
What was the exact prompt sent to the LLM?
How much did it cost?
Did guardrails trigger?

The Monitoring Stack

Critical: You need structured logging from day one.
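
A minimal sketch of one structured log record that answers the four questions above. The field names and the per-1,000-token prices are assumptions for illustration; plug in whatever your provider actually charges.

```javascript
// Hypothetical log record: one JSON object per agent call.
function buildLogRecord({ requestId, agentVersion, prompt, usage, guardrails }) {
  // Assumed pricing inputs: USD per 1,000 tokens, supplied by the caller.
  const costUsd =
    (usage.inputTokens / 1000) * usage.inputPricePer1k +
    (usage.outputTokens / 1000) * usage.outputPricePer1k;

  return {
    timestamp: new Date().toISOString(),
    requestId,
    agentVersion,                        // Which version of the agent was used?
    prompt,                              // What exact prompt was sent to the LLM?
    inputTokens: usage.inputTokens,
    outputTokens: usage.outputTokens,
    costUsd: Number(costUsd.toFixed(6)), // How much did it cost?
    guardrailsTriggered: guardrails,     // Did guardrails trigger?
  };
}

const record = buildLogRecord({
  requestId: "req-001",
  agentVersion: "1.1.0",
  prompt: "You are a helpful assistant...\n\nUser: Where is my invoice?",
  usage: { inputTokens: 1200, outputTokens: 300, inputPricePer1k: 0.01, outputPricePer1k: 0.03 },
  guardrails: [],
});

// One line of JSON per request is easy to ship to any log pipeline.
console.log(JSON.stringify(record));
```

With records like this, “It gave me a weird answer yesterday” becomes a query by user and timestamp instead of guesswork.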

The Multi-Tenant Challenge

If you’re building SaaS, you have an extra problem: different customers need different agent behavior.

Key decisions:

Shared core vs. per-customer forks
Override hierarchy: Global → Tenant → User
Isolation: How do you prevent data leakage?
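
The override hierarchy can be sketched as a layered merge where later layers win. This is a simplified illustration with made-up config keys; it does a shallow merge only and does not address isolation.

```javascript
// Later layers win: Global -> Tenant -> User.
function resolveConfig(globalConfig, tenantConfig = {}, userConfig = {}) {
  return { ...globalConfig, ...tenantConfig, ...userConfig };
}

// Hypothetical settings for illustration.
const globalConfig = { model: "base-model", temperature: 0.7, tone: "neutral" };
const tenantConfig = { tone: "formal" };    // this customer wants formal replies
const userConfig = { temperature: 0.2 };    // this user prefers predictable output

console.log(resolveConfig(globalConfig, tenantConfig, userConfig));
// { model: "base-model", temperature: 0.2, tone: "formal" }
```

Keeping the merge in one function means every request resolves its config the same way, which also makes the result easy to log.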

Putting It Together: The Production-Ready Architecture

Here’s the full picture:

What This Means for You

If you’re building agents for production, you need to think about:

Versioning from day one (not after you break production)
Cost optimization as a first-class concern (not an afterthought)
Observability that lets you debug what actually happened (not just “it failed”)
Multi-tenancy if you’re SaaS (different customers = different configs)

The good news: Most of this is just structured thinking. You don’t need fancy tools.

The bad news: Not many people are teaching this. Most tutorials stop at “here’s how to call OpenAI.”

Next Up

In Part 2, I’ll show you:

Concrete code for versioned agent execution
A guardrails layer that actually works
How to A/B test prompt changes safely

Part 3 will cover:

Eval frameworks that run in production
Rollback strategies when agents break
Multi-tenant prompt injection (the scary stuff)

Want to see the code? All examples are in the companion repo: ai-agents-saas-edition

Building agents in production? I’d love to hear what challenges you’re facing. Connect with me on LinkedIn.

The core insight: Agents aren’t functions. They’re services with lifecycles, versions, and SLAs. Treat them like that from day one and you’ll save yourself months of pain.

The GGUF Format Explained: Making AI Models Run Anywhere (Even on Your Laptop)

Patric — Thu, 18 Dec 2025 19:13:17 GMT

Ever wondered how people run powerful AI models like Llama on regular laptops without a supercomputer? The secret lies in a clever file format called GGUF. Let’s explore what it is, where it came from, and why it’s revolutionizing how we use large language models.

What Problem Does GGUF Solve?

Imagine trying to fit an entire library into your backpack. That’s essentially what we’re doing when we try to run modern AI models on regular computers. Models like GPT or Llama can be tens or even hundreds of gigabytes in size, requiring massive amounts of RAM and powerful GPUs to run.

This is where GGUF comes in. Think of it as a packaging format specifically designed for AI models: the shrinking itself comes from quantization (covered below), which, unlike ZIP compression, is lossy, and GGUF is the container that stores the quantized weights. It doesn’t just hold a smaller file; it organizes how the model is stored so it can be loaded and run efficiently on everyday hardware.

Real-world impact: With GGUF, a model that normally requires 64GB of RAM and a high-end GPU can run on a laptop with 16GB of RAM and just a CPU. That’s democratizing AI in action.

The Origins: From GGML to GGUF

To understand GGUF, we need to take a quick trip back to 2022.

The GGML Era

Developer Georgi Gerganov created GGML (a combination of his initials “GG” and “ML” for machine learning) in late 2022 as a tensor library focused on making AI models accessible on standard hardware. Before building llama.cpp, Gerganov had already proven the concept with whisper.cpp, which brought OpenAI’s Whisper speech-to-text model to consumer devices.

llama.cpp began development in March 2023 as a pure C/C++ implementation with no dependencies, designed to run on CPUs, including those in smartphones. The project gained massive traction, accumulating over 85,000 stars on GitHub, because it solved a real problem: making powerful AI accessible without specialized hardware.

However, GGML had limitations. Adding new features often broke compatibility with existing models, and the format lacked flexibility for storing essential metadata like tokenizer information or model-specific parameters.

Enter GGUF

On August 21st, 2023, the llama.cpp team introduced GGUF (commonly expanded as “GPT-Generated Unified Format”) as a replacement for GGML. This wasn’t just an incremental update; it was a complete redesign addressing GGML’s shortcomings.

GGUF was designed to be extensible and capable of incorporating new information without breaking compatibility with older models. It combines model parameters with comprehensive metadata in a single binary file, making models truly portable and self-contained.

How GGUF Works: The Technical Magic

The Quantization Game

At the heart of GGUF’s efficiency is quantization — the art of representing numbers with fewer bits while maintaining acceptable accuracy.

Here’s the concept in simple terms: Imagine you’re an artist with a palette of 16 million colors (standard for digital images). Quantization is like choosing to work with only 256 colors instead. Yes, you lose some nuance, but for many purposes, the result is still excellent — and your artwork takes up far less space.

In AI models, the weights (the learned parameters that make the model work) are typically stored as 32-bit or 16-bit floating-point numbers. GGUF supports quantization from as low as 2-bit to 8-bit integers, along with standard formats like float32, float16, and bfloat16.

Here’s what different quantization levels mean in practice:

Q2_K (2-bit): The most aggressive compression, roughly 2.5 bits per weight. Great for testing or when resources are extremely limited, but expect noticeable quality loss.

Q4_K (4-bit): The sweet spot for most users — uses about 4.5 bits per weight. Offers excellent balance between size and quality.

Q5_K (5-bit): Higher quality, slightly larger files. Good when you have a bit more RAM to spare.

Q8_0 (8-bit): Nearly indistinguishable from the original in most cases, but still half the size of 16-bit models.

F16/F32: Full precision formats for when quality is paramount and you have the resources.
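
To turn those bits-per-weight figures into file sizes, here is a back-of-the-envelope sketch. It uses the numbers quoted above (with Q8_0 rounded to 8 bits) and ignores metadata and per-tensor differences, so treat the results as rough estimates rather than exact file sizes.

```javascript
// Bits per weight as quoted in the list above; Q8_0 is taken as 8 to keep the arithmetic simple.
const BITS_PER_WEIGHT = { Q2_K: 2.5, Q4_K: 4.5, Q8_0: 8, F16: 16, F32: 32 };

// Estimated size in decimal gigabytes: parameters * bits / 8 bits per byte.
function estimateSizeGB(paramCount, quant) {
  const bits = BITS_PER_WEIGHT[quant];
  if (bits === undefined) throw new Error(`Unknown quantization: ${quant}`);
  return (paramCount * bits) / 8 / 1e9;
}

// Example: a model with 7 billion parameters.
for (const quant of Object.keys(BITS_PER_WEIGHT)) {
  console.log(quant, estimateSizeGB(7e9, quant).toFixed(1), "GB");
}
```

For 7 billion parameters this works out to roughly 2.2 GB at Q2_K, 3.9 GB at Q4_K, 7 GB at Q8_0, 14 GB at F16, and 28 GB at F32.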

The File Structure

A GGUF file consists of four main sections written sequentially: header, metadata key-value pairs, tensor information, and the tensor data itself.

Think of it like a well-organized filing cabinet:

Header: The label on the outside telling you what’s inside and how to open it
Metadata: The index cards with all the important information about the model
Tensor Info: The catalog listing what’s stored where
Tensor Data: The actual files containing the model weights

This structure includes everything necessary for running a GPT-like language model: tokenizer vocabulary, context length, tensor information, and other attributes.
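
To see the header for yourself, here is a small sketch that reads it with a DataView. It assumes the layout used by GGUF version 2 and later as I understand the published spec: the 4-byte magic `GGUF`, a little-endian 32-bit version, then 64-bit tensor and metadata counts. `readGgufHeader` is my own helper name, and the demo buffer is synthetic rather than a real model file.

```javascript
// Read the fixed-size GGUF header: magic, version, tensor count, metadata count.
function readGgufHeader(arrayBuffer) {
  const view = new DataView(arrayBuffer);
  const magic = String.fromCharCode(
    view.getUint8(0), view.getUint8(1), view.getUint8(2), view.getUint8(3)
  );
  if (magic !== "GGUF") throw new Error(`Not a GGUF file (magic: ${magic})`);
  return {
    magic,
    version: view.getUint32(4, true),             // little-endian uint32
    tensorCount: view.getBigUint64(8, true),      // little-endian uint64
    metadataKvCount: view.getBigUint64(16, true), // little-endian uint64
  };
}

// Build a synthetic 24-byte header to exercise the parser (not a real model file).
const demo = new ArrayBuffer(24);
const demoView = new DataView(demo);
[0x47, 0x47, 0x55, 0x46].forEach((byte, i) => demoView.setUint8(i, byte)); // "GGUF"
demoView.setUint32(4, 3, true);       // version
demoView.setBigUint64(8, 291n, true); // tensor count
demoView.setBigUint64(16, 24n, true); // metadata key-value count

console.log(readGgufHeader(demo));
```

For a real file you only need the first 24 bytes, so there is no reason to load several gigabytes just to inspect the header.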

Where GGUF Came From: The llama.cpp Project

GGUF is inseparable from llama.cpp, the project that created and maintains it. The creation of GGML was inspired by Fabrice Bellard’s work on LibNC, and the entire effort has been focused on one goal: making AI models work efficiently on consumer hardware.

The project supports an impressive array of hardware targets: x86, ARM, Metal (for Apple Silicon), CUDA (NVIDIA GPUs), ROCm (AMD GPUs), and more. It uses CPU optimizations like AVX, AVX2, and AVX-512 on Intel/AMD processors, and NEON on ARM devices.

What started as an experiment in March 2023 has become the foundation for running local AI models worldwide.

How GGUF Is Used Today

1. Running Models Locally

The most common use case is running large language models on your own computer. Here’s a quick example:

# Download llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build it
cmake -B build
cmake --build build --config Release

# Run a model
./build/bin/llama-cli -m path/to/model.gguf -p "Hello, world!"

That’s it. No cloud services, no API keys, no privacy concerns about your data leaving your machine.

2. Converting Models to GGUF

You can convert almost any Hugging Face model to GGUF format:

# Step 1 (Python): download a model from Hugging Face
# Note: meta-llama repos are gated, so accept the license and log in first
from huggingface_hub import snapshot_download
snapshot_download(repo_id="meta-llama/Llama-3.2-3B", local_dir="model")

# Step 2 (shell): convert to FP16 first
python llama.cpp/convert_hf_to_gguf.py model --outtype f16 --outfile model-fp16.gguf

# Step 3 (shell): quantize to the desired level
./llama.cpp/build/bin/llama-quantize model-fp16.gguf model-q4.gguf Q4_K_M

3. Desktop Applications

Several user-friendly applications have emerged that use GGUF under the hood:

LM Studio: A polished GUI for running models on Windows, macOS, and Linux
Text Generation WebUI: A feature-rich web interface with GPU support
KoboldCpp: Popular for creative writing and storytelling
Jan: An open-source ChatGPT alternative that runs locally

4. Python Integration

You can use GGUF models directly in Python applications:

from llama_cpp import Llama

# Load a GGUF model
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=20,  # Offload some layers to GPU if available
    n_ctx=4096,       # Context window size
)

# Generate text
output = llm(
    "Explain quantum computing in simple terms:",
    max_tokens=200,
    temperature=0.7
)

print(output['choices'][0]['text'])

5. API Servers

llama.cpp includes a server mode that provides OpenAI-compatible API endpoints, meaning you can run local models but use the same code that works with ChatGPT:

# Start a server
./build/bin/llama-server -m model.gguf --port 8080

# Use it like OpenAI's API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
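
The same request can be sent from JavaScript. This sketch assumes a llama-server instance is listening on port 8080 and that a global `fetch` is available (browsers, or Node 18 and later); `buildChatRequest` and `ask` are helper names I chose for illustration.

```javascript
// Build the same request the curl example sends.
function buildChatRequest(baseUrl, userMessage) {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ messages: [{ role: "user", content: userMessage }] }),
    },
  };
}

// Send it and return the assistant's reply text (OpenAI-style response shape).
async function ask(baseUrl, userMessage) {
  const { url, init } = buildChatRequest(baseUrl, userMessage);
  const response = await fetch(url, init);
  if (!response.ok) throw new Error(`Server returned ${response.status}`);
  const data = await response.json();
  return data.choices[0].message.content;
}

// Usage (requires a running llama-server):
// ask("http://localhost:8080", "Hello!").then(console.log);
```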

The Quantization Quality Spectrum

Understanding which quantization level to choose depends on your use case:

For Experimentation (Q2_K, Q3_K):

Smallest file sizes
Fastest inference
Noticeable quality degradation
Good for testing if a model works for your purpose

For Production Use (Q4_K_M, Q4_K_S):

Excellent balance
Roughly 3–4x smaller than the FP16 original
Minimal quality loss for most tasks
Most popular choice

For Professional Applications (Q5_K, Q6_K):

Higher quality
Good for tasks requiring nuance
Still 50–70% smaller than original

For Maximum Quality (Q8_0, F16):

Nearly identical to original
When you have the RAM and need the best results
Research or evaluation

Real-World Applications

Education

Students can now run AI models on laptops for learning without expensive cloud bills. A programming student can have a code assistant running locally while learning.

Privacy-Sensitive Industries

Healthcare, legal, and financial sectors can use AI without sending sensitive data to external APIs. A law firm can analyze contracts with AI while keeping client information on premises.

Offline Applications

Researchers in remote locations, airlines, ships, and other disconnected environments can use AI capabilities without internet access.

Development and Testing

Developers can iterate quickly without API rate limits or costs. You can test thousands of prompts without worrying about your bill.

Edge Devices

Running AI models on smartphones, embedded systems, and IoT devices becomes feasible with GGUF’s efficiency.

The Ecosystem Around GGUF

The format has spawned an entire ecosystem:

Model Repositories: Hugging Face hosts thousands of GGUF models, with users like TheBloke (now maintained by the community) providing pre-quantized versions of popular models.

Conversion Tools: Automated tools for converting models from PyTorch, TensorFlow, and other frameworks to GGUF.

Hardware Optimizations: Continuous improvements for Apple Silicon, AMD GPUs, and various CPU architectures.

Community Tools: Model merging utilities, fine-tuning workflows, and performance benchmarking tools.

Advantages Over Other Formats

Why GGUF over alternatives like ONNX or GPTQ?

Versus ONNX:

GGUF is specifically optimized for LLMs, not general neural networks
Better quantization support for language models
Simpler deployment without additional dependencies

Versus GPTQ:

GPTQ requires GPU for inference; GGUF works on CPUs
GGUF offers more quantization options
GGUF files are self-contained with all metadata

Versus Original PyTorch Models:

4–8x smaller file sizes
No Python runtime required
Cross-platform compatibility without framework dependencies

Practical Tips for Working with GGUF

Choosing the Right Quantization

Start with Q4_K_M for most use cases. If the quality isn’t sufficient, move up to Q5_K. If you need maximum quality and size isn’t an issue, try Q8_0.

Memory Considerations

Model layers are split between system RAM and GPU VRAM based on the n_gpu_layers setting. If you have 8GB of VRAM, you might offload 20–30 layers to GPU and keep the rest in RAM.

Context Length

Longer context windows require more memory. A 4K context uses significantly less RAM than 8K or 16K. Start small and increase as needed.
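
To make that concrete, here is a rough sketch of the key-value cache, which is usually the part of memory that grows with context length. The formula is the standard one for transformer attention caches. The architecture numbers (32 layers, 32 KV heads, head size 128, 16-bit cache entries) are illustrative assumptions for a 7B-class model, not values read from any particular GGUF file, and the result excludes the weights themselves.

```javascript
// KV cache size: keys and values (x2) for every layer, position, and KV head.
function kvCacheBytes({ layers, contextLength, kvHeads, headDim, bytesPerValue }) {
  return 2 * layers * contextLength * kvHeads * headDim * bytesPerValue;
}

// Assumed 7B-class architecture with a 16-bit cache (illustrative values only).
const assumedModel = { layers: 32, kvHeads: 32, headDim: 128, bytesPerValue: 2 };

for (const contextLength of [4096, 8192, 16384]) {
  const gib = kvCacheBytes({ ...assumedModel, contextLength }) / 2 ** 30;
  console.log(`${contextLength} tokens: ${gib} GiB`);
}
```

Under these assumptions the cache needs 2 GiB at 4K, 4 GiB at 8K, and 8 GiB at 16K, so doubling the context doubles this part of the memory bill.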

File Naming Conventions

GGUF files follow naming patterns like:

llama-2-7b.Q4_K_M.gguf — Model name, quantization method
mistral-7b-instruct-q5_k.gguf — Lowercase variations exist too

The pattern helps you identify what you’re downloading at a glance.

The Future of GGUF

The format continues to evolve. Recent developments include:

Support for multimodal models (combining text and images)
FlashAttention integration for faster processing
Better memory mapping for ultra-large models
Improved quantization methods balancing quality and size

The format is designed to be extensible, allowing new features to be added without breaking compatibility with existing models, ensuring GGUF will remain relevant as AI technology advances.

Getting Started Today

Want to try GGUF yourself? Here’s the quickest path:

Download LM Studio (easiest for beginners) — it handles everything with a GUI
Or install llama.cpp if you prefer command-line control
Find a model on Hugging Face (search for “GGUF” in the model name)
Start with a smaller model (3B-7B parameters) to see how it performs on your hardware
Experiment with quantization levels to find your ideal balance

Why GGUF Matters

In the broader context of AI democratization, GGUF represents a critical stepping stone. It proves that powerful AI doesn’t require data centers or expensive hardware. It puts sophisticated language models in the hands of students, researchers, small businesses, and individuals worldwide.

The format exemplifies open-source collaboration at its best — created by the community, for the community, and continuously improved by thousands of contributors. It’s not controlled by any single company and works with models from any source.

As AI becomes increasingly central to how we work and create, formats like GGUF ensure that the technology remains accessible, private, and under user control. That’s the kind of future worth building.

Key Takeaways

GGUF is a highly optimized file format for running large AI models efficiently
Created by Georgi Gerganov and the llama.cpp team, introduced in August 2023
Uses quantization to compress models by 4–8x with minimal quality loss
Enables running sophisticated AI models on regular laptops and even smartphones
Self-contained format including all metadata, vocabulary, and model weights
Supported by a rich ecosystem of tools, applications, and thousands of models
Perfect for privacy-conscious users, offline applications, and cost-effective AI deployment

The next time you see a “.gguf” file extension, you’ll know it’s not just a model — it’s an entire movement toward making AI accessible to everyone.

Want to explore more about GGUF and local AI models? Check out the llama.cpp project on GitHub and the GGUF model collection on Hugging Face. The community is active, welcoming, and always happy to help newcomers get started.

Understanding Binary Files: A Beginner’s Guide to Reading Them in JavaScript

Patric — Thu, 18 Dec 2025 18:56:40 GMT

Have you ever wondered what’s really inside an image file, a PDF, or a video? Unlike the text files you’re used to working with, these are binary files — and they speak a different language. Let’s demystify what binary files are and learn how to read them using JavaScript.

What Exactly Is a Binary File?

Think of files like books written in different languages. A text file is like a book written in English — you can open it with any text editor and read it directly. The words make sense because they use the same alphabet you know.

A binary file, on the other hand, is like a book written in an ancient hieroglyphic system. You can’t just open it with a text editor and expect to understand what you see. Instead, you need special tools that know how to decode those symbols.

In technical terms: Binary files store data as sequences of bytes (numbers from 0 to 255) that represent anything from pixels in an image to audio frequencies to compressed data. They’re the most efficient way computers store complex information.

A binary file is like a long row of tiny numbered boxes (bytes).
Each box holds a number between 0 and 255.
Computers use these numbers as instructions to rebuild things like pictures, sounds, or videos.
It’s the fastest and most space-saving way for a computer to remember complex things.

Real-World Examples

Here are binary files you encounter every day:

Images: .jpg, .png, .gif — each byte might represent a pixel's color
Videos: .mp4, .avi — sequences of compressed video frames and audio
Documents: .pdf, .docx — formatted documents with embedded images and fonts
Audio: .mp3, .wav — sound wave data encoded as numbers
Executables: .exe, .app — programs your computer can run

Why Can’t You Just Read Them Like Text?

Let’s do a quick experiment. If you open a JPEG image in a text editor, you might see something like this:

ÿØÿàJFIFÿÛC��������������������

That gibberish happens because your text editor is trying to interpret raw bytes as letters. It’s like trying to read sheet music as if it were a novel — the symbols mean something, just not what you’re expecting them to mean.

How JavaScript Reads Binary Files

JavaScript gives us several powerful tools to work with binary data. Let’s explore them step by step.

The ArrayBuffer: Your Binary Data Container

An ArrayBuffer is like a raw storage box for binary data. It's just a fixed-length sequence of bytes sitting in memory.

// Create a buffer that can hold 8 bytes
const buffer = new ArrayBuffer(8);
console.log(buffer.byteLength); // 8

Think of an ArrayBuffer as a row of 8 boxes, each capable of holding one byte (a number from 0 to 255). But here’s the catch: you can’t directly put data into an ArrayBuffer. You need a “view” to interact with it.

Views: Looking at Your Data Different Ways

This is where it gets interesting. The same binary data can be interpreted in different ways, just like how the number “1000” could mean 1000 dollars, 1000 meters, or 10:00 on a clock depending on context.

JavaScript provides different “views” to read the same ArrayBuffer:

const buffer = new ArrayBuffer(8);

// View it as 8-bit unsigned integers (0-255)
const uint8View = new Uint8Array(buffer);
uint8View[0] = 255;

// View the SAME buffer as 16-bit integers
const uint16View = new Uint16Array(buffer);
console.log(uint16View[0]); // Reads the first 2 bytes together: 255 on little-endian platforms

Common views include:

Uint8Array — treats each byte as a number 0-255
Int16Array — treats pairs of bytes as numbers from -32,768 to 32,767
Float32Array — interprets 4 bytes as decimal numbers
DataView — lets you read different types from anywhere in the buffer
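
Here is a short sketch of why DataView is handy: it lets you choose the byte order explicitly on every read, which typed arrays do not. The four bytes are arbitrary example values.

```javascript
// Four example bytes: 0x12 0x34 0x56 0x78
const buffer = new ArrayBuffer(4);
new Uint8Array(buffer).set([0x12, 0x34, 0x56, 0x78]);

const view = new DataView(buffer);

// The second argument selects little-endian (true) or big-endian (false).
const bigEndian16 = view.getUint16(0, false);
const littleEndian16 = view.getUint16(0, true);
const bigEndian32 = view.getUint32(0, false);

console.log(bigEndian16.toString(16));    // "1234"
console.log(littleEndian16.toString(16)); // "3412"
console.log(bigEndian32.toString(16));    // "12345678"
```

File formats specify their own byte order, so reading them through a DataView keeps your code correct regardless of the machine it runs on.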

Reading a Real Binary File in JavaScript

Now let’s put this knowledge into practice. Here’s how you’d read an image file in the browser:

// HTML: <input type="file" id="fileInput">
document.getElementById('fileInput').addEventListener('change', async (event) => {
  const file = event.target.files[0];
  if (!file) return; // The user cancelled the file dialog
  
  // Read the file as an ArrayBuffer
  const arrayBuffer = await file.arrayBuffer();
  
  // Create a view to examine the bytes
  const bytes = new Uint8Array(arrayBuffer);
  
  // Look at the first few bytes (the "magic number")
  console.log('First 4 bytes:', 
    bytes[0], bytes[1], bytes[2], bytes[3]);
  
  // PNG files start with: 137, 80, 78, 71
  // JPEG files start with: 255, 216, 255
  if (bytes[0] === 255 && bytes[1] === 216) {
    console.log('This is a JPEG image!');
  }
});

The Magic Number Trick

Professional tip: most binary files start with a “magic number” — specific bytes that identify the file type. It’s like how books have ISBN numbers. By checking the first few bytes, you can determine what kind of file you’re dealing with.

A Practical Example: Building an Image Analyzer

Let’s create something useful — a tool that tells you basic information about an uploaded image:

async function analyzeImage(file) {
  const buffer = await file.arrayBuffer();
  const bytes = new Uint8Array(buffer);
  
  // Determine file type
  let fileType = 'Unknown';
  if (bytes[0] === 255 && bytes[1] === 216) {
    fileType = 'JPEG';
  } else if (bytes[0] === 137 && bytes[1] === 80) {
    fileType = 'PNG';
  } else if (bytes[0] === 71 && bytes[1] === 73) {
    fileType = 'GIF';
  }
  
  return {
    name: file.name,
    size: `${(file.size / 1024).toFixed(2)} KB`,
    type: fileType,
    totalBytes: bytes.length
  };
}

// Usage:
const info = await analyzeImage(myFile);
console.log(info);
// Output: { name: "photo.jpg", size: "245.32 KB", type: "JPEG", totalBytes: 251208 }

Reading Binary Data from the Web

You can also fetch binary files from URLs:

async function downloadBinaryFile(url) {
  const response = await fetch(url);
  const arrayBuffer = await response.arrayBuffer();
  const bytes = new Uint8Array(arrayBuffer);
  
  console.log(`Downloaded ${bytes.length} bytes`);
  return bytes;
}

// Download an image
const imageData = await downloadBinaryFile('https://example.com/photo.jpg');

Converting Between Formats

Sometimes you need to convert binary data to other formats:

// Binary to Base64 (useful for embedding images in HTML/CSS)
function binaryToBase64(bytes) {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Base64 to Binary
function base64ToBinary(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

Common Use Cases

Here’s when you’ll actually need to work with binary files:

File uploads: Reading files users select from their computer
Image processing: Manipulating pixels, applying filters, or converting formats
PDF generation: Creating documents programmatically
Audio/Video processing: Working with media files
Data compression: Creating zip files or compressing data
Cryptography: Encrypting and decrypting data
Network protocols: Sending/receiving binary data over WebSockets

Key Takeaways

Binary files are everywhere, and understanding how to work with them opens up a world of possibilities in JavaScript:

Binary files store data as raw bytes, not human-readable text
ArrayBuffer is your container for binary data
Typed arrays (like Uint8Array) let you read and manipulate those bytes
Different views interpret the same bytes in different ways
File magic numbers help identify file types
The File API and Fetch API make reading binary data straightforward

Where to Go from Here

Now that you understand the basics, you can explore:

Using libraries like pdfkit for PDF generation
Working with the Canvas API to manipulate image pixels
Exploring WebGL for 3D graphics (heavily reliant on binary data)
Building file converters or image processors
Learning about file compression algorithms

Binary files might have seemed mysterious at first, but they’re just another way of organizing data — and now you have the tools to read them. Happy coding!

Have questions about working with binary files in JavaScript? Drop them in the comments below!

Building RAG from Scratch: Understanding AI’s Knowledge Retrieval Without the Black Boxes

Patric — Sun, 30 Nov 2025 22:22:33 GMT

Ever wondered how ChatGPT or Claude can answer questions about your specific documents? The secret isn’t magic – it’s Retrieval-Augmented Generation (RAG). And if you’ve ever felt lost in a sea of abstraction when trying to understand it, I’ve got good news: you can build it yourself, from scratch, and actually understand what’s happening under the hood.

I just published rag-from-scratch, an open-source educational project that demystifies RAG by walking you through building it step by step, with no cloud APIs, no black boxes – just clear explanations and local code you can run and understand.

The Problem with “Just Use This Framework”

Most RAG tutorials follow a familiar pattern: import a framework, call a few functions, and voilà – magic happens. But what actually happened? How do embeddings work? Why does the retrieval sometimes fail? How can you debug or improve it?

The philosophy behind this project is simple: if you can explain it, you can build it. If you can build it, you can improve it. This is the same approach I took with my previous project, AI Agents from Scratch, which helped developers understand agentic AI by building it themselves.

What You’ll Actually Learn

RAG isn’t rocket science, but it involves several moving parts that work together:

The Core Pipeline:

Knowledge Requirements – Define what questions you need to answer and what data you need
Data Loading – Import and structure your documents
Text Splitting – Divide documents into manageable chunks
Embedding – Convert text into numerical vectors that capture meaning
Vector Store – Index embeddings for fast similarity search
Retrieval – Fetch the most relevant context for a query
Re-Ranking – Improve precision by reordering results
Augmentation – Merge retrieved context into the LLM’s prompt
Generation – Produce grounded answers using a local LLM

Each step is crucial, and each step is demystified in this repository.
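
To give a feel for one of these steps, here is a minimal sketch of text splitting with overlapping, fixed-size character chunks. It is my own simplified illustration, not the repository’s implementation; the function name and default sizes are assumptions.

```javascript
// Split text into fixed-size character chunks that overlap by `overlap` characters.
function splitText(text, chunkSize = 200, overlap = 50) {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(splitText("abcdefghij", 4, 2));
// [ "abcd", "cdef", "efgh", "ghij" ]
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, at the cost of storing some text twice.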

A Learning Path That Actually Works

The repository is structured as a progressive learning journey. You don’t start by building a production system – you start by understanding the fundamentals:

How RAG Really Works

Before touching embeddings or vector databases, you’ll see RAG in action with a minimal simulation in under 70 lines of code. This uses naive keyword search to demonstrate the core concept: retrieve context, then generate an answer. It’s simple, but it crystallizes the fundamental idea.
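
The idea fits in a few lines. The sketch below is my own stripped-down illustration of retrieve-then-generate with naive keyword matching, not the repository’s example; the documents are made up, and the generation step is left out, since the finished prompt would simply be handed to whatever LLM you use.

```javascript
// A tiny made-up knowledge base.
const documents = [
  "GGUF is a file format for storing quantized language models.",
  "RAG retrieves relevant context before generating an answer.",
  "ArrayBuffer holds raw binary data in JavaScript.",
];

// Lowercase the text and split it into alphanumeric words.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Score each document by how many of its words appear in the query.
function retrieve(query, docs, topK = 1) {
  const queryTerms = new Set(tokenize(query));
  return docs
    .map((doc) => ({ doc, score: tokenize(doc).filter((term) => queryTerms.has(term)).length }))
    .filter((result) => result.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((result) => result.doc);
}

// Augmentation: put the retrieved context in front of the question.
function buildPrompt(query, context) {
  return `Answer using only this context:\n${context.join("\n")}\n\nQuestion: ${query}`;
}

const question = "How does RAG retrieve context?";
console.log(buildPrompt(question, retrieve(question, documents)));
```

Keyword overlap misses synonyms and paraphrases entirely, which is exactly the weakness embeddings are meant to address.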

Understanding Embeddings

Instead of treating embeddings as a black box, you’ll learn the math behind them. How does “king – man + woman ≈ queen” actually work? What is cosine similarity, and why does it matter? You’ll implement text similarity from scratch before using any libraries.
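
Cosine similarity measures the angle between two vectors: values near 1 mean they point the same way, 0 means unrelated, and negative values mean opposite directions. Here is a minimal sketch; the two-dimensional “embeddings” are hand-made toy vectors (roughly [royalty, gender]) that I invented to make the analogy visible, not output from a real model.

```javascript
// Dot product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  const normA = Math.sqrt(dot(a, a));
  const normB = Math.sqrt(dot(b, b));
  if (normA === 0 || normB === 0) return 0; // no direction to compare
  return dot(a, b) / (normA * normB);
}

// Toy vectors: [royalty, gender]. Invented for illustration only.
const king = [1, 1];
const man = [0, 1];
const woman = [0, -1];
const queen = [1, -1];

// king - man + woman, computed component by component.
const analogy = king.map((value, i) => value - man[i] + woman[i]);

console.log(cosineSimilarity(analogy, queen)); // very close to 1
console.log(cosineSimilarity(analogy, man));   // negative: points away from "man"
```

Real embeddings have hundreds of dimensions and the match is only approximate, but the arithmetic is the same.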

Building Your Own Vector Store

You’ll build an in-memory vector store that actually stores embeddings and performs nearest-neighbor search. No magic – just arrays, distance calculations, and indexing logic you can see and understand.
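
As a rough idea of what that involves, here is a minimal brute-force sketch. It is my own illustration rather than the repository’s code: the class and method names are assumptions, the vectors are toy values, and search simply compares the query against every stored item.

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dotProduct = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dotProduct / Math.sqrt(normA * normB);
}

// Hypothetical in-memory store: an array of items plus a linear scan.
class InMemoryVectorStore {
  constructor() {
    this.items = [];
  }

  add(id, vector, text) {
    this.items.push({ id, vector, text });
  }

  // Score every item against the query and return the topK best matches.
  search(queryVector, topK = 3) {
    return this.items
      .map((item) => ({ id: item.id, text: item.text, score: cosine(queryVector, item.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

const store = new InMemoryVectorStore();
store.add("a", [1, 0], "about cats");
store.add("b", [0.9, 0.1], "mostly about cats");
store.add("c", [0, 1], "about taxes");

console.log(store.search([1, 0], 2).map((result) => result.id)); // [ "a", "b" ]
```

A linear scan is fine for a few thousand chunks; dedicated vector databases exist mainly to avoid it at larger scales.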

Advanced Retrieval Strategies

Once you understand the basics, you’ll level up with techniques that dramatically improve results:

Query preprocessing and normalization
Hybrid search strategies
Multi-query retrieval
Post-retrieval re-ranking to reduce noise

Each example includes three things:

Working code (`example.js`)
A detailed code explanation (`CODE.md`)
A conceptual explanation (`CONCEPT.md`)

Nothing is hidden. Every function is explained. Every concept is broken down.

Why Local? Why No Cloud APIs?

This project runs entirely on your machine using local LLMs (via `node-llama-cpp`). Why?

True Understanding – When you run code locally, you can debug it, inspect it, and truly understand what’s happening at each step
No Costs – Experiment freely without worrying about API bills
Privacy – Your documents never leave your machine
Complete Control – Modify, extend, and customize every component

This isn’t about building the fastest or most scalable RAG system. It’s about building understanding.

What’s Available Now (and What’s Coming)

The repository is actively being developed with an educational-first approach. Currently available:

✅ Core concepts (how RAG works, LLM basics)

✅ Data loading and text splitting

✅ Embeddings and similarity

✅ Vector store implementation

✅ Basic retrieval strategies

Coming soon:

🚧 Advanced retrieval techniques

🚧 Prompt engineering for RAG

🚧 Evaluation metrics

🚧 Graph database integration

🚧 Production-ready templates

Each topic will be added thoughtfully, with the same commitment to clarity and depth.

Why This Matters

We’re in an era where AI is rapidly becoming commoditized through APIs and frameworks. That’s powerful, but it creates a generation of developers who can use AI without truly understanding it. When things break (and they will), or when you need to optimize for your specific use case, that understanding becomes critical.

RAG is one of the most practical applications of LLMs today – it’s how we make AI useful for real-world knowledge tasks. Understanding how it works, not just how to call it, makes you a better AI engineer.

Try It Yourself

Getting started is simple:

git clone https://github.com/pguso/rag-from-scratch.git

cd rag-from-scratch

npm install

node examples/00_how_rag_works/example.js

Join the Journey

This project is open source and welcomes contributions. If you have a clear, educational example or improvement, pull requests are encouraged. The goal is to build the best educational resource for understanding RAG, one example at a time.

RAG doesn’t have to be a black box. You can understand it. You can build it. And once you do, you’ll be equipped to improve it, debug it, and adapt it to your needs.

Check out the repository: https://github.com/pguso/rag-from-scratch

If you can explain it, you can build it. If you can build it, you can improve it.

Let’s demystify RAG together.

Every AI Agent Tutorial Skips the Fundamentals. So I Built Them.

Patric — Mon, 27 Oct 2025 07:53:06 GMT

Four days ago, I published a GitHub repository. I expected maybe a few stars, some polite feedback. Instead, it exploded to over 700 stars, hundreds of upvotes on Reddit, and emails from developers saying “this is exactly what I needed.”

This isn’t a story about going viral. It’s about the lonely, frustrating journey that led there — and why sometimes the hardest path teaches you the most.

The Problem Nobody Talks About

Here’s what no one tells you when you start learning about AI agents:

You can follow every tutorial. Copy-paste every code snippet. Get everything working. And still have absolutely no idea what you’re doing.

I was that person. For months.

Every resource I found did the same thing — jumped straight into LangChain or CrewAI or some other framework. “Just use this library,” they’d say. “Look how easy it is!” And it was easy. Until something broke.

Then I was completely lost.

Was it the framework? The prompt? The model? The way I structured my code? I had no mental model. No understanding of what was actually happening under those nice, clean APIs.

I couldn’t debug. I couldn’t customize. I couldn’t build anything beyond what the tutorials showed me.

The Breaking Point

The moment I decided to change everything was when I spent three hours debugging an agent that wouldn’t use a tool properly. Three hours of tweaking prompts, reading documentation, checking GitHub issues.

I never figured it out. I just tried a different framework and hoped for the best.

That’s when I realized: I didn’t want to just use AI agents. I wanted to understand them.

Starting from Zero

So I did something that probably seems obvious in hindsight: I started over. From scratch.

No LangChain. No CrewAI. No frameworks at all.

Just me, node-llama-cpp, local models, and a lot of documentation reading.

The first few weeks were brutal. Without the framework abstractions, I had to figure out everything:

How does the model actually receive function definitions?
What format does function calling really use?
How does memory work at a fundamental level?
What is the ReAct pattern actually doing?

I made a spreadsheet of every agent concept I wanted to understand. Then I built tiny, focused examples for each one. No fancy features. No production-ready code. Just the absolute minimum needed to understand the concept.

Example 1: A basic LLM call. That’s it.
Example 2: System prompts and specialization.
Example 3: Streaming responses.

Each one built on the last. Each one forced me to understand one more piece of the puzzle.

The “Aha!” Moments

Around week six, something shifted.

I was building an example for function calling when it finally clicked. Function calling isn’t magic. It’s just structured output. The model returns JSON that matches a schema you provide, and you parse it and execute code.

That’s it. That’s the whole thing.

But understanding that one simple fact changed everything. Suddenly I could debug function calling issues. I could customize the behavior. I could see why frameworks did things certain ways.
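Here is a minimal sketch of that idea, written in Python for brevity (the repository itself uses plain JavaScript). The model call is replaced by a hard-coded JSON string, and the tool name `get_weather`, the JSON shape, and the `dispatch` helper are hypothetical names I picked for illustration. They are not taken from the repository or from any particular model's format.

```python
import json


# Hypothetical tool: name and signature are illustrative only.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"


# Registry that maps tool names to Python callables.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> str:
    """Parse the model's JSON tool call and run the matching function."""
    call = json.loads(model_output)  # structured output -> dict
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call["arguments"])  # execute with the parsed arguments


# Stand-in for what a model might return (assumed shape, not a real response).
fake_model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'
result = dispatch(fake_model_output)
print(result)
```

Everything a framework adds on top (retries, validation, feeding the result back to the model) is layered around this parse-and-dispatch step.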

Then ReAct patterns made sense. Memory systems made sense. Tool chaining made sense.

It was like learning to read. Once you understand the fundamentals, everything else is just combinations of things you already know.

Why I’m Sharing This

After months of trial and error, I had dozens of examples. Some were dead ends. Some were too complex. Some taught me a lot but wouldn’t help anyone else.

So I did something harder than building: I curated.

I picked eight examples that formed a perfect learning path. Each one focused on a single fundamental concept. Each one built naturally on the last. Each one I polished until the code was as clear as I could make it — not production-ready, but teaching-ready.

Plain JavaScript. No framework magic. Just the concepts you absolutely need to understand.

I almost didn’t publish it. “Who would want this?” I thought. “Everyone just uses frameworks.”

But I remembered how lost I felt. How every tutorial assumed knowledge I didn’t have. How desperately I wanted someone to just explain the fundamentals without jumping to abstractions.

So I published those eight examples, hoping they’d be a starting point. My vision wasn’t to create the definitive resource — it was to plant a seed. Let the community add or ask for examples where they see gaps. Let it evolve into the resource developers actually need to understand agents deeply before they jump into frameworks.

A living tutorial, shaped by the people learning from it.

So I put it on GitHub: ai-agents-from-scratch

What Happened Next

The response shocked me.

Within four days:

734 GitHub stars
76 forks
495 upvotes on Reddit
Dozens of comments, emails, and LinkedIn messages from developers saying “this is exactly what I needed”

But the message that meant the most came from a team lead who went through the entire tutorial and sent me detailed feedback on every example. He’s sharing it with his team.

That’s when I knew: I wasn’t alone in my frustration.

There are thousands of developers who want to understand AI agents, not just use them. Who want to know what’s happening under the hood. Who learn by building.

What You’ll Actually Learn

The repository covers eight progressive examples:

Basic LLM interaction — Understanding the foundation
System prompts — Making specialized agents
Streaming — Handling real-time responses
Translation agent — Applying concepts to real tasks
Function calling — The core of agent behavior
Batch processing — Handling multiple tasks efficiently
ReAct agent — The reasoning and acting pattern
Memory systems — Making agents remember context

Everything runs locally. You need Node.js and a GGUF model (I use Qwen 1.7B, which runs on modest hardware). No API keys. No cloud costs. Just you and the fundamentals.

Each example includes:

Heavily commented code that explains every decision
Concept explanations that connect to the bigger picture
Suggestions for experimentation and extension

The Philosophy

Here’s what makes this different from other tutorials:

No frameworks. You see exactly what’s happening at every step. No black boxes.

Progressive complexity. Each example introduces one new concept. No overwhelming you with everything at once.

Local-first. Run everything on your machine. Experiment without worrying about costs or rate limits.

Explanation over efficiency. The code isn’t optimized for speed. It’s optimized for understanding.

Who This Is For

You might find this useful if:

You’ve used LangChain but don’t understand what it’s doing
You want to build custom agents but don’t know where to start
You’re tired of tutorials that skip the fundamentals
You learn by actually building things
You want to know why patterns work, not just that they work

What I Learned About Learning

Building this taught me something important about technical education:

Sometimes the fastest path to understanding is the slowest path to results.

Frameworks are amazing. They let you build complex systems quickly. But if you start with frameworks, you’re building on a foundation you don’t understand.

When you build from scratch — even if it’s harder, even if it takes longer — you develop intuition. You understand trade-offs. You can debug. You can customize. You can innovate.

The irony is that now I use frameworks all the time. But I use them differently. I know when to lean on them and when to go around them. I can read their source code and understand what’s happening. I can contribute improvements.

That only happened because I took the time to understand the fundamentals.

Start Here

If you’re ready to really understand AI agents, here’s what I’d suggest:

Clone the repository
Download a small GGUF model (instructions are in DOWNLOAD.md)
Start with intro.js and work through the examples in order
Don’t rush. Take time to modify and experiment with each one
Break things. That’s how you learn what’s actually happening.

The journey from confusion to clarity isn’t quick. But it’s worth it.

A Final Thought

When I started this journey, I thought I was alone in my frustration. The response to this repository showed me I wasn’t.

There are thousands of us trying to understand this technology deeply. Not just to use it, but to build on it. To push it forward. To know how it really works.

If you’re one of those people, this is for you.

Let me know what you build.

Find the repository at: github.com/pguso/ai-agents-from-scratch

Questions? Issues? Contributions? The repo is actively maintained and I read every comment.

Still Using Google Colab? It’s Time to Grow Up

Patric — Fri, 03 Oct 2025 14:08:19 GMT

Look, we need to talk. I know Google Colab was there for you when you were just starting out. It was free, it was simple, and it gave you your first taste of GPU computing without asking for a credit card. That’s beautiful. Really.

But let’s be honest: you’re not a beginner anymore. You’re doing serious work now. And yet, you’re still sitting there, refreshing your browser every 30 minutes to keep your session alive, praying that your training run doesn’t get preempted at hour 11 of your 12-hour limit. You’re managing a spreadsheet to track your “compute units” like some kind of medieval currency exchange. You’re emailing .ipynb files to your teammates like it's 2015.

It’s time to meet Modal Notebooks. And no, this isn’t just another cloud notebook. This is what happens when someone actually thinks about developer experience in 2025.

The Reality Check: What You’re Actually Paying

Let’s start with the uncomfortable truth about Google Colab’s pricing. Because once you scratch beneath that “free tier” marketing veneer, things get… interesting.

Google Colab’s Compute Unit Shell Game

Colab charges you in “compute units” that cost $0.10 each, bundled in packs of 100 for $10. The T4 GPU burns through 1.96 units per hour, the V100 uses 5 units per hour, and the A100 demolishes 15 units per hour. But here’s the kicker: even installing Python libraries consumes your compute units. Yes, you read that right. Setting up your environment costs money.

Let’s do the math for a typical research workflow:

Google Colab Pro ($9.99/month + compute units):

Base subscription: $9.99/month
100 compute units included
T4 GPU: ~51 hours maximum (but good luck getting guaranteed access)
V100 GPU: ~20 hours maximum
A100 GPU: ~6.7 hours maximum
After you burn through units: Pay another $10 for 100 more
Reality check: Environment setup, idle time, and failed experiments all eat your units

Effective hourly rates once your “free” units run out:

T4: ~$0.20/hour (but availability not guaranteed)
V100: ~$0.50/hour
A100: ~$1.50/hour
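If you'd rather check that math than take my word for it, here is a small Python sketch. It uses only the unit price and burn rates quoted above; the function names are my own, and real Colab burn rates may differ from these figures.

```python
UNIT_PRICE = 0.10  # dollars per compute unit, as quoted above

# Units burned per hour, as quoted above (may change over time).
UNITS_PER_HOUR = {"T4": 1.96, "V100": 5.0, "A100": 15.0}


def hours_from_units(units: float, gpu: str) -> float:
    """How many GPU hours a given number of compute units buys."""
    return units / UNITS_PER_HOUR[gpu]


def hourly_rate(gpu: str) -> float:
    """Effective dollars per hour once you are buying units."""
    return UNITS_PER_HOUR[gpu] * UNIT_PRICE


for gpu in UNITS_PER_HOUR:
    print(f"{gpu}: {hours_from_units(100, gpu):.1f} h per 100 units, "
          f"${hourly_rate(gpu):.2f}/hour")
# T4: 51.0 h per 100 units, $0.20/hour
# V100: 20.0 h per 100 units, $0.50/hour
# A100: 6.7 h per 100 units, $1.50/hour
```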

Modal: Actual Transparent Pricing

Modal gives you $30 in free compute credits every month, and then you pay only for the exact compute you use, measured per second. No subscriptions. No compute unit conversion charts. No PhD in pricing models required.

Modal Notebooks (Pay for what you use):

$30 free credits monthly (triple the $10 value of the 100 units included with Colab Pro)
T4 GPU: $0.59/hour ($0.000164/second)
L4 GPU: $0.80/hour ($0.000222/second)
A10G GPU: $1.10/hour ($0.000306/second)
A100 (40GB): $2.10/hour ($0.000583/second)
A100 (80GB): $2.50/hour ($0.000694/second)
H100: $3.95/hour ($0.001097/second)

The difference? You only pay when your kernel is actually running. No zombie sessions draining your wallet. When you stop, you stop paying. Immediately.

The Cost Comparison: Real World Scenarios

Scenario 1: The Weekend Warrior

You’re fine-tuning a model over the weekend. Let’s say 20 hours on an A100.

Colab Pro:

Base subscription: $9.99
Compute units needed: 300 units (20 hours × 15 units/hour), of which 100 are included in the subscription
Cost: $9.99 + $20 for 200 extra units = $29.99
Plus you spent units just installing dependencies

Modal:

20 hours on A100 (40GB): $2.10 × 20 = $42.00
But you have $30 in free credits = $12.00 total
And you didn’t pay for setup time or idle sessions
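The same comparison as a rough Python sketch, in case you want to plug in your own hours. It assumes the prices quoted in this article, counts the 100 units included with Colab Pro toward the run, and assumes extra units are only sold in packs of 100. The function names are mine, and neither function models setup time or idle sessions.

```python
import math


def colab_cost(hours, units_per_hour, base=9.99, included_units=100,
               pack_units=100, pack_price=10.0):
    """Subscription plus however many extra unit packs the run needs."""
    needed = hours * units_per_hour
    extra = max(0.0, needed - included_units)
    packs = math.ceil(extra / pack_units)
    return base + packs * pack_price


def modal_cost(seconds, hourly_rate, free_credits=30.0):
    """Per-second billing, minus the monthly free credits."""
    usage = seconds * hourly_rate / 3600
    return max(0.0, usage - free_credits)


weekend_hours = 20
print(f"Colab Pro: ${colab_cost(weekend_hours, 15):.2f}")
print(f"Modal:     ${modal_cost(weekend_hours * 3600, 2.10):.2f}")
# Colab Pro: $29.99
# Modal:     $12.00
```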

Scenario 2: The Production Researcher

You’re running experiments across different GPU types, switching as needed for optimal cost/performance.

Colab Pro:

You’re locked into whatever GPU they give you
T4 availability under Colab Pro is not guaranteed, often necessitating costlier alternatives
You can’t easily switch mid-session
You’re paying compute units while you context-switch between notebooks

Modal:

Switch GPU types in under 5 seconds
Run T4 for data preprocessing: $0.59/hour
Switch to H100 for training: $3.95/hour
Drop back to L4 for inference testing: $0.80/hour
Pay only for what each job actually needs

Scenario 3: The Team Player

Your research team of 4 people needs to collaborate on model development.

Colab Pro:

4 × $9.99 = $39.96/month in subscriptions
4 × 100 compute units = 400 units = ~26 hours on A100 total
Collaboration means emailing notebooks back and forth
Everyone maintains their own environment
Total chaos when someone’s units run out mid-week

Modal:

True collaborative editing with multiple cursors and live edits, like Google Docs
4 × $30 = $120 in free credits monthly
That’s 48 hours of A100 (80GB) time free, or 203 hours of T4
Shared Volumes, Secrets, and Functions across the entire team
Everyone sees the same environment, instantly

Beyond Pricing: Why Modal Actually Works Better

Cold Start Times That Don’t Make You Age

Modal Notebooks boot in under 5 seconds, even with custom container images and GPU allocation. Five. Seconds.

Colab? Cloud instances can take minutes to spin up. You know the drill: click “Connect,” go make coffee, come back, hope it worked.

No More Session Babysitting

Colab free tier limits sessions to 12 hours maximum, and even paid tiers enforce limits. You’re constantly watching the clock, manually saving checkpoints, praying your training completes.

Modal kernels auto-idle and resume, so you only pay when they’re actually running. Close your laptop. Go home. Come back tomorrow. Your work is right where you left it, and you didn’t pay a cent while you were gone.

Real Collaboration, Not File Tennis

Be honest: how do you currently share notebooks with your team? Email? Slack? Google Drive? Then someone makes changes, you make changes, now there are three versions floating around, and nobody knows which one has the latest results?

Modal offers true real-time collaboration where multiple users can edit and run cells simultaneously, seeing each other’s cursors in real-time. It’s 2025. Your collaboration tools should work like Google Docs, not like sneakernet.

The Path to Production Isn’t a Rewrite

Here’s the thing that really matters: Modal Notebooks integrate with the same Volumes, Secrets, and deployed Functions as your production Modal Apps. Your notebook experiment isn’t some isolated sandbox that you’ll need to completely rewrite to deploy. It’s already running on the same infrastructure as your production code.

That model you just trained? Export it to a Modal App with one click. Now it’s a production API endpoint. No translation needed.

GPUs That Actually Exist

Modal lets you scale up to 8× H100 or B200 GPUs. When’s the last time you got access to cutting-edge hardware on Colab? The types of GPUs available in Colab vary over time, and premium GPUs are subject to availability. Translation: you might get a T4 when you paid for something better.

Modal? You pick the GPU. It spins up. Every time.

The Bottom Line

Google Colab was revolutionary when it launched. It democratized access to GPU computing and helped countless students and researchers get started with deep learning. That’s genuinely wonderful.

But “good for getting started” isn’t the same as “good for serious work.”

You’ve outgrown it. Your projects are more complex. Your deadlines are real. Your team needs to collaborate. Your experiments need reproducibility. Your models need a path to production.

Modal gives you $30 in free compute credits monthly — more than Colab’s effective free tier. After that, you pay transparent per-second pricing with no unit conversion gymnastics. You get sub-5-second cold starts. You get real-time collaboration. You get a direct path from research to production. You get to actually choose your hardware.

Most importantly: you get to stop babysitting sessions, stop tracking compute units in spreadsheets, and stop apologizing to your teammates about “sorry, my session died again.”

It’s time to grow up. Your compute environment should too.

Ready to make the switch? Get started with Modal Notebooks at modal.com/notebooks and use your $30 in free monthly credits to see the difference yourself. Your future self will thank you.

Understanding Attention Mechanisms: The Secret Sauce Behind Modern AI

Patric — Fri, 03 Oct 2025 13:07:35 GMT

A Step-by-Step Guide to Self-Attention with Working Code

If you’ve ever wondered how ChatGPT and other large language models understand context so well, the answer lies in a deceptively simple yet powerful concept: attention mechanisms.

The best part? You can build the core mechanism from scratch in just 50 lines of code.

Let’s dive into how this breakthrough technology works — and we’ll even code it from scratch using PyTorch.

Before You Begin

You’ll need:

Basic Python (reading code, understanding variables and functions)
High school math (what vectors are, basic multiplication)

You don’t need:

Deep learning experience
PyTorch knowledge

The Challenge: Understanding Long Conversations

Imagine you’re at a dinner party, listening to someone tell a long, winding story about their vacation. As they talk about what happened on day seven, you need to remember details from day one to understand the full picture. Now imagine trying to compress that entire story into a single mental note — you’d lose crucial details, especially from the beginning.

This was exactly the problem early AI systems faced when processing language.

The Old Approach: Encoder-Decoder Architecture

Before transformers took over, recurrent neural networks (RNNs) were the go-to technology for tasks like language translation. Here’s how they worked:

The Encoder would read an input sentence word by word — let’s say “The restaurant serves incredible pasta” — building understanding as it went. By the final word, it compressed the entire meaning into a single summary vector.

The Decoder would take this compressed summary and generate the output, producing one word at a time in the target language.

Think of it like this: reading an entire news article (encoding), then explaining it to a friend (decoding) based solely on what you remember.

The Fatal Flaw

Here’s the problem: squashing an entire sentence into one vector is like trying to capture a symphony in a single musical note. You inevitably lose information — particularly from earlier parts of longer sequences.

For short sentences like “Hello, how are you?” this worked fine. But for complex sentences with multiple clauses and subtle meanings? The model would forget important context by the time it reached the end.

The key insight: RNNs had a fundamental bottleneck, and this limitation sparked one of AI’s biggest breakthroughs.

The Breakthrough: Self-Attention

In 2017, researchers introduced the transformer architecture with a revolutionary idea: what if we didn’t need to compress everything into one vector at all?

Instead of forcing words through a narrow bottleneck, self-attention allows every word to directly interact with every other word in the sequence. Each word can “look at” and gather information from all the others simultaneously.

A Practical Example

Consider the sentence: “The chef prepared the meal because she loved cooking.”

When processing the word “she,” a human immediately knows it refers to “chef.” Self-attention gives AI this same ability — the word “she” can directly “attend to” or focus on “chef” to understand the connection.

This doesn’t just work for pronouns. Every word examines every other word to understand:

Which words are most relevant to its meaning
How it relates to the overall sentence structure
What context it needs to be properly understood

Why “Self” Attention?

The “self” in self-attention is important — it means the mechanism examines relationships within a single sequence.

Self-attention: Looking at words within one sentence (like “The chef prepared the meal”)

Regular attention (often called cross-attention): Comparing words between two different sequences (like matching English words to their French translations)

For building language models that predict the next word, self-attention is what we need.

Coding Self-Attention From Scratch

Let’s implement a simple version of self-attention to see exactly how it works. We’ll use the sentence:

“Music brings people pure joy”

Each word will be represented as a 3-dimensional embedding vector (in real models, these are typically 768 or more dimensions).

Step 1: Set Up the Input Embeddings

import torch

# Input embeddings: each row represents one word
inputs = torch.tensor(
    [[0.21, 0.45, 0.78],  # Music   (x^1)
     [0.63, 0.29, 0.91],  # brings  (x^2)
     [0.48, 0.72, 0.34],  # people  (x^3)
     [0.85, 0.19, 0.56],  # pure    (x^4)
     [0.37, 0.88, 0.42]]  # joy     (x^5)
)
print("Input shape:", inputs.shape)  # torch.Size([5, 3])

Each row is a word’s embedding — a vector of numbers that represents its meaning in a high-dimensional space.

Step 2: Calculate Attention Scores for One Query Word

Let’s focus on the word “brings” and see how much it should attend to each word in the sentence.

# Select "brings" as our query word (index 1)
query_word = inputs[1]

# Calculate attention scores: dot product with all words
attention_scores = torch.matmul(inputs, query_word)
print(f"Attention scores for 'brings':")
print(attention_scores)

# Output: tensor([0.9726, 1.3091, 0.8206, 1.1002, 0.8705])

The dot product measures how similar or aligned two vectors are: multiply their corresponding numbers together and add up the products. The result is a single number, and for vectors of similar length a higher value means they point in more similar directions.

What’s happening here?

For each word, we compute the dot product:

Music: (0.21 × 0.63) + (0.45 × 0.29) + (0.78 × 0.91) = 0.9726
brings: (0.63 × 0.63) + (0.29 × 0.29) + (0.91 × 0.91) = 1.3091
people: (0.48 × 0.63) + (0.72 × 0.29) + (0.34 × 0.91) = 0.8206
pure: (0.85 × 0.63) + (0.19 × 0.29) + (0.56 × 0.91) = 1.1002
joy: (0.37 × 0.63) + (0.88 × 0.29) + (0.42 × 0.91) = 0.8705

Higher scores indicate stronger relationships. Notice “brings” has the highest score with itself (1.3091), which makes sense!

Step 3: Normalize Scores into Attention Weights

Raw scores aren’t very useful — we need to convert them into probabilities that sum to 1.0. We use the softmax function for this:

# Convert scores to normalized weights (probability distribution)
attention_weights = torch.softmax(attention_scores, dim=0)

print("Attention weights:")
print(attention_weights)
print(f"Sum of weights: {attention_weights.sum():.4f}")  # Should equal 1.0

# Output:
# tensor([0.1887, 0.2643, 0.1621, 0.2144, 0.1704])
# Sum of weights: 1.0000

Now we have a probability distribution! These weights tell us: “When understanding ‘brings,’ pay 26.4% attention to itself, 21.4% to ‘pure,’ 18.9% to ‘Music,’ and so on.”

Step 4: Create the Context Vector

The context vector is a weighted combination of all input embeddings:

# Compute context vector for "brings"
# This is a weighted sum of all word embeddings
context_vector = torch.matmul(attention_weights, inputs)

print("Context vector for 'brings':")
print(context_vector)

# Output: tensor([0.5293, 0.4690, 0.6345])

This context vector is richer than the original embedding for “brings” because it incorporates information from the entire sentence!
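If you don't have PyTorch installed, Steps 2 to 4 can be reproduced with nothing but the standard library. The helper names `dot` and `softmax` below are my own, not part of any library. Treat this as a sketch for checking the arithmetic by hand, not a replacement for the tensor code.

```python
import math

inputs = [
    [0.21, 0.45, 0.78],  # Music
    [0.63, 0.29, 0.91],  # brings
    [0.48, 0.72, 0.34],  # people
    [0.85, 0.19, 0.56],  # pure
    [0.37, 0.88, 0.42],  # joy
]


def dot(a, b):
    """Multiply matching elements and add up the products."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Exponentiate, then divide by the total so the results sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow; result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


query = inputs[1]  # "brings"

scores = [dot(row, query) for row in inputs]  # Step 2: attention scores
weights = softmax(scores)                     # Step 3: attention weights
context = [                                   # Step 4: weighted sum of embeddings
    sum(w * row[d] for w, row in zip(weights, inputs))
    for d in range(len(query))
]

print("scores: ", [round(s, 4) for s in scores])
print("weights:", [round(w, 4) for w in weights])
print("context:", [round(c, 4) for c in context])
```

The printed values should agree with what PyTorch computes, up to floating-point rounding.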

Step 5: Scale to All Words at Once

In practice, we want context vectors for every word simultaneously. We can compute all attention scores in one matrix multiplication:

# Compute attention scores for ALL query-key pairs at once
# Result is a 5×5 matrix where element [i, j] shows how much
# word i should attend to word j
attention_scores = torch.matmul(inputs, inputs.T)

print("Attention scores (all words):")
print(attention_scores)
# Output: 5×5 matrix showing all word-to-word relationships

# Normalize each row to get attention weights
# Each row sums to 1.0, creating a probability distribution per word
attention_weights = torch.softmax(attention_scores, dim=-1)

print("\nAttention weights matrix (5×5):")
print(attention_weights)
print(f"\nSum of each row: {attention_weights.sum(dim=1)}")
# Output: tensor([1., 1., 1., 1., 1.]) - each row sums to 1!

# Compute all context vectors in one operation
all_context_vectors = attention_weights @ inputs

print("\nAll context vectors:")
print(all_context_vectors)
# Output: 5×3 matrix where each row is a word's enriched context vector

What we’ve achieved:

We started with 5 word embeddings (5×3 matrix) and transformed them into 5 context vectors (also 5×3), where each context vector contains information from all words in the sentence, weighted by relevance.

Understanding the Output

Let’s break down what the attention weights matrix tells us:

          Music  brings  people  pure   joy
Music     [0.21,  0.24,  0.18,  0.18,  0.20]
brings    [0.19,  0.26,  0.16,  0.21,  0.17]
people    [0.18,  0.20,  0.21,  0.18,  0.23]
pure      [0.17,  0.25,  0.17,  0.24,  0.17]
joy       [0.18,  0.20,  0.21,  0.17,  0.24]

Each row shows how much that word attends to every word in the sentence, itself included (values are rounded to two decimals, so a row may not add up to exactly 1.00). For example:

“brings” pays the most attention to itself (26%), followed by “pure” (21%)
“Music” actually attends slightly more to “brings” (24%) than to itself (21%), because raw dot products favor longer vectors and “brings” has the largest embedding
Each word considers the full sentence context when forming its meaning

Understanding the Context Vectors

Let’s break down what the final context vectors represent:

         Original Input      →  Context Vectors (enriched)
Music:   [0.21, 0.45, 0.78]  →  [0.50, 0.50, 0.62]
brings:  [0.63, 0.29, 0.91]  →  [0.53, 0.47, 0.63]
people:  [0.48, 0.72, 0.34]  →  [0.51, 0.53, 0.59]
pure:    [0.85, 0.19, 0.56]  →  [0.54, 0.47, 0.62]
joy:     [0.37, 0.88, 0.42]  →  [0.50, 0.54, 0.59]

Each context vector is now a blend of all words in the sentence, weighted by their attention weights. Because the weights in this toy example are fairly uniform, every context vector lands close to the plain average of the inputs, but no two are identical. For example:

“Music” started as [0.21, 0.45, 0.78] but now incorporates 24% of “brings,” 20% of “joy,” and so on
“brings” transformed from [0.63, 0.29, 0.91] into a vector that includes 26% of itself, 21% of “pure,” and 19% of “Music”
Each word’s context vector is no longer isolated. It now carries information about the entire sentence’s meaning

This is the magic of self-attention: words that started with different meanings now have representations that are contextually aware of their neighbors.

The Real-World Impact

This simple mechanism — letting words attend to each other — solved the bottleneck problem that plagued RNNs. Instead of compressing everything into one vector and hoping nothing important gets lost, transformers maintain all information and let the model decide what’s relevant at each step.

What We Left Out (For Now)

This simplified version works, but real transformers add several enhancements:

Query, Key, and Value transformations: Instead of using raw embeddings, we apply learned weight matrices to create specialized query, key, and value vectors
Multiple attention heads: The model can focus on different types of relationships simultaneously (syntax, semantics, etc.)
Scaled dot-product attention: We divide by the square root of the embedding dimension to prevent extremely large values
Causal masking: For language generation, we prevent words from attending to future words

But the core concept remains exactly what we’ve implemented: computing how much each word should attend to every other word, then creating enriched representations based on those attention weights.
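As a taste of two of those enhancements, here is a rough pure-Python sketch that adds scaling and causal masking to the simplified mechanism. It still uses the raw embeddings as queries, keys, and values (no learned weight matrices, a single head), and the function name `causal_scaled_attention` is my own. Masking is done by only scoring the visible positions, which gives the same result as setting future scores to negative infinity before the softmax.

```python
import math


def softmax(scores):
    shift = max(scores)  # numerical stability; does not change the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_scaled_attention(inputs):
    """Self-attention with 1/sqrt(d) scaling and a causal mask.

    Assumption: queries, keys, and values are all the raw embeddings.
    """
    d = len(inputs[0])
    scale = math.sqrt(d)
    all_weights, all_context = [], []
    for i, query in enumerate(inputs):
        visible = inputs[: i + 1]  # position i may only look at positions 0..i
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in visible]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)]
        # Pad with zeros so every row has one weight per word; future words get 0.
        all_weights.append(weights + [0.0] * (len(inputs) - len(weights)))
        all_context.append(context)
    return all_weights, all_context


inputs = [
    [0.21, 0.45, 0.78],  # Music
    [0.63, 0.29, 0.91],  # brings
    [0.48, 0.72, 0.34],  # people
    [0.85, 0.19, 0.56],  # pure
    [0.37, 0.88, 0.42],  # joy
]
weights, context = causal_scaled_attention(inputs)
for row in weights:
    print([round(w, 2) for w in row])
```

The printed matrix is lower-triangular: the first word can only attend to itself, while the last word sees the whole sentence.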

The Bottom Line

Self-attention is elegant in its simplicity. With just a few matrix multiplications, we can capture complex relationships between words that would be impossible to encode in a single fixed-size vector.

This breakthrough enabled:

GPT models that can write coherent long-form content
Translation systems that handle complex sentences with ease
AI assistants that maintain context throughout long conversations

Now that you’ve implemented it yourself, you understand the core mechanism behind modern AI. Everything else — deeper architectures, more sophisticated training methods — builds on this foundation.

Want to experiment further? Try modifying the input embeddings or using longer sentences. The same code works whether you have 5 words or 500, though the score matrix has one entry per pair of words, so compute and memory grow with the square of the sequence length.

Running LLMs on Modal: GPU-Powered Inference That Scales to Zero

Patric — Mon, 22 Sep 2025 16:28:08 GMT

From expensive 24/7 GPU servers to pay-per-token cloud inference — the serverless revolution hits AI

Running Large Language Models (LLMs) in production has traditionally meant one thing: expensive, always-on GPU servers burning through your budget even when no one’s using them. AWS charges you $3+ per hour for a decent GPU instance whether it’s processing requests or sitting idle. Scale that across multiple models or regions, and you’re looking at thousands of dollars monthly before serving your first user.

What if you could run the same powerful LLMs but only pay for the exact seconds you use them? What if your inference infrastructure could scale from zero to handling hundreds of concurrent requests automatically? And what if setting this up took minutes, not weeks of DevOps work?

That’s exactly what we’ll build in this guide using Modal and llama.cpp. We’ll deploy a production-ready LLM inference service that streams tokens in real-time, automatically scales based on demand, and costs a fraction of traditional GPU hosting.

Why Modal + llama.cpp is a Game Changer

Traditional LLM Hosting Problems:

GPU servers cost $2,000-$5,000+ per month, even when idle
Complex setup with Docker, Kubernetes, load balancers
Manual scaling means either wasted resources or poor performance
Long cold start times when scaling up
Managing CUDA versions, drivers, and dependencies

Our Modal Solution:

Pay only for inference time (seconds, not hours)
Automatic scaling from 0 to infinity
Cold starts of 0 to 2.5 seconds with GPU snapshots
Zero infrastructure management
Built-in streaming and parallel request handling
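To see why paying per second matters, here is a back-of-the-envelope sketch in Python. The $3/hour rate is the figure mentioned above; the traffic level (two busy hours per day) and the assumption that both options cost the same per hour are made up for illustration, and cold starts and other overhead are ignored.

```python
HOURS_PER_MONTH = 730  # average month


def always_on_cost(hourly_rate: float) -> float:
    """A server that bills around the clock, busy or idle."""
    return hourly_rate * HOURS_PER_MONTH


def scale_to_zero_cost(hourly_rate: float, busy_seconds: float) -> float:
    """Per-second billing: only the seconds spent on inference count."""
    return hourly_rate / 3600 * busy_seconds


# Illustrative inputs, not measurements.
hourly_rate = 3.00        # dollars per GPU hour
busy_hours_per_day = 2    # hypothetical traffic
busy_seconds = busy_hours_per_day * 3600 * 30

always_on = always_on_cost(hourly_rate)
serverless = scale_to_zero_cost(hourly_rate, busy_seconds)
print(f"always-on:     ${always_on:,.2f}/month")
print(f"scale-to-zero: ${serverless:,.2f}/month")
# always-on:     $2,190.00/month
# scale-to-zero: $180.00/month
```

The gap shrinks as utilization rises. A GPU that is busy around the clock costs the same either way under these assumptions.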

Let’s build it.

The Complete LLM Inference Setup

Here’s our full implementation that handles everything from model downloading to streaming inference:

from typing import Optional
from pathlib import Path
import modal

app = modal.App("llms-llama-cpp")

# Configuration
MODEL = "Qwen3-Coder-30B-A3B-Instruct-Q2_K.gguf"
GPU_CONFIG = "A10"
LLAMA_CPP_RELEASE = "b4568"
MINUTES = 60  # seconds, so timeouts can be written as N * MINUTES

# CUDA environment setup
cuda_version = "12.4.0"
flavor = "devel"  # includes full CUDA toolkit
operating_sys = "ubuntu22.04"
tag = f"{cuda_version}-{flavor}-{operating_sys}"

# Build the inference image with llama.cpp
image = (
    modal.Image.from_registry(f"nvidia/cuda:{tag}", add_python="3.12")
    .apt_install("git", "build-essential", "cmake", "curl", "libcurl4-openssl-dev")
    .run_commands("git clone https://github.com/ggerganov/llama.cpp")
    .run_commands(
        "cmake llama.cpp -B llama.cpp/build "
        "-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON "
    )
    .run_commands(
        "cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli"
    )
    .run_commands("cp llama.cpp/build/bin/llama-* llama.cpp")
    .entrypoint([])
)

This image setup is doing some heavy lifting:

Starts with NVIDIA’s CUDA development image
Compiles llama.cpp from source with CUDA support
Builds the optimized CLI tools we need for inference

Persistent Model Storage

The key to cost-effective LLM serving is avoiding expensive model re-downloads. Modal Volumes solve this perfectly:

# Persistent storage for our models
model_cache = modal.Volume.from_name("llamacpp-cache", create_if_missing=True)
cache_dir = "/root/.cache/llama.cpp"

download_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("huggingface_hub[hf_transfer]==0.26.2")
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)


@app.function(
    image=download_image,
    volumes={cache_dir: model_cache},
    timeout=1 * MINUTES
)
def download_model(repo_id, allow_patterns, revision: Optional[str] = None):
    from huggingface_hub import snapshot_download
    print(f"🦙 downloading model from {repo_id} if not present")
    snapshot_download(
        repo_id=repo_id,
        revision=revision,
        local_dir=cache_dir,
        allow_patterns=allow_patterns,
    )
    model_cache.commit()  # persist to Modal Volume
    print("🦙 model loaded")

Why this matters:

Models are downloaded once and shared across all function instances
No bandwidth costs for repeated downloads
Faster cold starts since model weights are already available
Volume persists even when no functions are running (zero cost when idle)
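
To make the "downloaded once" behavior visible, you could check the mounted cache directory before triggering a download. The following is a minimal sketch, not part of Modal or huggingface_hub: the helper name cached_files is an assumption, and a temporary directory stands in for the mounted Volume.

```python
import fnmatch
import tempfile
from pathlib import Path


def cached_files(cache_dir: str, pattern: str) -> list[str]:
    """Return the names of files under cache_dir that match a glob pattern."""
    root = Path(cache_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name
        for p in root.rglob("*")
        if p.is_file() and fnmatch.fnmatch(p.name, pattern)
    )


# Demo: a throwaway directory plays the role of the mounted Volume.
with tempfile.TemporaryDirectory() as tmp:
    before = cached_files(tmp, "*Q6_K.gguf")       # nothing cached yet
    (Path(tmp) / "demo-Q6_K.gguf").write_bytes(b"")  # pretend a download happened
    after = cached_files(tmp, "*Q6_K.gguf")

print("before:", before)
print("after:", after)
```

Inside a Modal function you would pass the mounted path (cache_dir in the code above) and skip the download when the list is non-empty.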

Real-Time Streaming Inference

Here’s where the magic happens — streaming tokens as they’re generated:

@app.function(
    image=image,
    volumes={cache_dir: model_cache},
    gpu=GPU_CONFIG,
    timeout=1 * MINUTES,
)
def llama_cpp_stream(
        prompt: Optional[str] = None,
        model = MODEL,
        n_predict: int = -1
):
    import subprocess

    if prompt is None:
        prompt = "Write a Python function to calculate fibonacci numbers:"
    args = ["--threads", "8"]
    n_gpu_layers = 64  # Use GPU for maximum layers
    command = [
        "/llama.cpp/llama-cli",
        "--model", f"{cache_dir}/{model}",
        "--n-gpu-layers", str(n_gpu_layers),
        "--prompt", prompt,
        "--n-predict", str(n_predict),
    ] + args
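    import tempfile

    # llama.cpp logs heavily to stderr. Send it to a temp file so an unread
    # pipe cannot fill up and block the process while we consume stdout.
    with tempfile.TemporaryFile(mode="w+") as stderr_file:
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=stderr_file,
            text=True,
            bufsize=1,  # line-buffered
        )
        for line in process.stdout:
            yield line  # stream each line as soon as llama.cpp emits it
        process.wait()
        if process.returncode != 0:
            stderr_file.seek(0)
            raise RuntimeError(f"llama.cpp failed: {stderr_file.read()}")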

This function streams output line by line as llama.cpp produces it, so users see text immediately instead of waiting for the entire response.
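
If you want to try the streaming pattern without a GPU, here is a small self-contained sketch. It is an illustration, not the article's deployment code: the helper name stream_lines is an assumption, and a child Python process stands in for llama-cli.

```python
import subprocess
import sys
from typing import Iterator


def stream_lines(command: list[str]) -> Iterator[str]:
    """Yield a child process's stdout one line at a time."""
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # discard logs so an unread pipe can't block
        text=True,
        bufsize=1,  # line-buffered
    )
    assert process.stdout is not None
    for line in process.stdout:
        yield line
    if process.wait() != 0:
        raise RuntimeError(f"command exited with code {process.returncode}")


# Stand-in for llama-cli: a child process that prints three lines.
child_code = "for i in range(3): print('chunk', i, flush=True)"
chunks = list(stream_lines([sys.executable, "-c", child_code]))
print(chunks)
```

Because the generator yields per line, the granularity of the stream depends on when the child process writes newlines.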

Batch Inference with GPU Snapshots

For maximum performance and cost optimization, we can use Modal’s GPU snapshots:

@app.function(
    image=image,
    volumes={cache_dir: model_cache},
    gpu=GPU_CONFIG,
    timeout=30 * MINUTES,
    enable_memory_snapshot=True,
    experimental_options={"enable_gpu_snapshot": True}
)
def llama_cpp_inference(
        prompt: Optional[str] = None,
        n_predict: int = -1,
):
    import subprocess

    if prompt is None:
        prompt = "Explain quantum computing in simple terms:"
    args = ["--threads", "8"]
    n_gpu_layers = 64
    command = [
        "/llama.cpp/llama-cli",
        "--model", f"{cache_dir}/{MODEL}",
        "--n-gpu-layers", str(n_gpu_layers),
        "--prompt", prompt,
        "--n-predict", str(n_predict),
    ] + args
    print("🦙 running inference...")
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"llama.cpp failed: {result.stderr}")
    return result.stdout

GPU Snapshots are a game-changer:

Save the entire GPU memory state after model loading
Subsequent cold starts are 10x faster (~3 seconds vs 30+ seconds)
Perfect for high-traffic applications
Still pay only for actual usage time

Running Your LLM Service

Let’s create a complete example that demonstrates both streaming and batch inference:

import modal

download_model = modal.Function.from_name("llms-llama-cpp", "download_model")
llama_cpp_stream = modal.Function.from_name("llms-llama-cpp", "llama_cpp_stream")

try:
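    # First, ensure the model is downloaded.
    # Note: the file fetched here must be the one the app loads via MODEL,
    # so keep repo_id / allow_patterns and MODEL in agreement.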
    download_model.remote(
        repo_id="unsloth/gpt-oss-20b-GGUF",
        allow_patterns=["*Q6_K.gguf"],
    )
    print("✅ Download completed successfully")
except Exception as e:
    if "timeout" in str(e).lower():
        print("⚠️  Download timed out, but model may still be cached")
        print("    Proceeding with inference...")
    else:
        print(f"❌ Download failed: {e}")
        raise

# Streaming inference
print("🚀 Starting streaming inference...")
prompt = "Write a Python function to implement quicksort:"

for chunk in llama_cpp_stream.remote_gen(prompt, n_predict=200):
    print(chunk, end="", flush=True)

print("\n" + "=" * 50)

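The script looks up the functions with modal.Function.from_name, so the llms-llama-cpp app has to be deployed first (modal deploy). Then run the client with: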

python llm_inference.py

You can find the final code here.

Cost Analysis: Modal vs Traditional Hosting

Let’s break down the real cost differences:

Traditional GPU Server (AWS p3.2xlarge)

Base Cost: $3.06/hour × 24 hours × 30 days = $2,203/month
Utilization: Typically 10–20% for most applications
Effective Cost: $11,000-$22,000 per month of actual usage
Scaling: Manual, slow, requires load balancers

Modal LLM Inference (Nvidia T4)

Idle Cost: $0 (true serverless)
Inference Cost: ~$0.0001 per second of GPU time
Example Usage: 1000 inferences/day, 3 seconds each
Daily: 1000 × 3 × $0.0001 = $0.30
Monthly: $0.30 × 30 = $9.00
Scaling: Automatic, unlimited parallelism

Real-World Scenario (Nvidia T4)

For a typical AI application serving 10,000 requests per month (averaging 5 seconds each):

Traditional: $2,203+ per month (plus setup/maintenance costs)
Modal: ~$5 per month (10,000 requests × 5 seconds × $0.0001)
Savings: over 99% cost reduction
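
The arithmetic behind these comparisons is easy to reproduce. The sketch below only uses the rates quoted in this section ($3.06/hour for the dedicated server, roughly $0.0001 per GPU-second for serverless); the function names are illustrative, and real bills also depend on cold starts and the GPU type you pick.

```python
def dedicated_monthly_cost(hourly_rate: float, days: int = 30) -> float:
    """An always-on server is billed for every hour, busy or idle."""
    return hourly_rate * 24 * days


def serverless_monthly_cost(
    requests: int, seconds_per_request: float, rate_per_second: float
) -> float:
    """Serverless is billed only for the seconds spent on requests."""
    return requests * seconds_per_request * rate_per_second


dedicated = dedicated_monthly_cost(3.06)
light_usage = serverless_monthly_cost(1000 * 30, 3, 0.0001)  # 1000 calls/day, 3 s each
scenario = serverless_monthly_cost(10_000, 5, 0.0001)        # 10,000 calls/month, 5 s each
savings = 1 - scenario / dedicated

print(f"Dedicated server: ${dedicated:,.2f}/month")
print(f"Serverless, 1000 x 3 s per day: ${light_usage:.2f}/month")
print(f"Serverless, 10,000 x 5 s per month: ${scenario:.2f}/month")
print(f"Savings in the second scenario: {savings:.1%}")
```

Swap in your own request volume and per-second rate to see where your workload lands.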

Performance Optimizations

Model Quantization

Use appropriately quantized models for your use case:

# Different quantization levels vs performance trade-offs
QUANTIZATION_OPTIONS = {
    "Q2_K": "Smallest size, fastest inference, lowest quality",
    "Q4_K_M": "Balanced size/quality",
    "Q5_K_M": "Better quality, larger size",
    "Q8_0": "Best quality, largest size"
}
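
To get a feel for what each level means on disk, you can estimate file size from parameter count and bits per weight. This is a back-of-the-envelope sketch: the bits-per-weight values are rough, illustrative assumptions (k-quants mix several precisions, so real GGUF sizes vary by model), and estimate_size_gb is a made-up helper name.

```python
# Rough, illustrative bits per weight; actual GGUF files differ by model.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 3.0,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}


def estimate_size_gb(n_params_billion: float, quant: str) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return n_params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8


for quant in APPROX_BITS_PER_WEIGHT:
    print(f"30B model at {quant}: ~{estimate_size_gb(30, quant):.1f} GB")
```

The point is the trend rather than the exact numbers: each step up in quality costs several extra gigabytes of download and GPU memory.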

GPU Selection

Choose GPUs based on your model size and performance needs:

GPU_CONFIGS = {
    "T4": "Good for smaller models (<7B parameters)",
    "A10G": "Great for medium models (7B-13B parameters)",
    "A100": "Large models (30B+ parameters), unless heavily quantized"
}
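
In practice the deciding factor is whether the quantized weights fit in GPU memory. Here is a minimal sketch of that check. The memory sizes are the standard capacities of these cards; the 2 GB overhead for context and runtime buffers is an assumption, and smallest_gpu_that_fits is a hypothetical helper.

```python
from typing import Optional

# GPU memory in GB for each card.
GPU_VRAM_GB = {"T4": 16, "A10G": 24, "A100-40GB": 40, "A100-80GB": 80}


def smallest_gpu_that_fits(
    model_size_gb: float, overhead_gb: float = 2.0
) -> Optional[str]:
    """Return the smallest listed GPU that holds the weights plus overhead."""
    needed = model_size_gb + overhead_gb  # overhead_gb is an assumed allowance
    for name, vram in sorted(GPU_VRAM_GB.items(), key=lambda kv: kv[1]):
        if vram >= needed:
            return name
    return None  # does not fit on a single listed GPU


for size in (4.0, 18.0, 32.0, 100.0):
    print(f"{size:>5.1f} GB model -> {smallest_gpu_that_fits(size)}")
```

A None result means you would need multiple GPUs or a smaller quantization.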

Parallel Processing

Maximize throughput with parallel inference:

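def load_prompts_from_file(path: str) -> list[str]:
    # Hypothetical helper: one prompt per line, blank lines skipped
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]

@app.local_entrypoint()
def batch_process():
    prompts = load_prompts_from_file("batch_requests.txt")

    # Submit prompts in chunks of 10; .map() fans each chunk out in parallel
    batch_size = 10
    total_batches = (len(prompts) + batch_size - 1) // batch_size  # ceiling division
    results = []

    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        results.extend(llama_cpp_inference.map(batch))
        print(f"Processed batch {i // batch_size + 1}/{total_batches}")

    return results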

Production Considerations

Error Handling and Retries

@app.function(
    image=image,
    volumes={cache_dir: model_cache},
    gpu=GPU_CONFIG,
    retries=3,  # Auto-retry on failures
)
def robust_inference(prompt: str, n_predict: int = 100):
    try:
        return llama_cpp_inference.local(prompt, n_predict)
    except Exception as e:
        print(f"Inference failed: {e}")
        # Log error, potentially fallback to different model
        raise
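
If you are curious what a retry policy does under the hood, the general idea is "try again, waiting a bit longer each time". The sketch below shows that idea in plain Python. It is an illustration of retry with exponential backoff, not Modal's implementation, and call_with_retries and its parameters are assumed names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    initial_delay: float = 1.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(delay)
            delay *= backoff
    raise AssertionError("unreachable")


# Demo: a function that fails twice, then succeeds.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays: list[float] = []
result = call_with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, attempts["count"], delays)
```

Passing sleep as a parameter keeps the demo instant and makes the backoff schedule easy to inspect.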

Monitoring and Logging

@app.function(
    image=image,
    volumes={cache_dir: model_cache},
    gpu=GPU_CONFIG,
)
def monitored_inference(prompt: str):
    import time
    
    start_time = time.perf_counter()
    # Whitespace-separated words are only a rough proxy for model tokens
    token_count = len(prompt.split())

    print(f"Starting inference for ~{token_count} input tokens")

    result = llama_cpp_inference.local(prompt)

    duration = time.perf_counter() - start_time
    output_tokens = len(result.split())
    
    print(f"Inference completed: {output_tokens} tokens in {duration:.2f}s")
    print(f"Throughput: {output_tokens/duration:.1f} tokens/second")
    
    return {
        "response": result,
        "metrics": {
            "input_tokens": token_count,
            "output_tokens": output_tokens,
            "duration_seconds": duration,
            "tokens_per_second": output_tokens/duration
        }
    }

The Future of LLM Inference

This Modal + llama.cpp setup represents the future of LLM deployment:

Immediate Benefits:

90%+ cost savings compared to traditional hosting
Zero infrastructure management
Automatic scaling and load balancing
Real-time streaming capabilities
Support for multiple models and use cases

Long-term Advantages:

As Modal adds more GPU types, you get access automatically
Performance improvements in llama.cpp benefit your deployment immediately
No vendor lock-in — your code runs anywhere Modal runs
Built-in observability and debugging tools

Getting Started

Ready to deploy your own serverless LLM inference? Here’s your action plan:

Sign up for Modal at modal.com
Install the CLI: pip install modal
Set up authentication: modal token new
Clone the example: Copy the code from this article
Deploy: modal run llm_inference.py

Within minutes, you’ll have a production-ready LLM service that scales automatically and costs a fraction of traditional GPU hosting.

The serverless revolution has finally reached AI inference. Modal makes it possible to run powerful language models with the same ease as calling a function — because that’s exactly what it is.

Ready to stop paying for idle GPUs? Your LLM inference service is just one decorator away.

Modal: AWS Power + GPU Speed = Cloud Computing Unleashed

Patric — Mon, 22 Sep 2025 15:55:27 GMT

From local code to cloud GPUs with just a decorator — no Docker, no DevOps, no headaches

Remember the last time you tried to deploy a simple Python function to AWS? The endless YAML configurations, Docker builds that mysteriously break, IAM roles that make your head spin, and don’t even get started on trying to get a GPU instance running. By the time you’re done, you’ve forgotten what you were trying to build in the first place.

Modal flips this entire experience on its head. What if deploying to the cloud was as simple as adding @app.function() above your Python function? What if you could grab a GPU-powered instance without wrestling with capacity reservations, instance types, or AMI images? What if "going to production" felt more like running a local script than managing a small army of cloud services?

That’s exactly what Modal delivers. It’s cloud computing designed for the modern developer who wants to build AI applications, process data, or scale compute-intensive workloads without becoming a DevOps expert first. While AWS gives you infinite flexibility (and infinite complexity), Modal gives you infinite simplicity with the power you actually need.

In this guide, we’ll show you how Modal transforms cloud development from a multi-day infrastructure project into a five-minute coding session. You’ll see how to go from a simple Python function to a scalable, GPU-accelerated cloud service faster than you can spin up an EC2 instance.

Let’s dive in and see why developers are calling Modal “the cloud platform that actually gets it.”

Setup

Head over to https://modal.com/signup

The quickest way to sign up for Modal is with your existing GitHub or Google account. You get a $5 credit on signup, and $30 of credit every month once you add a payment method.

The screen shows instructions for setting up Modal. Here they are as text:

Run these commands to install the Python library locally and authenticate:

pip install modal

python3 -m modal setup

The first command will install the Modal client library on your computer, along with its dependencies.

The second command creates an API token by authenticating through your web browser. It will open a new tab, but you can close it when you are done.

Introduction

Open your preferred IDE, create a file named hello_world.py, and follow along to get a first script running.

First, configure an App that will run on Modal.

An App groups one or more Functions for atomic deployment and acts as a shared namespace. All Functions and Classes are associated with an App.

import sys

import modal

app = modal.App("example-hello-world")

A Function runs independently and scales on its own. If it has no live inputs, it won’t use any containers or incur costs, even if its App is still deployed.

Modal lets you run code in the cloud. To get started, write a simple function that logs something to the console. To make it work with Modal, just add the @app.function decorator above it.

@app.function()
def f():
    print("Hello world!")

Running the function locally, in the cloud, and in parallel

We can call the function in three ways:

Locally on your own machine using f.local
Remotely in the cloud using f.remote
In parallel across many inputs in the cloud using f.map

The example below calls f both locally and remotely from inside the main function.

@app.local_entrypoint()
def main():
    # run the function locally
    print(f.local())

    # run the function remotely on Modal
    print(f.remote())

Running with modal run

When you enter modal run hello_world.py in your shell, Modal automatically starts an app, runs the main function, and shows its logs. Alongside these logs, you’ll also see logs from f: first when it runs locally, then when it runs remotely in the cloud.

This behavior comes from the @app.local_entrypoint decorator on main.

It marks main as the CLI entrypoint for your Modal app.
When you call modal run, Modal knows to start from this function.
Unlike a regular Modal function (which runs in the cloud when you call it with .remote()), a local_entrypoint runs on your machine and can orchestrate other Modal functions.

Extra capabilities of local_entrypoint:

Multiple entrypoints: You can define more than one and run them with modal run app_module.py::app.function_name.
Argument parsing: If your entrypoint accepts arguments (str, int, float, bool, datetime), Modal automatically parses CLI options. For example:

@app.local_entrypoint()
def main(foo: int, bar: str):
    some_modal_function.remote(foo, bar)

You can run it with:

modal run app_module.py --foo 1 --bar "hello"

You don’t need an explicit app.run(). Modal creates and runs the app for you when you invoke modal run.

Let’s now look at an example to run a function in parallel:

import modal

app = modal.App("example-map")

@app.function()
def f(x):
    print(f"Processing {x}")
    return x * 2

@app.local_entrypoint()
def main():
    inputs = [1, 2, 3, 4]   # the function will run once for each input
    results = list(f.map(inputs))  # runs in parallel in the cloud
    print("Results:", results)

Output (the Processing lines may appear in a different order, since the calls run in parallel):

Processing 1
Processing 2
Processing 3
Processing 4
Results: [2, 4, 6, 8]

Notes:

Each element of inputs is passed to f() as an argument.
map() runs as many functions in parallel as there are inputs (you can have multiple arguments too by passing multiple iterators).
order_outputs=True (default) ensures results are in the same order as inputs; set it to False to get results as soon as each execution finishes.
If you want to handle errors without crashing, use return_exceptions=True.
If you don’t need the results, use for_each instead of map; it runs the function for each input and waits for all executions to finish.
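
To build intuition for ordered results and returned exceptions without touching the cloud, here is a local analogy using Python's standard thread pool. It is only an analogy under that assumption: it mimics the semantics described in the notes above and does not use or reproduce Modal's map.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def double(x: int) -> int:
    if x == 3:
        raise ValueError("bad input: 3")
    return x * 2


inputs = [1, 2, 3, 4]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(double, x) for x in inputs]

    # Input order, with exceptions captured as values instead of raised
    ordered = []
    for fut in futures:
        exc = fut.exception()  # blocks until this call has finished
        ordered.append(exc if exc is not None else fut.result())

    # Completion order: whichever call finishes first comes first
    completed = list(as_completed(futures))

print(ordered[:2], type(ordered[2]).__name__, ordered[3])
```

The ordered list lines up with inputs position by position, and the failing input shows up as an exception object rather than crashing the whole run.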

Modal lets you run code in the cloud as easily as running it locally — no waiting for builds, pushing containers, or switching to a web UI to check logs.

Ephemeral Apps

An ephemeral App is a temporary Modal app that exists only while your script is running. It’s created when you use modal run or app.run(), and stops automatically when the script ends or the client disconnects. You can keep it running after the script ends using --detach.

You can also control logs and output using modal.enable_output().

Example using app.run()

import modal

app = modal.App("example-hello-world")

@app.function()
def f():
    print("Hello world!")
    return "done"

def main():
    with modal.enable_output():  # show logs and progress
        with app.run():  # start ephemeral app
            # run the function locally
            print(f.local())

            # run the function remotely on Modal
            print(f.remote())

# app.run() replaces modal run, so start this with plain Python:
#   python hello_world.py
if __name__ == "__main__":
    main()

Explanation:

app.run() creates the ephemeral app inside your script.
modal.enable_output() makes logs from f() visible in the terminal.
f.local() runs the function on your machine, f.remote() runs it in the cloud.

This is equivalent to running your script with modal run hello_world.py, but now fully controlled from Python.

Configuring Resources and Environment

Now let’s get into the real power of Modal — configuring your cloud environment without the traditional headaches.

Custom Images and Dependencies

Modal uses container images, but you don’t need to write Dockerfiles. Instead, you define your environment programmatically:

import modal

# Create a custom image with your dependencies
image = modal.Image.debian_slim().uv_pip_install(
    "numpy", 
    "pandas", 
    "scikit-learn",
    "torch"
)

app = modal.App("ml-example")

@app.function(image=image)
def train_model(data):
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    
    # Your ML code here
    model = LinearRegression()
    # ... training logic
    return "Model trained successfully!"

GPU Configuration Made Simple

Here’s where Modal really shines. Getting GPU access is as simple as adding a parameter:

@app.function(gpu="T4", image=image)  # or "A10G", "L40S", etc.; the image above provides torch
def gpu_accelerated_task():
    import torch
    
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"Using GPU: {torch.cuda.get_device_name(0)}")
        
        # Your GPU-accelerated code here
        tensor = torch.randn(1000, 1000).to(device)
        result = torch.matmul(tensor, tensor)
        
        return f"Computed on {device}"
    else:
        return "No GPU available"

You can find all available GPUs and their prices here.

Working with Secrets and Environment Variables

Modal handles secrets securely without exposing them in your code:

# First, create secrets in the Modal dashboard or CLI
# modal secret create my-api-key API_KEY=your_secret_key

@app.function(secrets=[modal.Secret.from_name("my-api-key")])
def call_external_api():
    import os
    import requests
    
    api_key = os.environ["API_KEY"]
    response = requests.get(f"https://api.example.com/data?key={api_key}")
    return response.json()

Persistent Storage with Volumes

For data that needs to persist between function calls:

vol = modal.Volume.from_name("my-volume", create_if_missing=True)

@app.function(volumes={"/data": vol})
def run():
    with open("/data/xyz.txt", "w") as f:
        f.write("hello")
    vol.commit()  # Needed to make sure all changes are persisted before exit

Why Modal Wins

After working through these examples, the Modal advantage becomes clear:

Traditional Cloud Deployment:

Write Dockerfile
Set up CI/CD pipeline
Configure load balancers
Manage auto-scaling
Monitor resource usage
Handle secrets management
Debug across multiple services
Wait for builds and deployments

Modal Deployment:

Add @app.function() decorator
Run modal run script.py
Done.

Modal abstracts away the infrastructure complexity while giving you access to the full power of cloud computing, including GPUs. You focus on writing Python code that solves your problems, not managing containers and orchestration systems.

For AI developers, data scientists, and anyone building compute-intensive applications, Modal represents a fundamental shift in how we think about cloud development. It’s not just easier — it’s what cloud computing should have been from the beginning.

Ready to experience cloud computing without the pain? Head to modal.com and see what you can build in the next five minutes.

The Three Phases of Open Source AI: From Bigger to Smarter

Patric — Sat, 20 Sep 2025 17:43:17 GMT

How the AI industry learned that size isn’t everything

Imagine if someone told you that a 671-billion parameter AI model could run cheaper than a 70-billion parameter one. Three years ago, that would have sounded impossible. Today, it’s reality — and it represents one of the most fascinating pivots in the history of artificial intelligence.

The story of open source Large Language Models (LLMs) over the past three years isn’t just about technology; it’s about an entire industry learning, adapting, and ultimately discovering that the path to better AI isn’t always “make it bigger.”

What Are LLMs and Why Do Parameters Matter?

Before we dive into the evolution, let’s establish the basics. Large Language Models are AI systems trained on massive amounts of text to understand and generate human-like language. Think of them as incredibly sophisticated autocomplete systems that can write essays, answer questions, and even code.

Parameters are like the “brain cells” of these models — they’re the mathematical weights that determine how the AI processes information. For the longest time, the industry believed in a simple equation: more parameters = smarter AI.

This belief drove what I call the “parameter arms race,” where companies competed to build the biggest models possible. But as we’ll see, this race had an unexpected ending.

📈 Phase 1: The Scaling Race (2022-mid 2024)

Mentality: “More parameters = better performance”

In 2022, the AI world was captivated by a simple idea: bigger is better. OpenAI’s GPT-3 had 175 billion parameters and seemed magical. The logical conclusion? Build even bigger models.

The Giants of the Scaling Era

BLOOM (176B parameters) — July 2022 The first truly massive open-source model, BLOOM was built by a consortium of researchers who pooled resources to create something that could compete with GPT-3. It required enormous computational power and could barely fit on the most powerful hardware available.

LLaMA 1 (up to 65B parameters) — February 2023 Meta’s LLaMA marked a shift toward more efficient scaling, but still followed the “bigger is better” philosophy with models ranging from 7B to 65B parameters.

Falcon (180B parameters) — September 2023 The UAE’s Technology Innovation Institute pushed even further, creating one of the largest dense models of its time.

LLaMA 3.1 (405B parameters) — July 2024 The summit of the scaling race. Meta’s 405B model required massive infrastructure and represented the pinnacle of “brute force” AI scaling.

Nemotron (340B parameters) — June 2024 NVIDIA’s contribution to the arms race, another massive model requiring enormous computational resources.

The Growing Problems

As models grew larger, several critical issues became apparent:

Infrastructure Nightmares: A 405B parameter model needs multiple high-end GPUs just to load into memory, let alone run efficiently. Most companies simply couldn’t afford the hardware.

Astronomical Costs: Running these models for inference (generating responses) cost thousands of dollars per day. Only the biggest tech companies could sustain this.

Deployment Impossibility: Want to run a 400B model on your laptop? Forget it. Even running it in the cloud required specialized, expensive setups.

Diminishing Returns: The performance gains weren’t always proportional to the size increases. A 400B model wasn’t necessarily twice as good as a 200B model.

By mid-2024, it was becoming clear that the scaling race was hitting fundamental limits — not of technology, but of practicality.

🧠 Phase 2: The MoE Revolution (Late 2023–2024)

Mentality: “Smart scaling beats brute scaling”

Just when it seemed like bigger models were the only path forward, a different approach emerged: Mixture-of-Experts (MoE) architecture. This innovation would completely change how we think about model size.

Understanding MoE: The Game Changer

Imagine a university with 8 different professors, each an expert in a specific subject. When a student asks a question, instead of consulting all 8 professors, the university director routes the question to just the 2 most relevant experts. The university has the knowledge capacity of 8 professors but only pays the consultation cost of 2.

That’s essentially how MoE models work. They have multiple “expert” networks, but for any given input, they only activate a subset of them.
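
The routing idea fits in a few lines of code. Below is a toy sketch of top-2 routing in plain Python: a router scores all experts, only the two best run, and their outputs are combined with renormalized weights. The function names, the scalar "experts" and the hard-coded router scores are all assumptions for illustration; real MoE layers do this per token with learned routers and neural-network experts.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x: float, experts, router_logits: list[float], k: int = 2) -> float:
    """Weighted sum of the outputs of only the selected experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Eight toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # made-up router scores

selected = route_top_k(logits)
print("experts used:", [i for i, _ in selected])
print("output:", round(moe_forward(10.0, experts, logits), 3))
```

Six of the eight experts are never called for this input, which is exactly where the compute savings come from.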

The MoE Pioneers

Mixtral 8x7B (47B total, 13B active) — December 2023 Mistral AI’s breakthrough model proved the MoE concept worked in practice. Each layer has 8 expert feed-forward networks, and a router picks 2 of them for every token. Because the experts share the attention layers, the total comes to about 47B parameters rather than 8 × 7B = 56B, and only about 13B are used for any single prediction. The result? Near-GPT-3.5 performance at a fraction of the computational cost.

DeepSeek-V2 (236B total, 21B active) — May 2024 Chinese AI lab DeepSeek pushed MoE further, creating a model with massive capacity that remained surprisingly efficient to run.

DeepSeek-V3 (671B total, 37B active) — December 2024 The crown jewel of MoE evolution. Despite being one of the largest open-weight models released so far, DeepSeek-V3 needs less compute per token than many smaller, traditional models. It’s like having a massive library but only needing to read the relevant books for each question.

The Breakthrough Insight

MoE models revealed a crucial insight: you can separate model capacity from computational cost. Traditional thinking assumed these were locked together — more capacity always meant more compute. MoE proved this wrong.

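Suddenly, the parameter count became a misleading metric. DeepSeek-V3’s 671B parameters sound massive, but it only uses 37B per token, so each token costs less compute than in a traditional 70B model. The catch is memory: all 671B weights still have to be loaded, even though only a fraction is active at any moment.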
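
A quick calculation makes the gap between total and active parameters concrete. The sketch uses the parameter counts quoted above; treating per-token compute as proportional to active parameters is a simplifying assumption.

```python
# (total parameters, active parameters per token), in billions
MODELS = {
    "Mixtral 8x7B": (47, 13),
    "DeepSeek-V2": (236, 21),
    "DeepSeek-V3": (671, 37),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


for name, (total, active) in MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {active}B of {total}B active ({share:.1%})")

# Rough per-token compute of DeepSeek-V3 relative to a dense 70B model
relative_compute = MODELS["DeepSeek-V3"][1] / 70
print(f"DeepSeek-V3 vs dense 70B: ~{relative_compute:.2f}x the compute per token")
```

The smaller the active share, the more capacity you get per unit of compute, at the price of keeping the inactive weights in memory.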

⚡ Phase 3: The Efficiency Era (2024–2025)

Mentality: “Right-sized models for real-world deployment”

The third phase represents the maturation of the field. The industry stopped asking “How big can we make it?” and started asking “How efficiently can we solve real problems?”

The Small Model Renaissance

Mistral 7B — The David Among Goliaths Released in September 2023, this 7.3B parameter model shocked the AI world by outperforming much larger competitors. How? Superior training data, better algorithms, and focused optimization. It proved that smart engineering could beat brute force.

Microsoft’s Phi Series — Proving Small Can Be Mighty

Phi-2 (2.7B parameters): Outperformed models 10x its size
Phi-3 (3.8B-14B parameters): Continued the trend of efficient small models

These models showed that with the right approach, you could achieve excellent performance while being deployable on everything from cloud servers to laptops.

Qwen 2.5 — The Complete Spectrum Approach Alibaba’s Qwen series offered models from 0.5B to 72B parameters, recognizing that different applications need different capabilities. A chatbot for customer service doesn’t need the same power as a research assistant.

Why the Industry Pivoted

Several factors drove this shift toward efficiency:

1. Deployment Reality Check

Companies realized that 405B parameter models, while impressive, were impractical for most real-world applications. The infrastructure costs alone made them accessible only to tech giants with massive resources.

The Math Problem: Running a 405B model requires multiple A100 or H100 GPUs, each costing $10,000-$40,000. Most businesses couldn’t justify this expense.
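
You can sanity-check that claim with simple arithmetic. The sketch below counts how many GPUs are needed just to hold the weights. It assumes 80 GB cards and 2 bytes per parameter (16-bit weights) and ignores the extra memory needed for the KV cache and activations; gpus_needed is an illustrative helper.

```python
import math


def gpus_needed(
    n_params_billion: float,
    bytes_per_param: float = 2.0,  # 16-bit weights (assumption)
    gpu_memory_gb: float = 80.0,   # 80 GB card (assumption)
) -> int:
    """Minimum number of GPUs required to hold the weights alone."""
    weights_gb = n_params_billion * bytes_per_param
    return math.ceil(weights_gb / gpu_memory_gb)


count_16bit = gpus_needed(405)                      # 810 GB of weights
count_8bit = gpus_needed(405, bytes_per_param=1.0)  # 405 GB of weights
count_7b = gpus_needed(7)                           # 14 GB of weights

print(f"405B at 16-bit: {count_16bit} GPUs")
print(f"405B at 8-bit: {count_8bit} GPUs")
print(f"7B at 16-bit: {count_7b} GPU")

# Hardware cost range using the per-GPU prices quoted above
low, high = count_16bit * 10_000, count_16bit * 40_000
print(f"Hardware for 405B at 16-bit: ${low:,} to ${high:,}")
```

Even before any request is served, the 405B model ties up a rack of hardware, while a 7B model fits comfortably on a single card.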

The Accessibility Issue: If only a handful of companies can afford to run your AI model, you’re not democratizing AI — you’re creating an exclusive club.

2. The MoE Breakthrough

MoE models proved that architectural innovation could be more powerful than simply adding parameters. DeepSeek-V3 demonstrated that a 671B parameter model could run more efficiently than traditional 70B models — a complete paradigm shift.

3. Training Efficiency Advances

The industry developed better ways to train models:

Quality Over Quantity: Instead of feeding models more data, researchers focused on higher-quality, more carefully curated datasets.

Improved Techniques: New training methods like better tokenization, improved attention mechanisms, and more efficient architectures allowed smaller models to achieve better results.

Specialized Training: Models began being trained for specific tasks rather than trying to be everything to everyone.

4. Market Demands

Real-world deployment requirements drove the efficiency push:

Edge Computing: Companies wanted AI that could run on phones, tablets, and edge devices, not just massive server farms.

Cost-Conscious Enterprises: Businesses needed AI solutions that fit their budgets, not just their ambitions.

Developer-Friendly Models: Programmers wanted models they could actually experiment with and deploy, not models that required a PhD to operate.

The Current State: Beyond the Parameter Wars

Today’s AI landscape looks radically different from the scaling race of 2022–2023. The winners aren’t necessarily the biggest models, but the smartest ones.

The New Champions

DeepSeek-V3: Represents the peak of MoE efficiency — massive capacity when needed, efficient operation always.

Qwen 2.5: Offers the complete spectrum from ultra-lightweight (0.5B) to flagship (72B), recognizing that one size doesn’t fit all.

Mistral 7B: Continues to punch above its weight class, proving that focused optimization beats raw scaling.

Mixtral Series: Pioneered open-source MoE and continues to push the boundaries of efficient scaling.

The New Success Metrics

The industry has moved beyond simple parameter counting to more meaningful metrics:

Performance per Dollar: How much capability do you get for your compute budget?

Deployment Feasibility: Can real companies actually use this model?

Task-Specific Optimization: How well does it perform on the specific tasks that matter?

Active vs. Total Parameters: For MoE models, what matters is not total capacity but active compute per inference.

Looking Forward: The Post-Scaling Era

The parameter wars taught the AI industry valuable lessons, and the future reflects this newfound wisdom:

Smart Architectures Over Brute Force

The future belongs to innovations like MoE, efficient attention mechanisms, and other architectural advances that maximize capability while minimizing computational requirements.

Right-Sized Models for Specific Use Cases

Instead of one massive model trying to do everything, we’re seeing specialized models optimized for particular tasks — coding assistants, writing helpers, scientific research tools, and more.

Deployment-First Thinking

New models are being designed with real-world deployment in mind from day one, not as an afterthought.

Democratization Through Efficiency

By making models more efficient, the industry is making AI more accessible to smaller companies, researchers, and individual developers.

Why This Evolution Matters

The three-phase evolution of open source LLMs represents more than just technical progress — it’s a story about an industry learning to think differently about innovation.

For Businesses: The shift toward efficiency means AI is becoming more accessible and affordable. You no longer need Google-sized budgets to deploy capable AI systems.

For Developers: Efficient models mean you can experiment, iterate, and deploy AI solutions without massive infrastructure investments.

For Society: More efficient AI means broader access, which leads to more innovation and more diverse applications.

For the Future: The lessons learned from the parameter wars are shaping a more sustainable, practical approach to AI development.

The story of LLM evolution proves that in technology, as in life, bigger isn’t always better — smarter usually is. The parameter wars are over, and the efficiency era has just begun.

The AI industry’s journey from “bigger is better” to “smarter is better” mirrors many technological revolutions. Just as the computer industry moved from room-sized mainframes to powerful laptops, AI is learning that true progress comes not from brute force, but from elegant solutions to real problems.

Key Takeaways

Size isn’t everything: The AI industry learned that more parameters don’t automatically mean better performance
Architecture matters: Innovations like MoE can provide massive capacity improvements without proportional cost increases
Deployment drives design: Real-world constraints shape what models succeed in the market
Efficiency enables access: More efficient models democratize AI by making it accessible to more organizations
The future is specialized: Rather than one massive model for everything, we’re moving toward right-sized models for specific use cases

The three phases of LLM evolution show us that the most important innovations often come not from doing more of the same thing, but from fundamentally rethinking the problem.