<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Patric on Medium]]></title>
        <description><![CDATA[Stories by Patric on Medium]]></description>
        <link>https://medium.com/@pguso?source=rss-b1afabf2359d------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*e2X0-72DfkxgswwPQ5Ay7g.jpeg</url>
            <title>Stories by Patric on Medium</title>
            <link>https://medium.com/@pguso?source=rss-b1afabf2359d------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 20 Jun 2026 11:45:57 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@pguso/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[AI Agents in Production: The Lifecycle Problem Nobody Talks About]]></title>
            <link>https://pguso.medium.com/ai-agents-in-production-the-lifecycle-problem-nobody-talks-about-61117f3e0fcf?source=rss-b1afabf2359d------2</link>
            <guid isPermaLink="false">https://medium.com/p/61117f3e0fcf</guid>
            <category><![CDATA[software-architecture]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[saas]]></category>
            <category><![CDATA[product-development]]></category>
            <dc:creator><![CDATA[Patric]]></dc:creator>
            <pubDate>Sun, 18 Jan 2026 10:11:24 GMT</pubDate>
            <atom:updated>2026-01-18T17:10:34.634Z</atom:updated>
            <content:encoded><![CDATA[<p><em>Bridging the gap between AI agent prototypes and production-ready SaaS systems</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HA7Ls3aYq2lLJQMFw_ZcXQ.png" /></figure><h3>The Moment Everything Falls Apart</h3><p>You’ve built an AI agent. It works beautifully in your notebook. The demo impresses everyone. Then you try to deploy it to production and suddenly you’re dealing with:</p><ul><li><strong>Users expecting instant responses</strong> (your agent takes 30 seconds)</li><li><strong>Prompt changes breaking everything</strong> (no rollback strategy)</li><li><strong>Costs spiraling</strong> (you’re burning $500/day on a feature 3 people use)</li><li><strong>Support tickets you can’t debug</strong> (“It gave me a weird answer yesterday”)</li></ul><p>The problem isn’t your code. <strong>It’s that agents have a lifecycle that nobody designed for.</strong></p><p>Let me show you what I mean.</p><h3>The Agent Lifecycle (What Actually Happens)</h3><p>Here’s what really happens when a user interacts with your agent in production:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*byj1Dr6neNPUQmKy2Jg_8w.png" /></figure><p><strong>Notice what’s missing from most tutorials:</strong></p><ul><li>Version routing</li><li>Guardrails</li><li>Monitoring that actually helps</li></ul><p>Let’s fix that.</p><h3>Problem #1: Versioning Agents (Not Just Prompts)</h3><p>Here’s what everyone does wrong:</p><pre>// The naive approach<br>const agent = {<br>  prompt: &quot;You are a helpful assistant...&quot;,<br>  model: &quot;gpt-4&quot;,<br>  temperature: 0.7<br>}</pre><p><strong>What happens when you change the prompt?</strong></p><ul><li>Every user gets the new version immediately</li><li>No A/B testing</li><li>No rollback if it breaks</li><li>No idea which version caused the bug</li></ul><h3>The Solution: Treat Agents Like Software Releases</h3><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*_29MAvJ5-MdJAD4e3WzStA.png" /></figure><p><strong>Key insight:</strong> Users should be <strong>pinned to a version</strong> until you explicitly migrate them.</p><h3>Problem #2: The Cost-Performance Tradeoff</h3><p>Every agent call has a hidden decision tree:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sj_27M4EpijjvCxZ7TsA8g.png" /></figure><p><strong>Most tutorials skip this entirely.</strong> But in production:</p><ul><li>60% of requests don’t need GPT-4</li><li>30% could be cached</li><li>10% need the heavy model</li></ul><p><strong>Result:</strong> You can cut costs by 70% with smart routing.</p><h3>Problem #3: Monitoring That Actually Helps</h3><p>When something goes wrong, you need to know:</p><ol><li><strong>Which version</strong> of the agent was used?</li><li><strong>What was the exact prompt</strong> sent to the LLM?</li><li><strong>How much did it cost?</strong></li><li><strong>Did guardrails trigger?</strong></li></ol><h3>The Monitoring Stack</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ps0lDCTQOmz6GaFitcamMg.png" /></figure><p><strong>Critical:</strong> You need <strong>structured logging</strong> from day one.</p><h3>The Multi-Tenant Challenge</h3><p>If you’re building SaaS, you have an extra problem: <strong>different customers need different agent behavior</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RTPTduxhLK0dxZDMMWfubA.png" /></figure><p><strong>Key decisions:</strong></p><ul><li><strong>Shared core</strong> vs. 
<strong>per-customer forks</strong></li><li><strong>Override hierarchy</strong>: Global → Tenant → User</li><li><strong>Isolation</strong>: How do you prevent data leakage?</li></ul><h3>Putting It Together: The Production-Ready Architecture</h3><p>Here’s the full picture:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iZBOCDePB4kPFvQG-Mfy_A.png" /></figure><h3>What This Means for You</h3><p>If you’re building agents for production, you need to think about:</p><ol><li><strong>Versioning</strong> from day one (not after you break production)</li><li><strong>Cost optimization</strong> as a first-class concern (not an afterthought)</li><li><strong>Observability</strong> that lets you debug what actually happened (not just “it failed”)</li><li><strong>Multi-tenancy</strong> if you’re building SaaS (different customers = different configs)</li></ol><p><strong>The good news:</strong> Most of this is just structured thinking. You don’t need fancy tools.</p><p><strong>The bad news:</strong> Few people are teaching this. Most tutorials stop at “here’s how to call OpenAI.”</p><h3>Next Up</h3><p>In <strong>Part 2</strong>, I’ll show you:</p><ul><li>Concrete code for versioned agent execution</li><li>A guardrails layer that actually works</li><li>How to A/B test prompt changes safely</li></ul><p><strong>Part 3</strong> will cover:</p><ul><li>Eval frameworks that run in production</li><li>Rollback strategies when agents break</li><li>Multi-tenant prompt injection (the scary stuff)</li></ul><p><em>Want to see the code? All examples are in the companion repo: </em><a href="https://github.com/pguso/ai-agents-saas-edition"><em>ai-agents-saas-edition</em></a></p><p><em>Building agents in production? I’d love to hear what challenges you’re facing. Connect with me on </em><a href="https://www.linkedin.com/in/patric-gutersohn-466046167/"><em>LinkedIn</em></a><em>.</em></p><p><strong>The core insight:</strong> Agents aren’t functions. 
They’re services with lifecycles, versions, and SLAs. Treat them like that from day one and you’ll save yourself months of pain.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The GGUF Format Explained: Making AI Models Run Anywhere (Even on Your Laptop)]]></title>
            <link>https://pguso.medium.com/the-gguf-format-explained-making-ai-models-run-anywhere-even-on-your-laptop-30dcb45358da?source=rss-b1afabf2359d------2</link>
            <guid isPermaLink="false">https://medium.com/p/30dcb45358da</guid>
            <category><![CDATA[gguf]]></category>
            <category><![CDATA[llama-cpp]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[hugging-face]]></category>
            <dc:creator><![CDATA[Patric]]></dc:creator>
            <pubDate>Thu, 18 Dec 2025 19:13:17 GMT</pubDate>
            <atom:updated>2025-12-18T19:13:17.434Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zooL1eTlMhOk7lcWI2CV7w.png" /></figure><p>Ever wondered how people run powerful AI models like Llama on regular laptops without a supercomputer? The secret lies in a clever file format called GGUF. Let’s explore what it is, where it came from, and why it’s revolutionizing how we use large language models.</p><h3>What Problem Does GGUF Solve?</h3><p>Imagine trying to fit an entire library into your backpack. That’s essentially what we’re doing when we try to run modern AI models on regular computers. Models like GPT or Llama can be tens or even hundreds of gigabytes in size, requiring massive amounts of RAM and powerful GPUs to run.</p><p>This is where GGUF comes in. Think of it as a compression technique specifically designed for AI models, similar to how ZIP files compress documents, but much smarter. GGUF doesn’t just shrink the file size — it reorganizes how the model is stored so it can run efficiently on everyday hardware.</p><p><strong>Real-world impact</strong>: With GGUF, a model that normally requires 64GB of RAM and a high-end GPU can run on a laptop with 16GB of RAM and just a CPU. That’s democratizing AI in action.</p><h3>The Origins: From GGML to GGUF</h3><p>To understand GGUF, we need to take a quick trip back to 2022.</p><h3>The GGML Era</h3><p>Developer Georgi Gerganov created GGML (a combination of his initials “GG” and “ML” for machine learning) in late 2022 as a tensor library focused on making AI models accessible on standard hardware. Before building llama.cpp, Gerganov had already proven the concept with whisper.cpp, which brought OpenAI’s Whisper speech-to-text model to consumer devices.</p><p>llama.cpp began development in March 2023 as a pure C/C++ implementation with no dependencies, designed to run on CPUs including smartphones. 
The project gained massive traction, accumulating over 85,000 stars on GitHub, because it solved a real problem: making powerful AI accessible without specialized hardware.</p><p>However, GGML had limitations. Adding new features often broke compatibility with existing models, and the format lacked flexibility for storing essential metadata like tokenizer information or model-specific parameters.</p><h3>Enter GGUF</h3><p>On August 21st, 2023, the llama.cpp team introduced GGUF (GPT-Generated Unified Format) as a replacement for GGML. This wasn’t just an incremental update — it was a complete redesign addressing GGML’s shortcomings.</p><p>GGUF was designed to be extensible and capable of incorporating new information without breaking compatibility with older models. It combines model parameters with comprehensive metadata in a single binary file, making models truly portable and self-contained.</p><h3>How GGUF Works: The Technical Magic</h3><h3>The Quantization Game</h3><p>At the heart of GGUF’s efficiency is <strong>quantization</strong> — the art of representing numbers with fewer bits while maintaining acceptable accuracy.</p><p>Here’s the concept in simple terms: Imagine you’re an artist with a palette of 16 million colors (standard for digital images). Quantization is like choosing to work with only 256 colors instead. Yes, you lose some nuance, but for many purposes, the result is still excellent — and your artwork takes up far less space.</p><p>In AI models, the weights (the learned parameters that make the model work) are typically stored as 32-bit or 16-bit floating-point numbers. GGUF supports quantization from as low as 2-bit to 8-bit integers, along with standard formats like float32, float16, and bfloat16.</p><p>Here’s what different quantization levels mean in practice:</p><p><strong>Q2_K (2-bit)</strong>: The most aggressive compression, roughly 2.5 bits per weight. 
Great for testing or when resources are extremely limited, but expect noticeable quality loss.</p><p><strong>Q4_K (4-bit)</strong>: The sweet spot for most users — uses about 4.5 bits per weight. Offers excellent balance between size and quality.</p><p><strong>Q5_K (5-bit)</strong>: Higher quality, slightly larger files. Good when you have a bit more RAM to spare.</p><p><strong>Q8_0 (8-bit)</strong>: Nearly indistinguishable from the original in most cases, but still half the size of 16-bit models.</p><p><strong>F16/F32</strong>: Full precision formats for when quality is paramount and you have the resources.</p><h3>The File Structure</h3><p>A GGUF file consists of four main sections written sequentially: header, metadata key-value pairs, tensor information, and the tensor data itself.</p><p>Think of it like a well-organized filing cabinet:</p><ol><li><strong>Header</strong>: The label on the outside telling you what’s inside and how to open it</li><li><strong>Metadata</strong>: The index cards with all the important information about the model</li><li><strong>Tensor Info</strong>: The catalog listing what’s stored where</li><li><strong>Tensor Data</strong>: The actual files containing the model weights</li></ol><p>This structure includes everything necessary for running a GPT-like language model: tokenizer vocabulary, context length, tensor information, and other attributes.</p><h3>Where GGUF Came From: The llama.cpp Project</h3><p>GGUF is inseparable from llama.cpp, the project that created and maintains it. The creation of GGML was inspired by Fabrice Bellard’s work on LibNC, and the entire effort has been focused on one goal: making AI models work efficiently on consumer hardware.</p><p>The project supports an impressive array of hardware targets: x86, ARM, Metal (for Apple Silicon), CUDA (NVIDIA GPUs), ROCm (AMD GPUs), and more. 
It uses CPU optimizations like AVX, AVX2, and AVX-512 on Intel/AMD processors, and NEON on ARM devices.</p><p>What started as an experiment in March 2023 has become the foundation for running local AI models worldwide.</p><h3>How GGUF Is Used Today</h3><h3>1. Running Models Locally</h3><p>The most common use case is running large language models on your own computer. Here’s a quick example:</p><pre># Download llama.cpp<br>git clone https://github.com/ggml-org/llama.cpp<br>cd llama.cpp<br><br># Build it<br>cmake -B build<br>cmake --build build --config Release<br><br># Run a model<br>./build/bin/llama-cli -m path/to/model.gguf -p &quot;Hello, world!&quot;</pre><p>That’s it. No cloud services, no API keys, no privacy concerns about your data leaving your machine.</p><h3>2. Converting Models to GGUF</h3><p>You can convert many Hugging Face models to GGUF format, as long as llama.cpp supports the model architecture. The commands below run from the directory that contains your llama.cpp checkout:</p><pre># Download a model from Hugging Face (requires: pip install huggingface_hub)<br>python -c &quot;from huggingface_hub import snapshot_download; snapshot_download(repo_id=&#39;meta-llama/Llama-3.2-3B&#39;, local_dir=&#39;model&#39;)&quot;<br><br># Convert to FP16 first<br>python llama.cpp/convert_hf_to_gguf.py model --outtype f16 --outfile model-fp16.gguf<br><br># Then quantize to desired level<br>./llama.cpp/build/bin/llama-quantize model-fp16.gguf model-q4.gguf Q4_K_M</pre><h3>3. Desktop Applications</h3><p>Several user-friendly applications have emerged that use GGUF under the hood:</p><ul><li><strong>LM Studio</strong>: A polished GUI for running models on Windows, macOS, and Linux</li><li><strong>Text Generation WebUI</strong>: A feature-rich web interface with GPU support</li><li><strong>KoboldCpp</strong>: Popular for creative writing and storytelling</li><li><strong>Jan</strong>: An open-source ChatGPT alternative that runs locally</li></ul><h3>4. 
Python Integration</h3><p>You can use GGUF models directly in Python applications through the llama-cpp-python bindings:</p><pre># Requires: pip install llama-cpp-python<br>from llama_cpp import Llama<br><br># Load a GGUF model<br>llm = Llama(<br>    model_path=&quot;model.gguf&quot;,<br>    n_gpu_layers=20,  # Offload some layers to GPU if available<br>    n_ctx=4096,       # Context window size<br>)<br><br># Generate text<br>output = llm(<br>    &quot;Explain quantum computing in simple terms:&quot;,<br>    max_tokens=200,<br>    temperature=0.7<br>)<br><br>print(output[&#39;choices&#39;][0][&#39;text&#39;])</pre><h3>5. API Servers</h3><p>llama.cpp includes a server mode that provides OpenAI-compatible API endpoints, meaning you can run local models but reuse client code written for OpenAI’s API:</p><pre># Start a server (run from the llama.cpp directory after building)<br>./build/bin/llama-server -m model.gguf --port 8080<br><br># Use it like OpenAI&#39;s API<br>curl http://localhost:8080/v1/chat/completions \<br>  -H &quot;Content-Type: application/json&quot; \<br>  -d &#39;{<br>    &quot;messages&quot;: [<br>      {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Hello!&quot;}<br>    ]<br>  }&#39;</pre><h3>The Quantization Quality Spectrum</h3><p>Which quantization level to choose depends on your use case:</p><p><strong>For Experimentation</strong> (Q2_K, Q3_K):</p><ul><li>Smallest file sizes</li><li>Fastest inference</li><li>Noticeable quality degradation</li><li>Good for testing if a model works for your purpose</li></ul><p><strong>For Production Use</strong> (Q4_K_M, Q4_K_S):</p><ul><li>Excellent balance</li><li>4–5x smaller than original</li><li>Minimal quality loss for most tasks</li><li>Most popular choice</li></ul><p><strong>For Professional Applications</strong> (Q5_K, Q6_K):</p><ul><li>Higher quality</li><li>Good for tasks requiring nuance</li><li>Still 50–70% smaller than original</li></ul><p><strong>For Maximum Quality</strong> (Q8_0, F16):</p><ul><li>Nearly identical to original</li><li>When you have the RAM and need the best results</li><li>Research or 
evaluation</li></ul><h3>Real-World Applications</h3><h3>Education</h3><p>Students can now run AI models on laptops for learning without expensive cloud bills. A programming student can have a code assistant running locally while learning.</p><h3>Privacy-Sensitive Industries</h3><p>Healthcare, legal, and financial sectors can use AI without sending sensitive data to external APIs. A law firm can analyze contracts with AI while keeping client information on premises.</p><h3>Offline Applications</h3><p>Researchers in remote locations, airlines, ships, and other disconnected environments can use AI capabilities without internet access.</p><h3>Development and Testing</h3><p>Developers can iterate quickly without API rate limits or costs. You can test thousands of prompts without worrying about your bill.</p><h3>Edge Devices</h3><p>Running AI models on smartphones, embedded systems, and IoT devices becomes feasible with GGUF’s efficiency.</p><h3>The Ecosystem Around GGUF</h3><p>The format has spawned an entire ecosystem:</p><p><strong>Model Repositories</strong>: Hugging Face hosts thousands of GGUF models, with users like TheBloke (now maintained by the community) providing pre-quantized versions of popular models.</p><p><strong>Conversion Tools</strong>: Automated tools for converting models from PyTorch, TensorFlow, and other frameworks to GGUF.</p><p><strong>Hardware Optimizations</strong>: Continuous improvements for Apple Silicon, AMD GPUs, and various CPU architectures.</p><p><strong>Community Tools</strong>: Model merging utilities, fine-tuning workflows, and performance benchmarking tools.</p><h3>Advantages Over Other Formats</h3><p>Why GGUF over alternatives like ONNX or GPTQ?</p><p><strong>Versus ONNX</strong>:</p><ul><li>GGUF is specifically optimized for LLMs, not general neural networks</li><li>Better quantization support for language models</li><li>Simpler deployment without additional dependencies</li></ul><p><strong>Versus GPTQ</strong>:</p><ul><li>GPTQ 
requires GPU for inference; GGUF works on CPUs</li><li>GGUF offers more quantization options</li><li>GGUF files are self-contained with all metadata</li></ul><p><strong>Versus Original PyTorch Models</strong>:</p><ul><li>4–8x smaller file sizes</li><li>No Python runtime required</li><li>Cross-platform compatibility without framework dependencies</li></ul><h3>Practical Tips for Working with GGUF</h3><h3>Choosing the Right Quantization</h3><p>Start with Q4_K_M for most use cases. If the quality isn’t sufficient, move up to Q5_K. If you need maximum quality and file size isn’t an issue, try Q8_0.</p><h3>Memory Considerations</h3><p>Model layers are split between system RAM and GPU VRAM based on the n_gpu_layers setting. If you have 8GB of VRAM, you might offload 20–30 layers to GPU and keep the rest in RAM.</p><h3>Context Length</h3><p>Longer context windows require more memory. A 4K context uses significantly less RAM than 8K or 16K. Start small and increase as needed.</p><h3>File Naming Conventions</h3><p>GGUF files follow naming patterns like:</p><ul><li>llama-2-7b.Q4_K_M.gguf — Model name, quantization method</li><li>mistral-7b-instruct-q5_k.gguf — Lowercase variations exist too</li></ul><p>The pattern helps you identify what you’re downloading at a glance.</p><h3>The Future of GGUF</h3><p>The format continues to evolve. Recent developments include:</p><ul><li>Support for multimodal models (combining text and images)</li><li>FlashAttention integration for faster processing</li><li>Better memory mapping for ultra-large models</li><li>Improved quantization methods balancing quality and size</li></ul><p>The format is designed to be extensible, allowing new features to be added without breaking compatibility with existing models, ensuring GGUF will remain relevant as AI technology advances.</p><h3>Getting Started Today</h3><p>Want to try GGUF yourself? 
Here’s the quickest path:</p><ol><li><strong>Download LM Studio</strong> (easiest for beginners) — it handles everything with a GUI</li><li><strong>Or install llama.cpp</strong> if you prefer command-line control</li><li><strong>Find a model</strong> on Hugging Face (search for “GGUF” in the model name)</li><li><strong>Start with a smaller model</strong> (3B-7B parameters) to see how it performs on your hardware</li><li><strong>Experiment with quantization levels</strong> to find your ideal balance</li></ol><h3>Why GGUF Matters</h3><p>In the broader context of AI democratization, GGUF represents a critical stepping stone. It proves that powerful AI doesn’t require data centers or expensive hardware. It puts sophisticated language models in the hands of students, researchers, small businesses, and individuals worldwide.</p><p>The format exemplifies open-source collaboration at its best — created by the community, for the community, and continuously improved by thousands of contributors. It’s not controlled by any single company and works with models from any source.</p><p>As AI becomes increasingly central to how we work and create, formats like GGUF ensure that the technology remains accessible, private, and under user control. 
That’s the kind of future worth building.</p><h3>Key Takeaways</h3><ul><li>GGUF is a highly optimized file format for running large AI models efficiently</li><li>Created by Georgi Gerganov and the llama.cpp team, introduced in August 2023</li><li>Uses quantization to compress models by 4–8x with minimal quality loss</li><li>Enables running sophisticated AI models on regular laptops and even smartphones</li><li>Self-contained format including all metadata, vocabulary, and model weights</li><li>Supported by a rich ecosystem of tools, applications, and thousands of models</li><li>Perfect for privacy-conscious users, offline applications, and cost-effective AI deployment</li></ul><p>The next time you see a “.gguf” file extension, you’ll know it’s not just a model — it’s an entire movement toward making AI accessible to everyone.</p><p><em>Want to explore more about GGUF and local AI models? Check out the llama.cpp project on GitHub and the GGUF model collection on Hugging Face. The community is active, welcoming, and always happy to help newcomers get started.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding Binary Files: A Beginner’s Guide to Reading Them in JavaScript]]></title>
            <link>https://pguso.medium.com/understanding-binary-files-a-beginners-guide-to-reading-them-in-javascript-ec7aaa564638?source=rss-b1afabf2359d------2</link>
            <guid isPermaLink="false">https://medium.com/p/ec7aaa564638</guid>
            <category><![CDATA[binary]]></category>
            <category><![CDATA[javascript]]></category>
            <category><![CDATA[buffer]]></category>
            <dc:creator><![CDATA[Patric]]></dc:creator>
            <pubDate>Thu, 18 Dec 2025 18:56:40 GMT</pubDate>
            <atom:updated>2025-12-18T18:56:40.632Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FSSQFHPY6BGM_gEh8MTFKA.jpeg" /></figure><p>Have you ever wondered what’s really inside an image file, a PDF, or a video? Unlike the text files you’re used to working with, these are <strong>binary files</strong> — and they speak a different language. Let’s demystify what binary files are and learn how to read them using JavaScript.</p><h3>What Exactly Is a Binary File?</h3><p>Think of files like books written in different languages. A text file is like a book written in English — you can open it with any text editor and read it directly. The words make sense because they use the same alphabet you know.</p><p>A binary file, on the other hand, is like a book written in an ancient hieroglyphic system. You can’t just open it with a text editor and expect to understand what you see. Instead, you need special tools that know how to decode those symbols.</p><p><strong>In technical terms:</strong> Binary files store data as sequences of bytes (numbers from 0 to 255) that represent anything from pixels in an image to audio frequencies to compressed data. 
They’re the most efficient way computers store complex information.</p><blockquote>A binary file is like a long row of tiny numbered boxes (bytes).<br>Each box holds a number between <strong>0 and 255</strong>.<br>Computers use these numbers as instructions to rebuild things like <strong>pictures, sounds, or videos</strong>.<br>It’s the fastest and most space-saving way for a computer to remember complex things.</blockquote><h3>Real-World Examples</h3><p>Here are binary files you encounter every day:</p><ul><li><strong>Images</strong>: .jpg, .png, .gif — each byte might represent a pixel&#39;s color</li><li><strong>Videos</strong>: .mp4, .avi — sequences of compressed video frames and audio</li><li><strong>Documents</strong>: .pdf, .docx — formatted documents with embedded images and fonts</li><li><strong>Audio</strong>: .mp3, .wav — sound wave data encoded as numbers</li><li><strong>Executables</strong>: .exe, .app — programs your computer can run</li></ul><h3>Why Can’t You Just Read Them Like Text?</h3><p>Let’s do a quick experiment. If you open a JPEG image in a text editor, you might see something like this:</p><pre>ÿØÿàJFIFÿÛC��������������������</pre><p>That gibberish happens because your text editor is trying to interpret raw bytes as letters. It’s like trying to read sheet music as if it were a novel — the symbols mean something, just not what you’re expecting them to mean.</p><h3>How JavaScript Reads Binary Files</h3><p>JavaScript gives us several powerful tools to work with binary data. Let’s explore them step by step.</p><h3>The ArrayBuffer: Your Binary Data Container</h3><p>An ArrayBuffer is like a raw storage box for binary data. It&#39;s just a fixed-length sequence of bytes sitting in memory.</p><pre>// Create a buffer that can hold 8 bytes<br>const buffer = new ArrayBuffer(8);<br>console.log(buffer.byteLength); // 8</pre><p>Think of an ArrayBuffer as a row of 8 boxes, each capable of holding one byte (a number from 0 to 255). 
But here’s the catch: you can’t directly put data into an ArrayBuffer. You need a “view” to interact with it.</p><h3>Views: Looking at Your Data in Different Ways</h3><p>This is where it gets interesting. The same binary data can be interpreted in different ways, just like how the number “1000” could mean 1000 dollars, 1000 meters, or 10:00 on a clock depending on context.</p><p>JavaScript provides different “views” to read the same ArrayBuffer:</p><pre>const buffer = new ArrayBuffer(8);<br><br>// View it as 8-bit unsigned integers (0-255)<br>const uint8View = new Uint8Array(buffer);<br>uint8View[0] = 255;<br><br>// View the SAME buffer as 16-bit unsigned integers<br>const uint16View = new Uint16Array(buffer);<br>console.log(uint16View[0]); // Reads the first 2 bytes together (255 on little-endian platforms)</pre><p>Common views include:</p><ul><li>Uint8Array — treats each byte as a number 0-255</li><li>Int16Array — treats pairs of bytes as numbers from -32,768 to 32,767</li><li>Float32Array — interprets each group of 4 bytes as a 32-bit floating-point number</li><li>DataView — lets you read different types from any byte offset in the buffer, with explicit control over byte order</li></ul><h3>Reading a Real Binary File in JavaScript</h3><p>Now let’s put this knowledge into practice. 
Here’s how you’d read an image file in the browser:</p><pre>// HTML: &lt;input type=&quot;file&quot; id=&quot;fileInput&quot;&gt;<br>document.getElementById(&#39;fileInput&#39;).addEventListener(&#39;change&#39;, async (event) =&gt; {<br>  const file = event.target.files[0];<br>  <br>  // Read the file as an ArrayBuffer<br>  const arrayBuffer = await file.arrayBuffer();<br>  <br>  // Create a view to examine the bytes<br>  const bytes = new Uint8Array(arrayBuffer);<br>  <br>  // Look at the first few bytes (the &quot;magic number&quot;)<br>  console.log(&#39;First 4 bytes:&#39;, <br>    bytes[0], bytes[1], bytes[2], bytes[3]);<br>  <br>  // PNG files start with: 137, 80, 78, 71<br>  // JPEG files start with: 255, 216, 255<br>  if (bytes[0] === 255 &amp;&amp; bytes[1] === 216) {<br>    console.log(&#39;This is a JPEG image!&#39;);<br>  }<br>});</pre><h3>The Magic Number Trick</h3><p>Professional tip: most binary files start with a “magic number” — specific bytes that identify the file type. It’s like how books have ISBN numbers. 
By checking the first few bytes, you can determine what kind of file you’re dealing with.</p><h3>A Practical Example: Building an Image Analyzer</h3><p>Let’s create something useful — a tool that tells you basic information about an uploaded image:</p><pre>async function analyzeImage(file) {<br>  const buffer = await file.arrayBuffer();<br>  const bytes = new Uint8Array(buffer);<br>  <br>  // Determine file type<br>  let fileType = &#39;Unknown&#39;;<br>  if (bytes[0] === 255 &amp;&amp; bytes[1] === 216) {<br>    fileType = &#39;JPEG&#39;;<br>  } else if (bytes[0] === 137 &amp;&amp; bytes[1] === 80) {<br>    fileType = &#39;PNG&#39;;<br>  } else if (bytes[0] === 71 &amp;&amp; bytes[1] === 73) {<br>    fileType = &#39;GIF&#39;;<br>  }<br>  <br>  return {<br>    name: file.name,<br>    size: `${(file.size / 1024).toFixed(2)} KB`,<br>    type: fileType,<br>    totalBytes: bytes.length<br>  };<br>}<br><br>// Usage:<br>const info = await analyzeImage(myFile);<br>console.log(info);<br>// Output: { name: &quot;photo.jpg&quot;, size: &quot;245.32 KB&quot;, type: &quot;JPEG&quot;, totalBytes: 251208 }</pre><h3>Reading Binary Data from the Web</h3><p>You can also fetch binary files from URLs:</p><pre>async function downloadBinaryFile(url) {<br>  const response = await fetch(url);<br>  const arrayBuffer = await response.arrayBuffer();<br>  const bytes = new Uint8Array(arrayBuffer);<br>  <br>  console.log(`Downloaded ${bytes.length} bytes`);<br>  return bytes;<br>}<br><br>// Download an image<br>const imageData = await downloadBinaryFile(&#39;https://example.com/photo.jpg&#39;);</pre><h3>Converting Between Formats</h3><p>Sometimes you need to convert binary data to other formats:</p><pre>// Binary to Base64 (useful for embedding images in HTML/CSS)<br>function binaryToBase64(bytes) {<br>  let binary = &#39;&#39;;<br>  for (let i = 0; i &lt; bytes.length; i++) {<br>    binary += String.fromCharCode(bytes[i]);<br>  }<br>  return btoa(binary);<br>}<br><br>// Base64 to 
Binary<br>function base64ToBinary(base64) {<br>  const binary = atob(base64);<br>  const bytes = new Uint8Array(binary.length);<br>  for (let i = 0; i &lt; binary.length; i++) {<br>    bytes[i] = binary.charCodeAt(i);<br>  }<br>  return bytes;<br>}</pre><h3>Common Use Cases</h3><p>Here’s when you’ll actually need to work with binary files:</p><ol><li><strong>File uploads</strong>: Reading files users select from their computer</li><li><strong>Image processing</strong>: Manipulating pixels, applying filters, or converting formats</li><li><strong>PDF generation</strong>: Creating documents programmatically</li><li><strong>Audio/Video processing</strong>: Working with media files</li><li><strong>Data compression</strong>: Creating zip files or compressing data</li><li><strong>Cryptography</strong>: Encrypting and decrypting data</li><li><strong>Network protocols</strong>: Sending/receiving binary data over WebSockets</li></ol><h3>Key Takeaways</h3><p>Binary files are everywhere, and understanding how to work with them opens up a world of possibilities in JavaScript:</p><ul><li>Binary files store data as raw bytes, not human-readable text</li><li>ArrayBuffer is your container for binary data</li><li>Typed arrays (like Uint8Array) let you read and manipulate those bytes</li><li>Different views interpret the same bytes in different ways</li><li>File magic numbers help identify file types</li><li>The File API and Fetch API make reading binary data straightforward</li></ul><h3>Where to Go from Here</h3><p>Now that you understand the basics, you can explore:</p><ul><li>Using libraries like pdfkit for PDF generation</li><li>Working with the Canvas API to manipulate image pixels</li><li>Exploring WebGL for 3D graphics (heavily reliant on binary data)</li><li>Building file converters or image processors</li><li>Learning about file compression algorithms</li></ul><p>Binary files might have seemed mysterious at first, but they’re just another way of organizing data — and now you 
have the tools to read them. Happy coding!</p><p><em>Have questions about working with binary files in JavaScript? Drop them in the comments below!</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building RAG from Scratch: Understanding AI’s Knowledge Retrieval Without the Black Boxes]]></title>
            <link>https://pguso.medium.com/building-rag-from-scratch-understanding-ais-knowledge-retrieval-without-the-black-boxes-d693d7be2b2d?source=rss-b1afabf2359d------2</link>
            <guid isPermaLink="false">https://medium.com/p/d693d7be2b2d</guid>
            <category><![CDATA[semantic-search]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[vector-database]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Patric]]></dc:creator>
            <pubDate>Sun, 30 Nov 2025 22:22:33 GMT</pubDate>
            <atom:updated>2025-11-30T22:22:33.356Z</atom:updated>
            <content:encoded><![CDATA[<p>Ever wondered how ChatGPT or Claude can answer questions about your specific documents? The secret isn’t magic – it’s Retrieval-Augmented Generation (RAG). And if you’ve ever felt lost in a sea of abstraction when trying to understand it, I’ve got good news: you can build it yourself, from scratch, and actually understand what’s happening under the hood.</p><p>I just published <strong>rag-from-scratch</strong>, an open-source educational project that demystifies RAG by walking you through building it step by step, with no cloud APIs, no black boxes – just clear explanations and local code you can run and understand.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MpAPBzj9jeL_d8Dy3XKlRA@2x.jpeg" /></figure><h3><strong>The Problem with “Just Use This Framework”</strong></h3><p>Most RAG tutorials follow a familiar pattern: import a framework, call a few functions, and voilà – magic happens. But what actually happened? How do embeddings work? Why does the retrieval sometimes fail? How can you debug or improve it?</p><p>The philosophy behind this project is simple: <strong>if you can explain it, you can build it. If you can build it, you can improve it. 
</strong>This is the same approach I took with my previous project, AI Agents from Scratch, which helped developers understand agentic AI by building it themselves.</p><h3>What You’ll Actually Learn</h3><p>RAG isn’t rocket science, but it involves several moving parts that work together:</p><p><strong>The Core Pipeline:</strong></p><ol><li><strong>Knowledge Requirements</strong> – Define what questions you need to answer and what data you need</li><li><strong>Data Loading</strong> – Import and structure your documents</li><li><strong>Text Splitting</strong> – Divide documents into manageable chunks</li><li><strong>Embedding</strong> – Convert text into numerical vectors that capture meaning</li><li><strong>Vector Store</strong> – Index embeddings for fast similarity search</li><li><strong>Retrieval</strong> – Fetch the most relevant context for a query</li><li><strong>Re-Ranking</strong> – Improve precision by reordering results</li><li><strong>Augmentation</strong> – Merge retrieved context into the LLM’s prompt</li><li><strong>Generation</strong> – Produce grounded answers using a local LLM</li></ol><p>Each step is crucial, and each step is demystified in this repository.</p><h3>A Learning Path That Actually Works</h3><p>The repository is structured as a progressive learning journey. You don’t start by building a production system – you start by understanding the fundamentals:</p><h4>How RAG Really Works</h4><p>Before touching embeddings or vector databases, you’ll see RAG in action with a minimal simulation in under 70 lines of code. This uses naive keyword search to demonstrate the core concept: retrieve context, then generate an answer. It’s simple, but it crystallizes the fundamental idea.</p><h4>Understanding Embeddings</h4><p>Instead of treating embeddings as a black box, you’ll learn the math behind them. How does “king – man + woman ≈ queen” actually work? What is cosine similarity, and why does it matter? 
You’ll implement text similarity from scratch before using any libraries.</p><h4>Building Your Own Vector Store</h4><p>You’ll build an in-memory vector store that actually stores embeddings and performs nearest-neighbor search. No magic – just arrays, distance calculations, and indexing logic you can see and understand.</p><h4>Advanced Retrieval Strategies</h4><p>Once you understand the basics, you’ll level up with techniques that dramatically improve results:</p><ul><li>Query preprocessing and normalization</li><li>Hybrid search strategies</li><li>Multi-query retrieval</li><li>Post-retrieval re-ranking to reduce noise</li></ul><p>Each example includes three things:</p><ol><li>Working code (`example.js`)</li><li>A detailed code explanation (`CODE.md`)</li><li>A conceptual explanation (`CONCEPT.md`)</li></ol><p>Nothing is hidden. Every function is explained. Every concept is broken down.</p><h3>Why Local? Why No Cloud APIs?</h3><p>This project runs entirely on your machine using local LLMs (via `node-llama-cpp`). Why?</p><ol><li><strong>True Understanding</strong> – When you run code locally, you can debug it, inspect it, and truly understand what’s happening at each step</li><li><strong>No Costs </strong>– Experiment freely without worrying about API bills</li><li><strong>Privacy</strong> – Your documents never leave your machine</li><li><strong>Complete Control</strong> – Modify, extend, and customize every component</li></ol><p>This isn’t about building the fastest or most scalable RAG system. It’s about building understanding.</p><h3>What’s Available Now (and What’s Coming)</h3><p>The repository is actively being developed with an educational-first approach. 
Currently available:</p><p>✅ Core concepts (how RAG works, LLM basics)</p><p>✅ Data loading and text splitting.</p><p>✅ Embeddings and similarity.</p><p>✅ Vector store implementation.</p><p>✅ Basic retrieval strategies.</p><p>Coming soon:</p><p>🚧 Advanced retrieval techniques.</p><p>🚧 Prompt engineering for RAG.</p><p>🚧 Evaluation metrics.</p><p>🚧 Graph database integration.</p><p>🚧 Production-ready templates.</p><p>Each topic will be added thoughtfully, with the same commitment to clarity and depth.</p><h3>Why This Matters</h3><p>We’re in an era where AI is rapidly becoming commoditized through APIs and frameworks. That’s powerful, but it creates a generation of developers who can <em>use</em> AI without truly <em>understanding</em> it. When things break (and they will), or when you need to optimize for your specific use case, that understanding becomes critical.</p><p>RAG is one of the most practical applications of LLMs today – it’s how we make AI useful for real-world knowledge tasks. Understanding how it works, not just how to call it, makes you a better AI engineer.</p><h3>Try It Yourself</h3><p>Getting started is simple:</p><pre>git clone https://github.com/pguso/rag-from-scratch.git<br><br>cd rag-from-scratch<br><br>npm install<br><br>node examples/00_how_rag_works/example.js</pre><h3>Join the Journey</h3><p>This project is open source and welcomes contributions. If you have a clear, educational example or improvement, pull requests are encouraged. The goal is to build the best educational resource for understanding RAG, one example at a time.</p><p>RAG doesn’t have to be a black box. You can understand it. You can build it. And once you do, you’ll be equipped to improve it, debug it, and adapt it to your needs.</p><p><strong>Check out the repository</strong>: <a href="https://github.com/pguso/rag-from-scratch">https://github.com/pguso/rag-from-scratch</a></p><p>If you can explain it, you can build it. 
If you can build it, you can improve it.</p><p>Let’s demystify RAG together.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Every AI Agent Tutorial Skips the Fundamentals. So I Built Them.]]></title>
            <link>https://pguso.medium.com/every-ai-agent-tutorial-skips-the-fundamentals-so-i-built-them-effe9befeb42?source=rss-b1afabf2359d------2</link>
            <guid isPermaLink="false">https://medium.com/p/effe9befeb42</guid>
            <category><![CDATA[nodejs]]></category>
            <category><![CDATA[javascript]]></category>
            <category><![CDATA[ai-agent]]></category>
            <category><![CDATA[llm-applications]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[Patric]]></dc:creator>
            <pubDate>Mon, 27 Oct 2025 07:53:06 GMT</pubDate>
            <atom:updated>2025-10-27T07:53:06.006Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NJVbUHBW9cFem9tvG2XcoA.jpeg" /></figure><p>Four days ago, I published a GitHub repository. I expected maybe a few stars, some polite feedback. Instead, it exploded to over 700 stars, hundreds of upvotes on Reddit, and emails from developers saying “this is exactly what I needed.”</p><p>This isn’t a story about going viral. It’s about the lonely, frustrating journey that led there — and why sometimes the hardest path teaches you the most.</p><h3>The Problem Nobody Talks About</h3><p>Here’s what no one tells you when you start learning about AI agents:</p><p>You can follow every tutorial. Copy-paste every code snippet. Get everything working. And still have absolutely no idea what you’re doing.</p><p>I was that person. For months.</p><p>Every resource I found did the same thing — jumped straight into LangChain or CrewAI or some other framework. “Just use this library,” they’d say. “Look how easy it is!” And it was easy. Until something broke.</p><p>Then I was completely lost.</p><p>Was it the framework? The prompt? The model? The way I structured my code? I had no mental model. No understanding of what was actually happening under those nice, clean APIs.</p><p>I couldn’t debug. I couldn’t customize. I couldn’t build anything beyond what the tutorials showed me.</p><h3>The Breaking Point</h3><p>The moment I decided to change everything was when I spent three hours debugging an agent that wouldn’t use a tool properly. Three hours of tweaking prompts, reading documentation, checking GitHub issues.</p><p>I never figured it out. I just tried a different framework and hoped for the best.</p><p>That’s when I realized: I didn’t want to just use AI agents. I wanted to understand them.</p><h3>Starting from Zero</h3><p>So I did something that probably seems obvious in hindsight: I started over. From scratch.</p><p>No LangChain. No CrewAI. 
No frameworks at all.</p><p>Just me, node-llama-cpp, local models, and a lot of documentation reading.</p><p>The first few weeks were brutal. Without the framework abstractions, I had to figure out everything:</p><ul><li>How does the model actually receive function definitions?</li><li>What format does function calling really use?</li><li>How does memory work at a fundamental level?</li><li>What is the ReAct pattern actually doing?</li></ul><p>I made a spreadsheet of every agent concept I wanted to understand. Then I built tiny, focused examples for each one. No fancy features. No production-ready code. Just the absolute minimum needed to understand the concept.</p><p>Example 1: A basic LLM call. That’s it.<br>Example 2: System prompts and specialization.<br>Example 3: Streaming responses.</p><p>Each one built on the last. Each one forced me to understand one more piece of the puzzle.</p><h3>The “Aha!” Moments</h3><p>Around week six, something shifted.</p><p>I was building an example for function calling when it finally clicked. Function calling isn’t magic. It’s just structured output. The model returns JSON that matches a schema you provide, and you parse it and execute code.</p><p>That’s it. That’s the whole thing.</p><p>But understanding that one simple fact changed everything. Suddenly I could debug function calling issues. I could customize the behavior. I could see why frameworks did things certain ways.</p><p>Then ReAct patterns made sense. Memory systems made sense. Tool chaining made sense.</p><p>It was like learning to read. Once you understand the fundamentals, everything else is just combinations of things you already know.</p><h3>Why I’m Sharing This</h3><p>After months of trial and error, I had dozens of examples. Some were dead ends. Some were too complex. Some taught me a lot but wouldn’t help anyone else.</p><p>So I did something harder than building: I curated.</p><p>I picked eight examples that formed a perfect learning path. 
Each one focused on a single fundamental concept. Each one built naturally on the last. Each one I polished until the code was as clear as I could make it — not production-ready, but teaching-ready.</p><p>Plain JavaScript. No framework magic. Just the concepts you absolutely need to understand.</p><p>I almost didn’t publish it. “Who would want this?” I thought. “Everyone just uses frameworks.”</p><p>But I remembered how lost I felt. How every tutorial assumed knowledge I didn’t have. How desperately I wanted someone to just explain the fundamentals without jumping to abstractions.</p><p>So I published those eight examples, hoping they’d be a starting point. My vision wasn’t to create the definitive resource — it was to plant a seed. Let the community add or ask for examples where they see gaps. Let it evolve into the resource developers actually need to understand agents deeply before they jump into frameworks.</p><p>A living tutorial, shaped by the people learning from it.</p><p>So I put it on GitHub: <a href="https://github.com/pguso/ai-agents-from-scratch">ai-agents-from-scratch</a></p><h3>What Happened Next</h3><p>The response shocked me.</p><p>Within four days:</p><ul><li>734 GitHub stars</li><li>76 forks</li><li>495 upvotes on Reddit</li><li>Dozens of comments from developers saying “this is exactly what I needed,” plus emails and LinkedIn messages</li></ul><p>But the message that meant the most came from a team lead who went through the entire tutorial and sent me detailed feedback on every example. He’s sharing it with his team.</p><p>That’s when I knew: I wasn’t alone in my frustration.</p><p>There are thousands of developers who want to understand AI agents, not just use them. Who want to know what’s happening under the hood. 
Who learn by building.</p><h3>What You’ll Actually Learn</h3><p>The repository covers eight progressive examples:</p><ol><li><strong>Basic LLM interaction</strong> — Understanding the foundation</li><li><strong>System prompts</strong> — Making specialized agents</li><li><strong>Streaming</strong> — Handling real-time responses</li><li><strong>Translation agent</strong> — Applying concepts to real tasks</li><li><strong>Function calling</strong> — The core of agent behavior</li><li><strong>Batch processing</strong> — Handling multiple tasks efficiently</li><li><strong>ReAct agent</strong> — The reasoning and acting pattern</li><li><strong>Memory systems</strong> — Making agents remember context</li></ol><p>Everything runs locally. You need Node.js and a GGUF model (I use Qwen 1.7B, which runs on modest hardware). No API keys. No cloud costs. Just you and the fundamentals.</p><p>Each example includes:</p><ul><li>Heavily commented code that explains every decision</li><li>Concept explanations that connect to the bigger picture</li><li>Suggestions for experimentation and extension</li></ul><h3>The Philosophy</h3><p>Here’s what makes this different from other tutorials:</p><p><strong>No frameworks.</strong> You see exactly what’s happening at every step. No black boxes.</p><p><strong>Progressive complexity.</strong> Each example introduces one new concept. No overwhelming you with everything at once.</p><p><strong>Local-first.</strong> Run everything on your machine. Experiment without worrying about costs or rate limits.</p><p><strong>Explanation over efficiency.</strong> The code isn’t optimized. 
It’s optimized for understanding.</p><h3>Who This Is For</h3><p>You might find this useful if:</p><ul><li>You’ve used LangChain but don’t understand what it’s doing</li><li>You want to build custom agents but don’t know where to start</li><li>You’re tired of tutorials that skip the fundamentals</li><li>You learn by actually building things</li><li>You want to know why patterns work, not just that they work</li></ul><h3>What I Learned About Learning</h3><p>Building this taught me something important about technical education:</p><p>Sometimes the fastest path to understanding is the slowest path to results.</p><p>Frameworks are amazing. They let you build complex systems quickly. But if you start with frameworks, you’re building on a foundation you don’t understand.</p><p>When you build from scratch — even if it’s harder, even if it takes longer — you develop intuition. You understand trade-offs. You can debug. You can customize. You can innovate.</p><p>The irony is that now I use frameworks all the time. But I use them differently. I know when to lean on them and when to go around them. I can read their source code and understand what’s happening. I can contribute improvements.</p><p>That only happened because I took the time to understand the fundamentals.</p><h3>Start Here</h3><p>If you’re ready to really understand AI agents, here’s what I’d suggest:</p><ol><li>Clone the repository</li><li>Download a small GGUF model (instructions are in DOWNLOAD.md)</li><li>Start with intro.js and work through the examples in order</li><li>Don’t rush. Take time to modify and experiment with each one</li><li>Break things. That’s how you learn what’s actually happening.</li></ol><p>The journey from confusion to clarity isn’t quick. But it’s worth it.</p><h3>A Final Thought</h3><p>When I started this journey, I thought I was alone in my frustration. The response to this repository showed me I wasn’t.</p><p>There are thousands of us trying to understand this technology deeply. 
Not just to use it, but to build on it. To push it forward. To know how it really works.</p><p>If you’re one of those people, this is for you.</p><p>Let me know what you build.</p><p><em>Find the repository at: </em><a href="https://github.com/pguso/ai-agents-from-scratch"><em>github.com/pguso/ai-agents-from-scratch</em></a></p><p><em>Questions? Issues? Contributions? The repo is actively maintained and I read every comment.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Still Using Google Colab? It’s Time to Grow Up]]></title>
            <link>https://pguso.medium.com/still-using-google-colab-its-time-to-grow-up-8a0adcfd11d8?source=rss-b1afabf2359d------2</link>
            <guid isPermaLink="false">https://medium.com/p/8a0adcfd11d8</guid>
            <category><![CDATA[jupyter-notebook]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[google-colab]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Patric]]></dc:creator>
            <pubDate>Fri, 03 Oct 2025 14:08:19 GMT</pubDate>
            <atom:updated>2025-10-03T14:08:19.818Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oFCtv6cOQJR03oGUuvGi0Q.jpeg" /></figure><p>Look, we need to talk. I know Google Colab was there for you when you were just starting out. It was free, it was simple, and it gave you your first taste of GPU computing without asking for a credit card. That’s beautiful. Really.</p><p>But let’s be honest: you’re not a beginner anymore. You’re doing serious work now. And yet, you’re still sitting there, refreshing your browser every 30 minutes to keep your session alive, praying that your training run doesn’t get preempted at hour 11 of your 12-hour limit. You’re managing a spreadsheet to track your “compute units” like some kind of medieval currency exchange. You’re emailing .ipynb files to your teammates like it&#39;s 2015.</p><p>It’s time to meet Modal Notebooks. And no, this isn’t just another cloud notebook. This is what happens when someone actually <em>thinks</em> about developer experience in 2025.</p><h3>The Reality Check: What You’re Actually Paying</h3><p>Let’s start with the uncomfortable truth about Google Colab’s pricing. Because once you scratch beneath that “free tier” marketing veneer, things get… interesting.</p><h3>Google Colab’s Compute Unit Shell Game</h3><p>Colab charges you in “compute units” that cost $0.10 each, bundled in packs of 100 for $10. The T4 GPU burns through 1.96 units per hour, the V100 uses 5 units per hour, and the A100 demolishes 15 units per hour. But here’s the kicker: even installing Python libraries consumes your compute units. Yes, you read that right. 
Setting up your environment costs money.</p><p>Let’s do the math for a typical research workflow:</p><p><strong>Google Colab Pro ($10/month + compute units):</strong></p><ul><li>Base subscription: $9.99/month</li><li>100 compute units included</li><li><strong>T4 GPU:</strong> ~51 hours maximum (but good luck getting guaranteed access)</li><li><strong>V100 GPU:</strong> ~20 hours maximum</li><li><strong>A100 GPU:</strong> ~6.7 hours maximum</li><li>After you burn through units: Pay another $10 for 100 more</li><li><strong>Reality check:</strong> Environment setup, idle time, and failed experiments all eat your units</li></ul><p><strong>Effective hourly rates once your “free” units run out:</strong></p><ul><li>T4: ~$0.20/hour (but availability not guaranteed)</li><li>V100: ~$0.50/hour</li><li>A100: ~$1.50/hour</li></ul><h3>Modal: Actual Transparent Pricing</h3><p>Modal gives you $30 in free compute credits every month, and then you pay only for the exact compute you use, measured per second. No subscriptions. No compute unit conversion charts. No PhD in pricing models required.</p><p><strong>Modal Notebooks (Pay for what you use):</strong></p><ul><li><strong>$30 free credits monthly</strong> (that’s triple Colab’s effective credits)</li><li>T4 GPU: <strong>$0.59/hour</strong> ($0.000164/second)</li><li>L4 GPU: <strong>$0.80/hour</strong> ($0.000222/second)</li><li>A10G GPU: <strong>$1.10/hour</strong> ($0.000306/second)</li><li>A100 (40GB): <strong>$2.10/hour</strong> ($0.000583/second)</li><li>A100 (80GB): <strong>$2.50/hour</strong> ($0.000694/second)</li><li>H100: <strong>$3.95/hour</strong> ($0.001097/second)</li></ul><p><strong>The difference?</strong> You only pay when your kernel is actually running. No zombie sessions draining your wallet. When you stop, you stop paying. Immediately.</p><h3>The Cost Comparison: Real World Scenarios</h3><h3>Scenario 1: The Weekend Warrior</h3><p>You’re fine-tuning a model over the weekend. 
Let’s say 20 hours on an A100.</p><p><strong>Colab Pro:</strong></p><ul><li>Base subscription: $9.99 (includes 100 compute units)</li><li>Compute units needed: 300 units (20 hours × 15 units/hour), so 200 more than the subscription includes</li><li>Cost: $9.99 + $20 = <strong>$29.99</strong></li><li><em>Plus</em> you spent units just installing dependencies</li></ul><p><strong>Modal:</strong></p><ul><li>20 hours on A100 (40GB): $2.10 × 20 = <strong>$42.00</strong></li><li>But you have $30 in free credits = <strong>$12.00 total</strong></li><li>And you didn’t pay for setup time or idle sessions</li></ul><h3>Scenario 2: The Production Researcher</h3><p>You’re running experiments across different GPU types, switching as needed for optimal cost/performance.</p><p><strong>Colab Pro:</strong></p><ul><li>You’re locked into whatever GPU they give you</li><li>T4 availability under Colab Pro is not guaranteed, often necessitating costlier alternatives</li><li>You can’t easily switch mid-session</li><li>You’re paying compute units while you context-switch between notebooks</li></ul><p><strong>Modal:</strong></p><ul><li>Switch GPU types in under 5 seconds</li><li>Run T4 for data preprocessing: $0.59/hour</li><li>Switch to H100 for training: $3.95/hour</li><li>Drop back to L4 for inference testing: $0.80/hour</li><li><strong>Pay only for what each job actually needs</strong></li></ul><h3>Scenario 3: The Team Player</h3><p>Your research team of 4 people needs to collaborate on model development.</p><p><strong>Colab Pro:</strong></p><ul><li>4 × $9.99 = $39.96/month in subscriptions</li><li>4 × 100 compute units = 400 units = ~27 hours on A100 total</li><li>Collaboration means emailing notebooks back and forth</li><li>Everyone maintains their own environment</li><li>Total chaos when someone’s units run out mid-week</li></ul><p><strong>Modal:</strong></p><ul><li>True collaborative editing with multiple cursors and live edits, like Google Docs</li><li>4 × $30 = $120 in free credits monthly</li><li>That’s <strong>48 hours of A100 (80GB) time free</strong>, or 
203 hours of T4</li><li>Shared Volumes, Secrets, and Functions across the entire team</li><li>Everyone sees the same environment, instantly</li></ul><h3>Beyond Pricing: Why Modal Actually Works Better</h3><h3>Cold Start Times That Don’t Make You Age</h3><p>Modal Notebooks boot in under 5 seconds, even with custom container images and GPU allocation. Five. Seconds.</p><p>Colab? Cloud instances can take minutes to spin up. You know the drill: click “Connect,” go make coffee, come back, hope it worked.</p><h3>No More Session Babysitting</h3><p>Colab free tier limits sessions to 12 hours maximum, and even paid tiers enforce limits. You’re constantly watching the clock, manually saving checkpoints, praying your training completes.</p><p>Modal kernels auto-idle and resume, so you only pay when they’re actually running. Close your laptop. Go home. Come back tomorrow. Your work is right where you left it, and you didn’t pay a cent while you were gone.</p><h3>Real Collaboration, Not File Tennis</h3><p>Be honest: how do you currently share notebooks with your team? Email? Slack? Google Drive? Then someone makes changes, you make changes, now there are three versions floating around, and nobody knows which one has the latest results?</p><p>Modal offers true real-time collaboration where multiple users can edit and run cells simultaneously, seeing each other’s cursors in real-time. It’s 2025. Your collaboration tools should work like Google Docs, not like sneakernet.</p><h3>The Path to Production Isn’t a Rewrite</h3><p>Here’s the thing that really matters: Modal Notebooks integrate with the same Volumes, Secrets, and deployed Functions as your production Modal Apps. Your notebook experiment isn’t some isolated sandbox that you’ll need to completely rewrite to deploy. It’s already running on the same infrastructure as your production code.</p><p>That model you just trained? Export it to a Modal App with one click. Now it’s a production API endpoint. 
No translation needed.</p><h3>GPUs That Actually Exist</h3><p>Modal lets you scale up to 8× H100 or B200 GPUs. When’s the last time you got access to cutting-edge hardware on Colab? The types of GPUs available in Colab vary over time, and premium GPUs are subject to availability. Translation: you might get a T4 when you paid for something better.</p><p>Modal? You pick the GPU. It spins up. Every time.</p><h3>The Bottom Line</h3><p>Google Colab was revolutionary when it launched. It democratized access to GPU computing and helped countless students and researchers get started with deep learning. That’s genuinely wonderful.</p><p>But “good for getting started” isn’t the same as “good for serious work.”</p><p>You’ve outgrown it. Your projects are more complex. Your deadlines are real. Your team needs to collaborate. Your experiments need reproducibility. Your models need a path to production.</p><p>Modal gives you $30 in free compute credits monthly — more than Colab’s effective free tier. After that, you pay transparent per-second pricing with no unit conversion gymnastics. You get sub-5-second cold starts. You get real-time collaboration. You get a direct path from research to production. You get to actually choose your hardware.</p><p>Most importantly: you get to stop babysitting sessions, stop tracking compute units in spreadsheets, and stop apologizing to your teammates about “sorry, my session died again.”</p><p>It’s time to grow up. Your compute environment should too.</p><p><em>Ready to make the switch?</em> Get started with Modal Notebooks at <a href="https://modal.com/notebooks">modal.com/notebooks</a> and use your $30 in free monthly credits to see the difference yourself. Your future self will thank you.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding Attention Mechanisms: The Secret Sauce Behind Modern AI]]></title>
            <link>https://pguso.medium.com/understanding-attention-mechanisms-the-secret-sauce-behind-modern-ai-9508b9c122eb?source=rss-b1afabf2359d------2</link>
            <guid isPermaLink="false">https://medium.com/p/9508b9c122eb</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[attention-mechanism]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[chatgpt]]></category>
            <dc:creator><![CDATA[Patric]]></dc:creator>
            <pubDate>Fri, 03 Oct 2025 13:07:35 GMT</pubDate>
            <atom:updated>2025-10-03T13:15:54.082Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3kpsY6pKRVwSXGQ49QL-yw.jpeg" /></figure><h4>A Step-by-Step Guide to Self-Attention with Working Code</h4><p>If you’ve ever wondered how ChatGPT and other large language models understand context so well, the answer lies in a deceptively simple yet powerful concept: <strong>attention mechanisms</strong>.</p><p><strong>The best part? You can build the core mechanism from scratch in just 50 lines of code.</strong></p><p>Let’s dive into how this breakthrough technology works — and we’ll even code it from scratch using PyTorch.</p><h3>Before You Begin</h3><p><strong>You’ll need:</strong></p><ul><li>Basic Python (reading code, understanding variables and functions)</li><li>High school math (what vectors are, basic multiplication)</li></ul><p><strong>You don’t need:</strong></p><ul><li>Deep learning experience</li><li>PyTorch knowledge</li></ul><h3>The Challenge: Understanding Long Conversations</h3><p>Imagine you’re at a dinner party, listening to someone tell a long, winding story about their vacation. As they talk about what happened on day seven, you need to remember details from day one to understand the full picture. Now imagine trying to compress that entire story into a single mental note — you’d lose crucial details, especially from the beginning.</p><p>This was exactly the problem early AI systems faced when processing language.</p><h3>The Old Approach: Encoder-Decoder Architecture</h3><p>Before transformers took over, recurrent neural networks (RNNs) were the go-to technology for tasks like language translation. Here’s how they worked:</p><p><strong>The Encoder</strong> would read an input sentence word by word — let’s say “The restaurant serves incredible pasta” — building understanding as it went. 
By the final word, it compressed the entire meaning into a single summary vector.</p><p><strong>The Decoder</strong> would take this compressed summary and generate the output, producing one word at a time in the target language.</p><p>Think of it like this: reading an entire news article (encoding), then explaining it to a friend (decoding) based solely on what you remember.</p><h3>The Fatal Flaw</h3><p>Here’s the problem: squashing an entire sentence into one vector is like trying to capture a symphony in a single musical note. You inevitably lose information — particularly from earlier parts of longer sequences.</p><p>For short sentences like “Hello, how are you?” this worked fine. But for complex sentences with multiple clauses and subtle meanings? The model would forget important context by the time it reached the end.</p><p><strong>The key insight:</strong> RNNs had a fundamental bottleneck, and this limitation sparked one of AI’s biggest breakthroughs.</p><h3>The Breakthrough: Self-Attention</h3><p>In 2017, researchers introduced the transformer architecture with a revolutionary idea: <strong>what if we didn’t need to compress everything into one vector at all?</strong></p><p>Instead of forcing words through a narrow bottleneck, self-attention allows every word to directly interact with every other word in the sequence. Each word can “look at” and gather information from all the others simultaneously.</p><h3>A Practical Example</h3><p>Consider the sentence: <strong>“The chef prepared the meal because she loved cooking.”</strong></p><p>When processing the word “she,” a human immediately knows it refers to “chef.” Self-attention gives AI this same ability — the word “she” can directly “attend to” or focus on “chef” to understand the connection.</p><p>This doesn’t just work for pronouns. 
Every word examines every other word to understand:</p><ul><li>Which words are most relevant to its meaning</li><li>How it relates to the overall sentence structure</li><li>What context it needs to be properly understood</li></ul><h3>Why “Self” Attention?</h3><p>The “self” in self-attention is important — it means the mechanism examines relationships <strong>within a single sequence</strong>.</p><p><strong>Self-attention:</strong> Looking at words within one sentence (like “The chef prepared the meal”)</p><p><strong>Regular attention:</strong> Comparing words between two different sequences (like matching English words to their French translations)</p><p>For building language models that predict the next word, self-attention is what we need.</p><h3>Coding Self-Attention From Scratch</h3><p>Let’s implement a simple version of self-attention to see exactly how it works. We’ll use the sentence:</p><p><strong>“Music brings people pure joy”</strong></p><p>Each word will be represented as a 3-dimensional embedding vector (in real models, these are typically 768 or more dimensions).</p><h4>Step 1: Set Up the Input Embeddings</h4><pre>import torch<br><br># Input embeddings: each row represents one word<br>inputs = torch.tensor(<br>    [[0.21, 0.45, 0.78],  # Music   (x^1)<br>     [0.63, 0.29, 0.91],  # brings  (x^2)<br>     [0.48, 0.72, 0.34],  # people  (x^3)<br>     [0.85, 0.19, 0.56],  # pure    (x^4)<br>     [0.37, 0.88, 0.42]]  # joy     (x^5)<br>)<br>print(&quot;Input shape:&quot;, inputs.shape)  # torch.Size([5, 3])</pre><p>Each row is a word’s embedding — a vector of numbers that represents its meaning in a high-dimensional space.</p><h4>Step 2: Calculate Attention Scores for One Query Word</h4><p>Let’s focus on the word <strong>“brings”</strong> and see how much it should attend to each word in the sentence.</p><pre># Select &quot;brings&quot; as our query word (index 1)<br>query_word = inputs[1]<br><br># Calculate attention scores: dot product with all 
words<br>attention_scores = torch.matmul(inputs, query_word)<br>print(&quot;Attention scores for &#39;brings&#39;:&quot;)<br>print(attention_scores)<br><br># Output: tensor([0.9726, 1.3091, 0.8206, 1.1002, 0.8705])</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QybJPfKtlENeXbgkYAY0TA.jpeg" /></figure><blockquote>The dot product measures how aligned two vectors are: multiply their corresponding numbers together and add up the products. The result is a single number, and for vectors of similar length a higher value means they point in more similar directions.</blockquote><p><strong>What’s happening here?</strong></p><p>For each word, we compute the dot product with the query vector for “brings”:</p><ul><li>Music: (0.21 × 0.63) + (0.45 × 0.29) + (0.78 × 0.91) = 0.9726</li><li>brings: (0.63 × 0.63) + (0.29 × 0.29) + (0.91 × 0.91) = 1.3091</li><li>people: (0.48 × 0.63) + (0.72 × 0.29) + (0.34 × 0.91) = 0.8206</li><li>pure: (0.85 × 0.63) + (0.19 × 0.29) + (0.56 × 0.91) = 1.1002</li><li>joy: (0.37 × 0.63) + (0.88 × 0.29) + (0.42 × 0.91) = 0.8705</li></ul><p>Higher scores indicate stronger relationships. Notice “brings” has the highest score with itself (1.3091), which makes sense: a vector is always perfectly aligned with itself.</p><h4>Step 3: Normalize Scores into Attention Weights</h4><p>Raw scores aren’t very useful on their own. We need to convert them into positive weights that sum to 1.0, and the softmax function does exactly that:</p><pre># Convert scores to normalized weights (probability distribution)<br>attention_weights = torch.softmax(attention_scores, dim=0)<br><br>print(&quot;Attention weights:&quot;)<br>print(attention_weights)<br>print(f&quot;Sum of weights: {attention_weights.sum():.4f}&quot;)  # Should equal 1.0<br><br># Output:<br># tensor([0.1887, 0.2643, 0.1621, 0.2144, 0.1704])<br># Sum of weights: 1.0000</pre><p>Now we have a probability distribution! 
These weights tell us: “When understanding ‘brings,’ pay 26.4% attention to itself, 21.4% to ‘pure,’ 18.9% to ‘Music,’ and so on.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mUaNqXnNMFDXHSzyYuyPBA.jpeg" /></figure><h4>Step 4: Create the Context Vector</h4><p>The context vector is a weighted combination of all input embeddings:</p><pre># Compute context vector for &quot;brings&quot;<br># This is a weighted sum of all word embeddings<br>context_vector = torch.matmul(attention_weights, inputs)<br><br>print(&quot;Context vector for &#39;brings&#39;:&quot;)<br>print(context_vector)<br><br># Output: tensor([0.5293, 0.4690, 0.6345])</pre><p>This context vector is richer than the original embedding for “brings” because it incorporates information from the entire sentence!</p><h4>Step 5: Scale to All Words at Once</h4><p>In practice, we want context vectors for every word simultaneously. We can compute all attention scores in one matrix multiplication:</p><pre># Compute attention scores for ALL query-key pairs at once<br># Result is a 5×5 matrix where element [i, j] shows how much<br># word i should attend to word j<br>attention_scores = torch.matmul(inputs, inputs.T)<br><br>print(&quot;Attention scores (all words):&quot;)<br>print(attention_scores)<br># Output: 5×5 matrix showing all word-to-word relationships<br><br># Normalize each row to get attention weights<br># Each row sums to 1.0, creating a probability distribution per word<br>attention_weights = torch.softmax(attention_scores, dim=-1)<br><br>print(&quot;\nAttention weights matrix (5×5):&quot;)<br>print(attention_weights)<br>print(f&quot;\nSum of each row: {attention_weights.sum(dim=1)}&quot;)<br># Each of the five row sums is 1 (up to floating-point rounding)<br><br># Compute all context vectors in one operation<br>all_context_vectors = attention_weights @ inputs<br><br>print(&quot;\nAll context vectors:&quot;)<br>print(all_context_vectors)<br># Output: 5×3 matrix where each row is a 
word&#39;s enriched context vector</pre><p><strong>What we’ve achieved:</strong></p><p>We started with 5 word embeddings (5×3 matrix) and transformed them into 5 context vectors (also 5×3), where each context vector contains information from all words in the sentence, weighted by relevance.</p><h3>Understanding the Output</h3><p>Let’s break down what the attention weights matrix tells us (values rounded to two decimals):</p><pre>          Music  brings  people  pure   joy<br>Music     [0.21,  0.24,  0.18,  0.18,  0.20]<br>brings    [0.19,  0.26,  0.16,  0.21,  0.17]<br>people    [0.18,  0.20,  0.21,  0.18,  0.23]<br>pure      [0.17,  0.25,  0.17,  0.24,  0.17]<br>joy       [0.18,  0.20,  0.21,  0.17,  0.24]</pre><p>Each row shows how much that word attends to every word in the sentence, itself included. For example:</p><ul><li>“brings” pays the most attention to itself (26%), followed by “pure” (21%)</li><li>“Music” attends slightly more to “brings” (24%) than to itself (21%), because raw dot products favor embeddings with larger magnitude</li><li>Each word considers the full sentence context when forming its meaning</li></ul><h3>Understanding the Context Vectors</h3><p>Let’s break down what the final context vectors represent:</p><pre>         Original Input      →  Context Vectors (enriched)<br>Music:   [0.21, 0.45, 0.78]  →  [0.50, 0.50, 0.62]<br>brings:  [0.63, 0.29, 0.91]  →  [0.53, 0.47, 0.63]<br>people:  [0.48, 0.72, 0.34]  →  [0.51, 0.53, 0.59]<br>pure:    [0.85, 0.19, 0.56]  →  [0.54, 0.47, 0.62]<br>joy:     [0.37, 0.88, 0.42]  →  [0.50, 0.54, 0.59]</pre><p>Each context vector is now a blend of all words in the sentence, weighted by the attention weights in its row. 
For example:</p><ul><li>“Music” started as [0.21, 0.45, 0.78] but now incorporates 24% of “brings,” 18% of “people,” and so on</li><li>“brings” transformed from [0.63, 0.29, 0.91] into a vector that includes 26% of itself, 21% of “pure,” and 17% of “joy”</li><li>Each word’s context vector is no longer isolated: it now carries information about the entire sentence’s meaning</li></ul><p>This is the magic of self-attention: words that started with different meanings now have representations that are contextually aware of their neighbors. Because these toy embeddings are hand-picked and nothing is learned yet, the weights are fairly even and the context vectors land close to one another; trained models typically produce much sharper attention patterns.</p><h3>The Real-World Impact</h3><p>This simple mechanism, letting words attend to each other, solved the bottleneck problem that plagued RNNs. Instead of compressing everything into one vector and hoping nothing important gets lost, transformers maintain all information and let the model decide what’s relevant at each step.</p><h3>What We Left Out (For Now)</h3><p>This simplified version works, but real transformers add several enhancements:</p><ol><li><strong>Query, Key, and Value transformations:</strong> Instead of using raw embeddings, we apply learned weight matrices to create specialized query, key, and value vectors</li><li><strong>Multiple attention heads:</strong> The model can focus on different types of relationships simultaneously (syntax, semantics, etc.)</li><li><strong>Scaled dot-product attention:</strong> We divide the scores by the square root of the key dimension so that large values do not saturate the softmax</li><li><strong>Causal masking:</strong> For language generation, we prevent words from attending to future words</li></ol><p>But the core concept remains exactly what we’ve implemented: computing how much each word should attend to every other word, then creating enriched representations based on those attention weights.</p><h3>The Bottom Line</h3><p>Self-attention is elegant in its simplicity. 
With just a few matrix multiplications, we can capture complex relationships between words that would be impossible to encode in a single fixed-size vector.</p><p>This breakthrough enabled:</p><ul><li>GPT models that can write coherent long-form content</li><li>Translation systems that handle complex sentences with ease</li><li>AI assistants that maintain context throughout long conversations</li></ul><p>Now that you’ve implemented it yourself, you understand the core mechanism behind modern AI. Everything else — deeper architectures, more sophisticated training methods — builds on this foundation.</p><p><em>Want to experiment further? Try modifying the input embeddings or using longer sentences. The beauty of self-attention is that it scales naturally to any sequence length — the same code works whether you have 5 words or 500.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9508b9c122eb" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Running LLMs on Modal: GPU-Powered Inference That Scales to Zero]]></title>
            <link>https://pguso.medium.com/running-llms-on-modal-gpu-powered-inference-that-scales-to-zero-635854513557?source=rss-b1afabf2359d------2</link>
            <guid isPermaLink="false">https://medium.com/p/635854513557</guid>
            <category><![CDATA[cloud-services]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <category><![CDATA[serverless]]></category>
            <category><![CDATA[gpu]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[Patric]]></dc:creator>
            <pubDate>Mon, 22 Sep 2025 16:28:08 GMT</pubDate>
            <atom:updated>2025-09-22T16:28:08.545Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ppd4FV8yjM4wZCwrTaFddw.jpeg" /></figure><h4><em>From expensive 24/7 GPU servers to pay-per-token cloud inference — the serverless revolution hits AI</em></h4><p>Running Large Language Models (LLMs) in production has traditionally meant one thing: expensive, always-on GPU servers burning through your budget even when no one’s using them. AWS charges you $3+ per hour for a decent GPU instance whether it’s processing requests or sitting idle. Scale that across multiple models or regions, and you’re looking at thousands of dollars monthly before serving your first user.</p><p>What if you could run the same powerful LLMs but only pay for the exact seconds you use them? What if your inference infrastructure could scale from zero to handling hundreds of concurrent requests automatically? And what if setting this up took minutes, not weeks of DevOps work?</p><p>That’s exactly what we’ll build in this guide using Modal and llama.cpp. 
We’ll deploy a production-ready LLM inference service that streams tokens in real time, automatically scales based on demand, and costs a fraction of traditional GPU hosting.</p><h3>Why Modal + llama.cpp is a Game Changer</h3><p><strong>Traditional LLM Hosting Problems:</strong></p><ul><li>GPU servers cost $2,000-$5,000+ per month, even when idle</li><li>Complex setup with Docker, Kubernetes, load balancers</li><li>Manual scaling means either wasted resources or poor performance</li><li>Long cold start times when scaling up</li><li>Managing CUDA versions, drivers, and dependencies</li></ul><p><strong>Our Modal Solution:</strong></p><ul><li>Pay only for inference time (seconds, not hours)</li><li>Automatic scaling from zero to as many containers as demand requires</li><li>Cold starts of about 2.5 seconds or less with GPU snapshots</li><li>Zero infrastructure management</li><li>Built-in streaming and parallel request handling</li></ul><p>Let’s build it.</p><h3>The Complete LLM Inference Setup</h3><p>Here’s our full implementation that handles everything from model downloading to streaming inference:</p><pre>from typing import Optional<br>from pathlib import Path<br>import modal<br><br>app = modal.App(&quot;llms-llama-cpp&quot;)<br># Configuration<br>MODEL = &quot;Qwen3-Coder-30B-A3B-Instruct-Q2_K.gguf&quot;<br>GPU_CONFIG = &quot;A10&quot;<br>LLAMA_CPP_RELEASE = &quot;b4568&quot;<br>MINUTES = 60<br># CUDA environment setup<br>cuda_version = &quot;12.4.0&quot;<br>flavor = &quot;devel&quot;  # includes full CUDA toolkit<br>operating_sys = &quot;ubuntu22.04&quot;<br>tag = f&quot;{cuda_version}-{flavor}-{operating_sys}&quot;<br># Build the inference image with llama.cpp<br>image = (<br>    modal.Image.from_registry(f&quot;nvidia/cuda:{tag}&quot;, add_python=&quot;3.12&quot;)<br>    .apt_install(&quot;git&quot;, &quot;build-essential&quot;, &quot;cmake&quot;, &quot;curl&quot;, &quot;libcurl4-openssl-dev&quot;)<br>    .run_commands(&quot;git clone https://github.com/ggerganov/llama.cpp&quot;)<br>    .run_commands(<br> 
       &quot;cmake llama.cpp -B llama.cpp/build &quot;<br>        &quot;-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON &quot;<br>    )<br>    .run_commands(<br>        &quot;cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli&quot;<br>    )<br>    .run_commands(&quot;cp llama.cpp/build/bin/llama-* llama.cpp&quot;)<br>    .entrypoint([])<br>)</pre><p>This image setup is doing some heavy lifting:</p><ul><li>Starts with NVIDIA’s CUDA development image</li><li>Compiles llama.cpp from source with CUDA support</li><li>Builds the optimized CLI tools we need for inference</li></ul><h3>Persistent Model Storage</h3><p>The key to cost-effective LLM serving is avoiding expensive model re-downloads. Modal Volumes solve this perfectly:</p><pre># Persistent storage for our models<br>model_cache = modal.Volume.from_name(&quot;llamacpp-cache&quot;, create_if_missing=True)<br>cache_dir = &quot;/root/.cache/llama.cpp&quot;<br><br>download_image = (<br>    modal.Image.debian_slim(python_version=&quot;3.11&quot;)<br>    .pip_install(&quot;huggingface_hub[hf_transfer]==0.26.2&quot;)<br>    .env({&quot;HF_HUB_ENABLE_HF_TRANSFER&quot;: &quot;1&quot;})<br>)<br>@app.function(<br>    image=download_image,<br>    volumes={cache_dir: model_cache},<br>    timeout=1 * MINUTES<br>)<br>def download_model(repo_id, allow_patterns, revision: Optional[str] = None):<br>    from huggingface_hub import snapshot_download<br>    print(f&quot;🦙 downloading model from {repo_id} if not present&quot;)<br>    snapshot_download(<br>        repo_id=repo_id,<br>        revision=revision,<br>        local_dir=cache_dir,<br>        allow_patterns=allow_patterns,<br>    )<br>    model_cache.commit()  # persist to Modal Volume<br>    print(&quot;🦙 model loaded&quot;)</pre><p><strong>Why this matters:</strong></p><ul><li>Models are downloaded once and shared across all function instances</li><li>No bandwidth costs for repeated downloads</li><li>Faster cold starts 
since model weights are already available</li><li>Volume persists even when no functions are running (zero cost when idle)</li></ul><h3>Real-Time Streaming Inference</h3><p>Here’s where the magic happens — streaming tokens as they’re generated:</p><pre>@app.function(<br>    image=image,<br>    volumes={cache_dir: model_cache},<br>    gpu=GPU_CONFIG,<br>    timeout=1 * MINUTES,<br>)<br>def llama_cpp_stream(<br>        prompt: Optional[str] = None,<br>        model = MODEL,<br>        n_predict: int = -1<br>):<br>    import subprocess<br><br>    if prompt is None:<br>        prompt = &quot;Write a Python function to calculate fibonacci numbers:&quot;<br>    args = [&quot;--threads&quot;, &quot;8&quot;]<br>    n_gpu_layers = 64  # Use GPU for maximum layers<br>    command = [<br>        &quot;/llama.cpp/llama-cli&quot;,<br>        &quot;--model&quot;, f&quot;{cache_dir}/{model}&quot;,<br>        &quot;--n-gpu-layers&quot;, str(n_gpu_layers),<br>        &quot;--prompt&quot;, prompt,<br>        &quot;--n-predict&quot;, str(n_predict),<br>    ] + args<br>    process = subprocess.Popen(<br>        command,<br>        stdout=subprocess.PIPE,<br>        stderr=subprocess.PIPE,<br>        text=True,<br>        bufsize=1,<br>        universal_newlines=True<br>    )<br>    for line in process.stdout:<br>        yield line  # Stream each token as it&#39;s generated<br>    process.wait()<br>    if process.returncode != 0:<br>        stderr = process.stderr.read()<br>        raise RuntimeError(f&quot;llama.cpp failed: {stderr}&quot;)</pre><p>This function streams tokens in real-time, giving users immediate feedback rather than waiting for the entire response.</p><h3>Batch Inference with GPU Snapshots</h3><p>For maximum performance and cost optimization, we can use Modal’s GPU snapshots:</p><pre>@app.function(<br>    image=image,<br>    volumes={cache_dir: model_cache},<br>    gpu=GPU_CONFIG,<br>    timeout=30 * MINUTES,<br>    enable_memory_snapshot=True,<br>    
experimental_options={&quot;enable_gpu_snapshot&quot;: True}<br>)<br>def llama_cpp_inference(<br>        prompt: Optional[str] = None,<br>        n_predict: int = -1,<br>):<br>    import subprocess<br><br>    if prompt is None:<br>        prompt = &quot;Explain quantum computing in simple terms:&quot;<br>    args = [&quot;--threads&quot;, &quot;8&quot;]<br>    n_gpu_layers = 64<br>    command = [<br>        &quot;/llama.cpp/llama-cli&quot;,<br>        &quot;--model&quot;, f&quot;{cache_dir}/{MODEL}&quot;,<br>        &quot;--n-gpu-layers&quot;, str(n_gpu_layers),<br>        &quot;--prompt&quot;, prompt,<br>        &quot;--n-predict&quot;, str(n_predict),<br>    ] + args<br>    print(&quot;🦙 running inference...&quot;)<br>    result = subprocess.run(<br>        command,<br>        stdout=subprocess.PIPE,<br>        stderr=subprocess.PIPE,<br>        text=True<br>    )<br>    if result.returncode != 0:<br>        raise RuntimeError(f&quot;llama.cpp failed: {result.stderr}&quot;)<br>    return result.stdout</pre><p><strong>GPU Snapshots are a game-changer:</strong></p><ul><li>Save the entire GPU memory state after model loading</li><li>Subsequent cold starts are 10x faster (~3 seconds vs 30+ seconds)</li><li>Perfect for high-traffic applications</li><li>Still pay only for actual usage time</li></ul><h3>Running Your LLM Service</h3><p>Let’s create a complete example that demonstrates both streaming and batch inference:</p><pre>import modal<br><br>download_model = modal.Function.from_name(&quot;llms-llama-cpp&quot;, &quot;download_model&quot;)<br>llama_cpp_stream = modal.Function.from_name(&quot;llms-llama-cpp&quot;, &quot;llama_cpp_stream&quot;)<br><br>try:<br>    # First, ensure the model is downloaded<br>    download_model.remote(<br>        repo_id=&quot;unsloth/gpt-oss-20b-GGUF&quot;,<br>        allow_patterns=[&quot;*Q6_K.gguf&quot;],<br>    )<br>    print(&quot;✅ Download completed successfully&quot;)<br>except Exception as e:<br>    if &quot;timeout&quot; in 
str(e).lower():<br>        print(&quot;⚠️  Download timed out, but model may still be cached&quot;)<br>        print(&quot;    Proceeding with inference...&quot;)<br>    else:<br>        print(f&quot;❌ Download failed: {e}&quot;)<br>        raise<br><br># Streaming inference<br>print(&quot;🚀 Starting streaming inference...&quot;)<br>prompt = &quot;Write a Python function to implement quicksort:&quot;<br><br>for chunk in llama_cpp_stream.remote_gen(prompt, n_predict=200):<br>    print(chunk, end=&quot;&quot;, flush=True)<br><br>print(&quot;\n&quot; + &quot;=&quot; * 50)</pre><p>This client looks the functions up by name, so the app has to be deployed first. Once it is, run the script with:</p><pre>python llm_inference.py</pre><p>You can find the final code <a href="https://gist.github.com/pguso/5e635d12710b817846b9791f638bfdcb">here</a>.</p><h3>Cost Analysis: Modal vs Traditional Hosting</h3><p>Let’s break down the real cost differences:</p><h4>Traditional GPU Server (AWS p3.2xlarge)</h4><ul><li><strong>Base Cost:</strong> $3.06/hour × 24 hours × 30 days = <strong>$2,203/month</strong></li><li><strong>Utilization:</strong> Typically 10–20% for most applications</li><li><strong>Effective Cost:</strong> $11,000-$22,000 per month of actual usage ($2,203 divided by 10–20% utilization)</li><li><strong>Scaling:</strong> Manual, slow, requires load balancers</li></ul><h4>Modal LLM Inference (Nvidia T4)</h4><ul><li><strong>Idle Cost:</strong> $0 (true serverless)</li><li><strong>Inference Cost:</strong> ~$0.0001 per second of GPU time</li><li><strong>Example Usage:</strong> 1000 inferences/day, 3 seconds each</li><li>Daily: 1000 × 3 × $0.0001 = $0.30</li><li>Monthly: $0.30 × 30 = <strong>$9.00</strong></li><li><strong>Scaling:</strong> Automatic, unlimited parallelism</li></ul><h4>Real-World Scenario (Nvidia T4)</h4><p>For a typical AI application serving 10,000 requests per month (averaging 5 seconds each):</p><ul><li><strong>Traditional:</strong> $2,203+ per month (plus setup/maintenance costs)</li><li><strong>Modal:</strong> ~$5 per month at the rate above (10,000 × 5 × $0.0001)</li><li><strong>Savings:</strong> more than 99% cost 
reduction</li></ul><h3>Performance Optimizations</h3><h4>Model Quantization</h4><p>Use appropriately quantized models for your use case:</p><pre># Quantization levels and their size/quality trade-offs<br>QUANTIZATION_OPTIONS = {<br>    &quot;Q2_K&quot;: &quot;Smallest size, fastest inference, lowest quality&quot;,<br>    &quot;Q4_K_M&quot;: &quot;Balanced size/quality&quot;,<br>    &quot;Q5_K_M&quot;: &quot;Better quality, larger size&quot;,<br>    &quot;Q8_0&quot;: &quot;Best quality, largest size&quot;<br>}</pre><h4>GPU Selection</h4><p>Choose GPUs based on your model size and performance needs:</p><pre>GPU_CONFIGS = {<br>    &quot;T4&quot;: &quot;Good for smaller models (&lt;7B parameters)&quot;,<br>    &quot;A10G&quot;: &quot;Great for medium models (7B-13B parameters)&quot;,<br>    &quot;A100&quot;: &quot;Suited to large models (30B+ parameters)&quot;<br>}</pre><h4>Parallel Processing</h4><p>Maximize throughput with parallel inference:</p><pre>def load_prompts_from_file(path):<br>    # Hypothetical helper (not part of Modal): one prompt per non-empty line<br>    with open(path) as f:<br>        return [line.strip() for line in f if line.strip()]<br><br>@app.local_entrypoint()<br>def batch_process():<br>    prompts = load_prompts_from_file(&quot;batch_requests.txt&quot;)<br><br>    # Fan out 10 prompts at a time; each prompt runs as its own remote call<br>    batch_size = 10<br>    total_batches = (len(prompts) + batch_size - 1) // batch_size  # ceiling division<br>    results = []<br><br>    for i in range(0, len(prompts), batch_size):<br>        batch = prompts[i:i+batch_size]<br>        batch_results = list(llama_cpp_inference.map(batch))<br>        results.extend(batch_results)<br><br>        print(f&quot;Processed batch {i//batch_size + 1}/{total_batches}&quot;)<br><br>    return results</pre><h3>Production Considerations</h3><h4>Error Handling and Retries</h4><pre>@app.function(<br>    image=image,<br>    volumes={cache_dir: model_cache},<br>    gpu=GPU_CONFIG,<br>    retries=3,  # Auto-retry on failures<br>)<br>def robust_inference(prompt: str, n_predict: int = 100):<br>    try:<br>        return llama_cpp_inference.local(prompt, n_predict)<br>    except Exception as e:<br>        print(f&quot;Inference failed: {e}&quot;)<br>     
   # Log error, potentially fallback to different model<br>        raise</pre><h4>Monitoring and Logging</h4><pre>@app.function(<br>    image=image,<br>    volumes={cache_dir: model_cache},<br>    gpu=GPU_CONFIG,<br>)<br>def monitored_inference(prompt: str):<br>    import time<br>    <br>    start_time = time.time()<br>    token_count = len(prompt.split())<br>    <br>    print(f&quot;Starting inference for {token_count} input tokens&quot;)<br>    <br>    result = llama_cpp_inference.local(prompt)<br>    <br>    duration = time.time() - start_time<br>    output_tokens = len(result.split())<br>    <br>    print(f&quot;Inference completed: {output_tokens} tokens in {duration:.2f}s&quot;)<br>    print(f&quot;Throughput: {output_tokens/duration:.1f} tokens/second&quot;)<br>    <br>    return {<br>        &quot;response&quot;: result,<br>        &quot;metrics&quot;: {<br>            &quot;input_tokens&quot;: token_count,<br>            &quot;output_tokens&quot;: output_tokens,<br>            &quot;duration_seconds&quot;: duration,<br>            &quot;tokens_per_second&quot;: output_tokens/duration<br>        }<br>    }</pre><h3>The Future of LLM Inference</h3><p>This Modal + llama.cpp setup represents the future of LLM deployment:</p><p><strong>Immediate Benefits:</strong></p><ul><li>90%+ cost savings compared to traditional hosting</li><li>Zero infrastructure management</li><li>Automatic scaling and load balancing</li><li>Real-time streaming capabilities</li><li>Support for multiple models and use cases</li></ul><p><strong>Long-term Advantages:</strong></p><ul><li>As Modal adds more GPU types, you get access automatically</li><li>Performance improvements in llama.cpp benefit your deployment immediately</li><li>No vendor lock-in — your code runs anywhere Modal runs</li><li>Built-in observability and debugging tools</li></ul><h3>Getting Started</h3><p>Ready to deploy your own serverless LLM inference? 
Here’s your action plan:</p><ol><li><strong>Sign up for Modal</strong> at <a href="https://modal.com">modal.com</a></li><li><strong>Install the CLI:</strong> pip install modal</li><li><strong>Set up authentication:</strong> modal token new</li><li><strong>Clone the example:</strong> Copy the code from this article</li><li><strong>Deploy:</strong> modal run llm_inference.py</li></ol><p>Within minutes, you’ll have a production-ready LLM service that scales automatically and costs a fraction of traditional GPU hosting.</p><p>The serverless revolution has finally reached AI inference. Modal makes it possible to run powerful language models with the same ease as calling a function — because that’s exactly what it is.</p><p><em>Ready to stop paying for idle GPUs? Your LLM inference service is just one decorator away.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=635854513557" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Modal: AWS Power + GPU Speed = Cloud Computing Unleashed]]></title>
            <link>https://pguso.medium.com/modal-aws-power-gpu-speed-cloud-computing-unleashed-866835f03f01?source=rss-b1afabf2359d------2</link>
            <guid isPermaLink="false">https://medium.com/p/866835f03f01</guid>
            <category><![CDATA[gpu]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[serverless]]></category>
            <category><![CDATA[modal]]></category>
            <category><![CDATA[ai-tools]]></category>
            <dc:creator><![CDATA[Patric]]></dc:creator>
            <pubDate>Mon, 22 Sep 2025 15:55:27 GMT</pubDate>
            <atom:updated>2025-09-22T15:55:27.908Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wZrcoUw_Z4oBgpLCkVekOg.jpeg" /></figure><h4>From local code to cloud GPUs with just a decorator — no Docker, no DevOps, no headaches</h4><p>Remember the last time you tried to deploy a simple Python function to AWS? The endless YAML configurations, Docker builds that mysteriously break, IAM roles that make your head spin, and don’t even get started on trying to get a GPU instance running. By the time you’re done, you’ve forgotten what you were trying to build in the first place.</p><p>Modal flips this entire experience on its head. What if deploying to the cloud was as simple as adding @app.function() above your Python function? What if you could grab a GPU-powered instance without wrestling with capacity reservations, instance types, or AMI images? What if &quot;going to production&quot; felt more like running a local script than managing a small army of cloud services?</p><p>That’s exactly what Modal delivers. It’s cloud computing designed for the modern developer who wants to build AI applications, process data, or scale compute-intensive workloads without becoming a DevOps expert first. While AWS gives you infinite flexibility (and infinite complexity), Modal gives you infinite simplicity with the power you actually need.</p><p>In this guide, we’ll show you how Modal transforms cloud development from a multi-day infrastructure project into a five-minute coding session. 
You’ll see how to go from a simple Python function to a scalable, GPU-accelerated cloud service faster than you can spin up an EC2 instance.</p><p>Let’s dive in and see why developers are calling Modal “the cloud platform that actually gets it.”</p><h3>Setup</h3><p>Head over to <a href="https://modal.com/signup">https://modal.com/signup</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hhu5Yb2XzG00VsOjxRPFxg.png" /></figure><p>The quickest way to sign up for Modal is to use your existing GitHub or Google account. You will get $5 in credit. If you add a payment method, you will get $30 in credit every month.</p><p>The screen shows instructions for setting up Modal. Here is the text version:</p><p>Run these commands to install the Python library locally and authenticate:</p><pre>pip install modal</pre><pre>python3 -m modal setup</pre><p>The first command installs the Modal client library on your computer, along with its dependencies.</p><p>The second command creates an API token by authenticating through your web browser. It will open a new tab, which you can close when you are done.</p><h3>Introduction</h3><p>Open your preferred IDE, create a file named hello_world.py, and follow along to get a first script running.</p><p>Configure an <strong>App</strong> that will run on Modal.</p><blockquote>It groups one or more Functions for atomic deployment and acts as a shared namespace. All Functions and Classes are associated with an App.</blockquote><pre>import modal<br><br>app = modal.App(&quot;example-hello-world&quot;)</pre><p>A <strong>Function</strong> runs independently and scales on its own. If it has no live inputs, it won’t use any containers or incur costs, even if its App is still deployed.</p><p>Modal lets you run code in the cloud. To get started, write a simple function that logs something to the console. 
To make it work with Modal, just add the @app.function decorator above it.</p><pre>@app.function()<br>def f():<br>    print(&quot;Hello world!&quot;)<br>    return &quot;done&quot;  # returned to the caller, so print(f.remote()) has something to show</pre><p><strong>Running the function locally, in the cloud, and in parallel</strong><br> We can call the function in three ways:</p><ul><li><strong>Locally</strong> on your own machine using f.local</li><li><strong>Remotely</strong> in the cloud using f.remote</li><li><strong>In parallel</strong> across many inputs in the cloud using f.map</li></ul><p>The example below calls the function locally and remotely inside the main function.</p><pre>@app.local_entrypoint()<br>def main():<br>    # run the function locally<br>    print(f.local())<br><br>    # run the function remotely on Modal<br>    print(f.remote())</pre><h4>Running with modal run</h4><p>When you enter modal run hello_world.py in your shell, Modal automatically starts an app, runs the main function, and shows its logs. Alongside these logs, you’ll also see logs from f — first when it runs locally, then when it runs remotely in the cloud.</p><p>This behavior comes from the @app.local_entrypoint decorator on main.</p><ul><li>It marks main as the <strong>CLI entrypoint</strong> for your Modal app.</li><li>When you call modal run, Modal knows to start from this function.</li><li>Unlike a regular Modal function (which runs in the cloud when invoked with remote or map), a local_entrypoint runs locally and can <strong>orchestrate other Modal functions</strong>.</li></ul><p>Extra capabilities of local_entrypoint:</p><ul><li><strong>Multiple entrypoints</strong>: You can define more than one and run them with modal run app_module.py::app.function_name.</li><li><strong>Argument parsing</strong>: If your entrypoint accepts arguments annotated with simple types (str, int, float, bool, datetime), Modal automatically parses CLI options. 
For example:</li></ul><pre>@app.local_entrypoint()<br>def main(foo: int, bar: str):<br>    # some_modal_function is a placeholder for any function decorated with @app.function()<br>    some_modal_function.remote(foo, bar)</pre><p>You can run it with:</p><pre>modal run app_module.py --foo 1 --bar &quot;hello&quot;</pre><p>You don’t need an explicit app.run(). Modal creates and runs the app for you when you invoke modal run.</p><p>Let’s now look at an example that runs a function in parallel:</p><pre>import modal<br><br>app = modal.App(&quot;example-map&quot;)<br><br>@app.function()<br>def f(x):<br>    print(f&quot;Processing {x}&quot;)<br>    return x * 2<br><br>@app.local_entrypoint()<br>def main():<br>    inputs = [1, 2, 3, 4]   # the function will run once for each input<br>    results = list(f.map(inputs))  # runs in parallel in the cloud<br>    print(&quot;Results:&quot;, results)</pre><p>Output (the Processing lines may appear in a different order, since the calls run in parallel):</p><pre>Processing 1<br>Processing 2<br>Processing 3<br>Processing 4<br>Results: [2, 4, 6, 8]</pre><p><strong>Notes:</strong></p><ol><li>Each element of inputs is passed to f() as an argument.</li><li><a href="https://modal.com/docs/reference/modal.Function#map">map()</a> calls the function once per input and runs those calls in parallel, scaling out containers as needed (pass several iterators to supply multiple arguments).</li><li>order_outputs=True (default) ensures results are in the same order as inputs; set it to False to get results as soon as each execution finishes.</li><li>If you want to handle errors without crashing, use return_exceptions=True.</li><li>Use <a href="https://modal.com/docs/reference/modal.Function#for_each">for_each</a> instead of map when you only need the side effects and not the return values; it waits for all executions to finish and discards the results.</li></ol><p>Modal lets you run code in the cloud as easily as running it locally — no waiting for builds, pushing containers, or switching to a web UI to check logs.</p><h4>Ephemeral Apps</h4><p>An <strong>ephemeral App</strong> is a temporary Modal app that exists only while your script is running. 
It’s created when you use modal run or app.run(), and stops automatically when the script ends or the client disconnects. You can keep it running after the script ends by passing --detach to modal run.</p><p>You can also control logs and output using modal.enable_output().</p><p>Example using app.run():</p><pre>import modal<br><br>app = modal.App(&quot;example-hello-world&quot;)<br><br>@app.function()<br>def f():<br>    print(&quot;Hello world!&quot;)<br>    return &quot;done&quot;<br><br>def main():<br>    with modal.enable_output():  # show logs and progress<br>        with app.run():  # start ephemeral app<br>            # run the function locally<br>            print(f.local())<br><br>            # run the function remotely on Modal<br>            print(f.remote())<br><br>if __name__ == &quot;__main__&quot;:<br>    main()  # start with: python hello_world.py</pre><p><strong>Explanation:</strong></p><ul><li>app.run() creates the ephemeral app inside your script.</li><li>modal.enable_output() makes logs from f() visible in the terminal.</li><li>f.local() runs the function on your machine; f.remote() runs it in the cloud.</li></ul><p>This is equivalent to running your script with modal run hello_world.py, but now fully controlled from Python. Because the script starts the app itself, main is a plain function here rather than a local_entrypoint, and you launch it with python hello_world.py.</p><h3>Configuring Resources and Environment</h3><p>Now let’s get into the real power of Modal — configuring your cloud environment without the traditional headaches.</p><h4>Custom Images and Dependencies</h4><p>Modal uses container images, but you don’t need to write Dockerfiles. 
Instead, you define your environment programmatically:</p><pre>import modal<br><br># Create a custom image with your dependencies<br>image = modal.Image.debian_slim().uv_pip_install(<br>    &quot;numpy&quot;,<br>    &quot;pandas&quot;,<br>    &quot;scikit-learn&quot;,<br>    &quot;torch&quot;,<br>)<br><br>app = modal.App(&quot;ml-example&quot;)<br><br>@app.function(image=image)<br>def train_model(data):<br>    # Import inside the function: these packages exist in the image, not necessarily locally<br>    import numpy as np<br>    import pandas as pd<br>    from sklearn.linear_model import LinearRegression<br><br>    # Your ML code here<br>    model = LinearRegression()<br>    # ... training logic<br>    return &quot;Model trained successfully!&quot;</pre><h4>GPU Configuration Made Simple</h4><p>Here’s where Modal really shines. Getting GPU access is as simple as adding a parameter. The function still needs an image that contains torch, so this example reuses the image defined above:</p><pre>@app.function(gpu=&quot;T4&quot;, image=image)  # or &quot;A10G&quot;, &quot;L40S&quot;, etc.<br>def gpu_accelerated_task():<br>    import torch<br><br>    if torch.cuda.is_available():<br>        device = torch.device(&quot;cuda&quot;)<br>        print(f&quot;Using GPU: {torch.cuda.get_device_name(0)}&quot;)<br><br>        # Your GPU-accelerated code here<br>        tensor = torch.randn(1000, 1000).to(device)<br>        result = torch.matmul(tensor, tensor)<br><br>        return f&quot;Computed on {device}&quot;<br>    else:<br>        return &quot;No GPU available&quot;</pre><p>You can find all available GPUs and their prices <a href="https://modal.com/pricing">here</a>.</p><h4>Working with Secrets and Environment Variables</h4><p>Modal handles secrets securely without exposing them in your code:</p><pre># First, create secrets in the Modal dashboard or CLI<br># modal secret create my-api-key API_KEY=your_secret_key<br><br>@app.function(secrets=[modal.Secret.from_name(&quot;my-api-key&quot;)])<br>def call_external_api():<br>    import os<br>    import requests<br>    <br>    api_key = os.environ[&quot;API_KEY&quot;]<br>    response = 
requests.get(f&quot;https://api.example.com/data?key={api_key}&quot;)<br>    return response.json()</pre><h4>Persistent Storage with Volumes</h4><p>For data that needs to persist between function calls, attach a Volume, which is mounted like a directory inside the container:</p><pre># create_if_missing=True creates the volume on first use instead of failing<br>vol = modal.Volume.from_name(&quot;my-volume&quot;, create_if_missing=True)<br><br>@app.function(volumes={&quot;/data&quot;: vol})<br>def run():<br>    with open(&quot;/data/xyz.txt&quot;, &quot;w&quot;) as f:<br>        f.write(&quot;hello&quot;)<br>    vol.commit()  # Needed to make sure all changes are persisted before exit</pre><h3>Why Modal Wins</h3><p>After working through these examples, the Modal advantage becomes clear:</p><p><strong>Traditional Cloud Deployment:</strong></p><ul><li>Write Dockerfile</li><li>Set up CI/CD pipeline</li><li>Configure load balancers</li><li>Manage auto-scaling</li><li>Monitor resource usage</li><li>Handle secrets management</li><li>Debug across multiple services</li><li>Wait for builds and deployments</li></ul><p><strong>Modal Deployment:</strong></p><ul><li>Add @app.function() decorator</li><li>Run modal run script.py</li><li>Done.</li></ul><p>Modal abstracts away the infrastructure complexity while giving you access to the full power of cloud computing, including GPUs. You focus on writing Python code that solves your problems, not managing containers and orchestration systems.</p><p>For AI developers, data scientists, and anyone building compute-intensive applications, Modal represents a fundamental shift in how we think about cloud development. It’s not just easier — it’s what cloud computing should have been from the beginning.</p><p>Ready to experience cloud computing without the pain? Head to <a href="https://modal.com">modal.com</a> and see what you can build in the next five minutes.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=866835f03f01" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Three Phases of Open Source AI: From Bigger to Smarter]]></title>
            <link>https://pguso.medium.com/the-three-phases-of-open-source-ai-from-bigger-to-smarter-e23c8f7a8930?source=rss-b1afabf2359d------2</link>
            <guid isPermaLink="false">https://medium.com/p/e23c8f7a8930</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[open-source]]></category>
            <dc:creator><![CDATA[Patric]]></dc:creator>
            <pubDate>Sat, 20 Sep 2025 17:43:17 GMT</pubDate>
            <atom:updated>2025-09-20T17:43:17.936Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Xk4ERR6Pl81eRp_Bp56Knw.jpeg" /></figure><p><em>How the AI industry learned that size isn’t everything</em></p><p>Imagine if someone told you that a 671-billion parameter AI model could run cheaper than a 70-billion parameter one. Three years ago, that would have sounded impossible. Today, it’s reality — and it represents one of the most fascinating pivots in the history of artificial intelligence.</p><p>The story of open source Large Language Models (LLMs) over the past three years isn’t just about technology; it’s about an entire industry learning, adapting, and ultimately discovering that the path to better AI isn’t always “make it bigger.”</p><p><strong>What Are LLMs and Why Do Parameters Matter?</strong></p><p>Before we dive into the evolution, let’s establish the basics. Large Language Models are AI systems trained on massive amounts of text to understand and generate human-like language. Think of them as incredibly sophisticated autocomplete systems that can write essays, answer questions, and even code.</p><p><strong>Parameters</strong> are like the “brain cells” of these models — they’re the mathematical weights that determine how the AI processes information. For the longest time, the industry believed in a simple equation: more parameters = smarter AI.</p><p>This belief drove what I call the “parameter arms race,” where companies competed to build the biggest models possible. But as we’ll see, this race had an unexpected ending.</p><p><strong>📈 Phase 1: The Scaling Race (2022-mid 2024)</strong></p><p><strong>Mentality: “More parameters = better performance”</strong></p><p>In 2022, the AI world was captivated by a simple idea: bigger is better. OpenAI’s GPT-3 had 175 billion parameters and seemed magical. The logical conclusion? 
Build even bigger models.</p><p><strong>The Giants of the Scaling Era</strong></p><p><strong>BLOOM (176B parameters) — July 2022</strong> The first truly massive open-source model, BLOOM was built by a consortium of researchers who pooled resources to create something that could compete with GPT-3. It required enormous computational power and could barely fit on the most powerful hardware available.</p><p><strong>LLaMA 1 (up to 65B parameters) — February 2023</strong> Meta’s LLaMA marked a shift toward more efficient scaling, but still followed the “bigger is better” philosophy with models ranging from 7B to 65B parameters.</p><p><strong>Falcon 180B (180B parameters) — September 2023</strong> The UAE’s Technology Innovation Institute pushed even further, creating one of the largest dense models of its time.</p><p><strong>Nemotron (340B parameters) — June 2024</strong> NVIDIA’s contribution to the arms race, another massive model requiring enormous computational resources.</p><p><strong>LLaMA 3.1 (405B parameters) — July 2024</strong> The summit of the scaling race. Meta’s 405B model required massive infrastructure and represented the pinnacle of “brute force” AI scaling.</p><p><strong>The Growing Problems</strong></p><p>As models grew larger, several critical issues became apparent:</p><p><strong>Infrastructure Nightmares</strong>: A 405B parameter model needs multiple high-end GPUs just to load into memory, let alone run efficiently. At 16-bit precision its weights alone take roughly 810 GB, more than the 640 GB of a full node of eight 80 GB GPUs. Most companies simply couldn’t afford the hardware.</p><p><strong>Astronomical Costs</strong>: Running these models for inference (generating responses) could cost thousands of dollars per day. Only the biggest tech companies could sustain this.</p><p><strong>Deployment Impossibility</strong>: Want to run a 400B model on your laptop? Forget it. Even running it in the cloud required specialized, expensive setups.</p><p><strong>Diminishing Returns</strong>: The performance gains weren’t always proportional to the size increases. 
A 400B model wasn’t necessarily twice as good as a 200B model.</p><p>By mid-2024, it was becoming clear that the scaling race was hitting fundamental limits — not of technology, but of practicality.</p><p><strong>🧠 Phase 2: The MoE Revolution (Late 2023–2024)</strong></p><p><strong>Mentality: “Smart scaling beats brute scaling”</strong></p><p>Just when it seemed like bigger models were the only path forward, a different approach emerged: the Mixture-of-Experts (MoE) architecture. This innovation would completely change how we think about model size.</p><p><strong>Understanding MoE: The Game Changer</strong></p><p>Imagine a university with 8 different professors, each an expert in a specific subject. When a student asks a question, instead of consulting all 8 professors, the university director routes the question to just the 2 most relevant experts. The university has the knowledge capacity of 8 professors but only pays the consultation cost of 2.</p><p>That’s essentially how MoE models work. They have multiple “expert” networks, but for each token a small router network activates only a subset of them.</p><p><strong>The MoE Pioneers</strong></p><p><strong>Mixtral 8x7B (47B total, 13B active) — December 2023</strong> Mistral AI’s breakthrough model proved the MoE concept worked in practice. With 8 experts, of which 2 are activated for each token, it has the capacity of a 47B model (less than 8 × 7B = 56B, because the experts share the attention layers) but uses only about 13B parameters for any single prediction. The result? Near-GPT-3.5 performance at a fraction of the computational cost.</p><p><strong>DeepSeek-V2 (236B total, 21B active) — May 2024</strong> Chinese AI lab DeepSeek pushed MoE further, creating a model with massive capacity that remained surprisingly efficient to run.</p><p><strong>DeepSeek-V3 (671B total, 37B active) — December 2024</strong> The crown jewel of MoE evolution. Despite having more total parameters than almost any open model before it, DeepSeek-V3 needs less compute per token than many smaller dense models. 
It’s like having a massive library but only needing to read the relevant books for each question.</p><p><strong>The Breakthrough Insight</strong></p><p>MoE models revealed a crucial insight: <strong>you can separate model capacity from computational cost</strong>. Traditional thinking assumed these were locked together — more capacity always meant more compute. MoE proved this wrong.</p><p>Suddenly, the parameter count became a misleading metric. DeepSeek-V3’s 671B parameters sound massive, but it only uses 37B per token, so it needs less compute per token than a dense 70B model, although all 671B parameters must still be held in memory.</p><p><strong>⚡ Phase 3: The Efficiency Era (2024–2025)</strong></p><p><strong>Mentality: “Right-sized models for real-world deployment”</strong></p><p>The third phase represents the maturation of the field. The industry stopped asking “How big can we make it?” and started asking “How efficiently can we solve real problems?”</p><p><strong>The Small Model Renaissance</strong></p><p><strong>Mistral 7B — The David Among Goliaths</strong> Released in September 2023, well before the shift took hold, this 7.3B parameter model shocked the AI world by outperforming larger competitors such as Llama 2 13B on the benchmarks Mistral reported. How? Superior training data, better algorithms, and focused optimization. It proved that smart engineering could beat brute force.</p><p><strong>Microsoft’s Phi Series — Proving Small Can Be Mighty</strong></p><ul><li>Phi-2 (2.7B parameters): Outperformed models 10x its size on several benchmarks</li><li>Phi-3 (3.8B-14B parameters): Continued the trend of efficient small models</li></ul><p>These models showed that with the right approach, you could achieve excellent performance while being deployable on everything from cloud servers to laptops.</p><p><strong>Qwen 2.5 — The Complete Spectrum Approach</strong> Alibaba’s Qwen series offered models from 0.5B to 72B parameters, recognizing that different applications need different capabilities. 
A chatbot for customer service doesn’t need the same power as a research assistant.</p><p><strong>Why the Industry Pivoted</strong></p><p>Several factors drove this shift toward efficiency:</p><p><strong>1. Deployment Reality Check</strong></p><p>Companies realized that 405B parameter models, while impressive, were impractical for most real-world applications. The infrastructure costs alone made them accessible only to tech giants with massive resources.</p><p><strong>The Math Problem</strong>: Running a 405B model requires multiple A100 or H100 GPUs, each costing $10,000-$40,000. Most businesses couldn’t justify this expense.</p><p><strong>The Accessibility Issue</strong>: If only a handful of companies can afford to run your AI model, you’re not democratizing AI — you’re creating an exclusive club.</p><p><strong>2. The MoE Breakthrough</strong></p><p>MoE models proved that architectural innovation could be more powerful than simply adding parameters. DeepSeek-V3 demonstrated that a 671B parameter model could run more efficiently than traditional 70B models — a complete paradigm shift.</p><p><strong>3. Training Efficiency Advances</strong></p><p>The industry developed better ways to train models:</p><p><strong>Quality Over Quantity</strong>: Instead of feeding models more data, researchers focused on higher-quality, more carefully curated datasets.</p><p><strong>Improved Techniques</strong>: New training methods like better tokenization, improved attention mechanisms, and more efficient architectures allowed smaller models to achieve better results.</p><p><strong>Specialized Training</strong>: Models began being trained for specific tasks rather than trying to be everything to everyone.</p><p><strong>4. 
Market Demands</strong></p><p>Real-world deployment requirements drove the efficiency push:</p><p><strong>Edge Computing</strong>: Companies wanted AI that could run on phones, tablets, and edge devices, not just massive server farms.</p><p><strong>Cost-Conscious Enterprises</strong>: Businesses needed AI solutions that fit their budgets, not just their ambitions.</p><p><strong>Developer-Friendly Models</strong>: Programmers wanted models they could actually experiment with and deploy, not models that required a PhD to operate.</p><p><strong>The Current State: Beyond the Parameter Wars</strong></p><p>Today’s AI landscape looks radically different from the scaling race of 2022–2024. The winners aren’t necessarily the biggest models, but the smartest ones.</p><p><strong>The New Champions</strong></p><p><strong>DeepSeek-V3</strong>: Represents the peak of MoE efficiency — massive capacity when needed, efficient operation always.</p><p><strong>Qwen 2.5</strong>: Offers the complete spectrum from ultra-lightweight (0.5B) to flagship (72B), recognizing that one size doesn’t fit all.</p><p><strong>Mistral 7B</strong>: Continues to punch above its weight class, proving that focused optimization beats raw scaling.</p><p><strong>Mixtral Series</strong>: Pioneered open-source MoE and continues to push the boundaries of efficient scaling.</p><p><strong>The New Success Metrics</strong></p><p>The industry has moved beyond simple parameter counting to more meaningful metrics:</p><p><strong>Performance per Dollar</strong>: How much capability do you get for your compute budget?</p><p><strong>Deployment Feasibility</strong>: Can real companies actually use this model?</p><p><strong>Task-Specific Optimization</strong>: How well does it perform on the specific tasks that matter?</p><p><strong>Active vs. 
Total Parameters</strong>: For MoE models, what matters is not total capacity but active compute per inference.</p><p><strong>Looking Forward: The Post-Scaling Era</strong></p><p>The parameter wars taught the AI industry valuable lessons, and the future reflects this newfound wisdom:</p><p><strong>Smart Architectures Over Brute Force</strong></p><p>The future belongs to innovations like MoE, efficient attention mechanisms, and other architectural advances that maximize capability while minimizing computational requirements.</p><p><strong>Right-Sized Models for Specific Use Cases</strong></p><p>Instead of one massive model trying to do everything, we’re seeing specialized models optimized for particular tasks — coding assistants, writing helpers, scientific research tools, and more.</p><p><strong>Deployment-First Thinking</strong></p><p>New models are being designed with real-world deployment in mind from day one, not as an afterthought.</p><p><strong>Democratization Through Efficiency</strong></p><p>By making models more efficient, the industry is making AI more accessible to smaller companies, researchers, and individual developers.</p><p><strong>Why This Evolution Matters</strong></p><p>The three-phase evolution of open source LLMs represents more than just technical progress — it’s a story about an industry learning to think differently about innovation.</p><p><strong>For Businesses</strong>: The shift toward efficiency means AI is becoming more accessible and affordable. 
You no longer need Google-sized budgets to deploy capable AI systems.</p><p><strong>For Developers</strong>: Efficient models mean you can experiment, iterate, and deploy AI solutions without massive infrastructure investments.</p><p><strong>For Society</strong>: More efficient AI means broader access, which leads to more innovation and more diverse applications.</p><p><strong>For the Future</strong>: The lessons learned from the parameter wars are shaping a more sustainable, practical approach to AI development.</p><p>The story of LLM evolution proves that in technology, as in life, bigger isn’t always better — smarter usually is. The parameter wars are over, and the efficiency era has just begun.</p><p><em>The AI industry’s journey from “bigger is better” to “smarter is better” mirrors many technological revolutions. Just as the computer industry moved from room-sized mainframes to powerful laptops, AI is learning that true progress comes not from brute force, but from elegant solutions to real problems.</em></p><p><strong>Key Takeaways</strong></p><ol><li><strong>Size isn’t everything</strong>: The AI industry learned that more parameters don’t automatically mean better performance</li><li><strong>Architecture matters</strong>: Innovations like MoE can provide massive capacity improvements without proportional cost increases</li><li><strong>Deployment drives design</strong>: Real-world constraints shape what models succeed in the market</li><li><strong>Efficiency enables access</strong>: More efficient models democratize AI by making it accessible to more organizations</li><li><strong>The future is specialized</strong>: Rather than one massive model for everything, we’re moving toward right-sized models for specific use cases</li></ol><p>The three phases of LLM evolution show us that the most important innovations often come not from doing more of the same thing, but from fundamentally rethinking the problem.</p><img 
src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e23c8f7a8930" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>