Stories by Milvus on Medium

Beyond the TurboQuant-RaBitQ Debate: Why Vector Quantization Matters for AI Infrastructure Costs

Milvus — Thu, 18 Jun 2026 10:28:38 GMT

Google’s TurboQuant paper (ICLR 2026) reported 6x KV cache compression with near-zero accuracy loss — results striking enough to wipe $90 billion off memory chip stocks in a single day. SK Hynix dropped 12%. Samsung dropped 7%.

The paper quickly drew scrutiny. Jianyang Gao, first author of RaBitQ (SIGMOD 2024), raised questions about the relationship between TurboQuant’s methodology and his prior work on vector quantization. (We’ll be publishing a conversation with Dr. Gao soon — follow us if you’re interested.)

This article isn’t about taking sides in that discussion. What struck us is something bigger: the fact that a single vector quantization paper could move $90 billion in market value tells you how critical this technology has become for AI infrastructure. Whether it’s compressing KV cache in inference engines or compressing indexes in vector databases, the ability to shrink high-dimensional data while preserving quality has enormous cost implications. It’s also a problem we’ve been working on: integrating RaBitQ into the Milvus vector database and turning it into production infrastructure.

Here’s what we’ll cover: why vector quantization matters so much right now, how TurboQuant and RaBitQ compare, what RaBitQ is and how it works, the engineering work behind shipping it inside Milvus, and what the broader memory optimization landscape looks like for AI infrastructure.

Why Does Vector Quantization Matter for Infrastructure Costs?

Vector quantization is not new. What’s new is how urgently the industry needs it. Over the past two years, LLM parameters have ballooned, context windows have stretched from 4K to 128K+ tokens, and unstructured data — text, images, audio, video — has become a first-class input to AI systems. Every one of these trends creates more high-dimensional vectors that need to be stored, indexed, and searched. More vectors, more memory, more cost.

If you’re running vector search at scale (RAG pipelines, recommendation engines, multimodal retrieval), memory cost is likely one of your biggest infrastructure headaches.

During model deployment, every major LLM inference stack relies on KV cache: storing previously computed key-value pairs so the attention mechanism doesn’t recompute them for every new token. It’s what keeps the attention cost of each new token at O(n) instead of O(n²). Every framework from vLLM to TensorRT-LLM depends on it. But KV cache can consume more GPU memory than the model weights themselves. Add longer contexts and more concurrent users, and it spirals fast.
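A back-of-the-envelope sketch of why KV cache grows so quickly. The model shape below (32 layers, 32 KV heads, head dimension 128, FP16 values) is a hypothetical configuration chosen for illustration, not a figure from any benchmark in this article.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token, per sequence in the batch."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape (an assumption, not taken from the article).
per_token = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=1)
full_context = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=128 * 1024)

print(per_token)             # 524288 bytes, i.e. 0.5 MiB per token
print(full_context / 2**30)  # 64.0 GiB for a single 128K-token sequence
```

Under these assumptions, one 128K-token sequence needs more cache than the FP16 weights of a model several times this size, and the figure multiplies again with every concurrent user.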

The same pressure hits vector databases: billions of high-dimensional vectors sitting in memory, each stored as one 32-bit float per dimension. Vector quantization compresses these vectors from 32-bit floats down to 4-bit, 2-bit, or even 1-bit representations, shrinking memory by roughly 87% to 97%. Whether it’s KV cache in your inference engine or indexes in your vector database, the underlying math is the same, and the cost savings are real. That’s why a single paper reporting a breakthrough in this space moved $90 billion in stock market value.
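To make the compression arithmetic concrete, here is a small sketch that sizes the raw vector payload at different bit widths. It uses the 10-million-vector, 768-dimension shape discussed later in this article and counts only the codes themselves; auxiliary per-vector floats and index structures are deliberately left out.

```python
def index_bytes(num_vectors, dim, bits_per_dim):
    """Raw storage for the vector codes alone, in bytes."""
    return num_vectors * dim * bits_per_dim // 8


n, d = 10_000_000, 768
full = index_bytes(n, d, 32)  # FP32 baseline

for bits in (32, 4, 2, 1):
    size = index_bytes(n, d, bits)
    saving = 1 - size / full
    print(f"{bits:>2}-bit: {size / 1e9:6.2f} GB, saving {saving:.1%}")
```

The FP32 baseline is 30.72 GB and the 1-bit codes are 0.96 GB, which is the 32x ratio quoted for binary quantization.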

TurboQuant vs RaBitQ: What’s the Difference?

Both TurboQuant and RaBitQ build on the same foundational technique: applying a random rotation (Johnson-Lindenstrauss transform) to input vectors before quantization. This rotation transforms irregularly distributed data into a predictable uniform distribution, making it easier to quantize with low error.

Beyond that shared foundation, the two target different problems and take different approaches.

The key difference for practitioners: TurboQuant optimizes the transient KV cache inside an inference engine, while RaBitQ targets the persistent indexes that a vector database builds, shards, and queries across billions of vectors. For the rest of this article, we’ll focus on RaBitQ — the algorithm we’ve integrated and ship in production inside Milvus.

What Is RaBitQ and What Does It Deliver?

Here’s the bottom line first: on a 10-million vector dataset at 768 dimensions, RaBitQ compresses each vector to 1/32 of its original size while keeping recall above 94%. In Milvus, that translates to 3.6x higher query throughput than a full-precision index. This isn’t a theoretical projection — it’s a benchmark result from Milvus 2.6.

Now, how it gets there.

Traditional binary quantization compresses FP32 vectors to 1 bit per dimension — 32x compression. The tradeoff: recall collapses because you’ve thrown away too much information. RaBitQ (Gao & Long, SIGMOD 2024) keeps the same 32x compression but preserves the information that actually matters for search. An extended version (Gao & Long, SIGMOD 2025) proves this is asymptotically optimal, matching the theoretical lower bounds established by Alon & Klartag (FOCS 2017).

Why Do Angles Matter More Than Coordinates in High Dimensions?

The key insight: in high dimensions, angles between vectors are more stable and informative than individual coordinate values. This is a consequence of measure concentration — the same phenomenon that makes Johnson-Lindenstrauss random projections work.

What this means in practice: you can discard the exact coordinate values of a high-dimensional vector and keep only its direction relative to the dataset. The angular relationships — which is what nearest-neighbor search actually depends on — survive the compression.

How Does RaBitQ Work?

RaBitQ turns this geometric insight into three steps:

Step 1: Normalize. Center each vector relative to the dataset centroid and scale to unit length. This converts the problem to inner-product estimation between unit vectors — easier to analyze and bound.

Step 2: Random rotation + hypercube projection. Apply a random orthogonal matrix (a Johnson-Lindenstrauss-type rotation) to remove bias toward any axis. Project each rotated vector onto the nearest vertex of a {±1/√D}^D hypercube. Each dimension collapses to a single bit. The result: a D-bit binary code per vector.

Step 3: Unbiased distance estimation. Construct an estimator for the inner product between a query and the original (unquantized) vector. The estimator is provably unbiased with error bounded by O(1/√D). For 768-dimensional vectors, this keeps recall above 94%.
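The three steps can be sketched in plain Python. This is a toy illustration under stated assumptions, not the RaBitQ Library or the Knowhere implementation: the rotation is built with Gram-Schmidt on Gaussian rows, the dimension is only 64, all function names are hypothetical, and the estimator works on centered, normalized vectors (recovering real distances would also need the stored norms).

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def random_rotation(dim, rng):
    """Random orthogonal matrix: Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            p = dot(v, r)
            v = [x - p * y for x, y in zip(v, r)]
        rows.append(normalize(v))
    return rows


def rotate(P, v):
    return [dot(row, v) for row in P]


def encode(P, vec, centroid):
    # Step 1: center on the dataset centroid and scale to unit length.
    o = normalize([x - c for x, c in zip(vec, centroid)])
    # Step 2: rotate, then keep only the sign of each coordinate (1 bit each).
    o_rot = rotate(P, o)
    bits = [1 if x >= 0 else 0 for x in o_rot]
    # Inner product between the code (signs / sqrt(D)) and the rotated vector.
    align = sum(abs(x) for x in o_rot) / math.sqrt(len(o_rot))
    return bits, align


def estimate(P, bits, align, query, centroid):
    # Step 3: estimate <o, q> as <code, q_rot> / <code, o_rot>.
    q_rot = rotate(P, normalize([x - c for x, c in zip(query, centroid)]))
    ip = sum(x if b else -x for x, b in zip(q_rot, bits)) / math.sqrt(len(bits))
    return ip / align


rng = random.Random(0)
D = 64
data = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(20)]
centroid = [sum(col) / len(data) for col in zip(*data)]
P = random_rotation(D, rng)

bits, align = encode(P, data[0], centroid)
query = [rng.gauss(0.0, 1.0) for _ in range(D)]
est = estimate(P, bits, align, query, centroid)
true = dot(normalize([x - c for x, c in zip(data[0], centroid)]),
           normalize([x - c for x, c in zip(query, centroid)]))
print(f"estimated {est:+.3f}, exact {true:+.3f}")
```

Dividing by the stored alignment term is what corrects the bias of plain sign quantization; in this toy setup, estimating a vector against itself returns exactly 1.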

Distance computation between binary vectors reduces to bitwise AND + popcount, operations that modern CPUs execute in one to a few cycles each and can vectorize across wide registers. This is what makes RaBitQ fast, not just small.
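A minimal sketch of how AND + popcount can replace a multiply-accumulate loop. It assumes the query has already been quantized to small unsigned integers and split into bit planes, which is one common way to set this up; the scaling and offsets that map real-valued queries and sign codes onto these integers are omitted, and the helper names are hypothetical.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pack_bits(bits):
    """Pack a list of 0/1 values into one integer (bit i holds bits[i])."""
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    return word


def bit_planes(values, num_bits):
    """Split small unsigned ints into num_bits packed bit planes."""
    return [pack_bits([(v >> j) & 1 for v in values]) for j in range(num_bits)]


def binary_dot(code_word, query_planes):
    """sum_i code[i] * query[i], using only AND, popcount, and shifts."""
    return sum(popcount(code_word & plane) << j
               for j, plane in enumerate(query_planes))


code = [1, 0, 1, 1, 0, 1, 0, 0]      # 1-bit data code
query = [3, 15, 7, 0, 9, 4, 12, 1]   # 4-bit quantized query

print(binary_dot(pack_bits(code), bit_planes(query, 4)))  # 14
```

Each bit plane costs one AND and one popcount over the whole packed word, so a 4-bit query needs four such passes regardless of how many dimensions fit in the register.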

Why Is RaBitQ Practical, Not Just Theoretical?

No training required. Apply the rotation, check signs. No iterative optimization, no codebook learning. Indexing time is comparable to product quantization.
Hardware-friendly. Distance computation is bitwise AND + popcount. Modern CPUs (Intel IceLake+, AMD Zen 4+) have dedicated AVX-512 VPOPCNTDQ instructions. Single-vector estimation runs 3x faster than PQ lookup tables.
Multi-bit flexibility. The RaBitQ Library supports variants beyond 1-bit: 4-bit achieves ~90% recall, 5-bit ~95%, 7-bit ~99% — all without reranking.
Composable. Plugs into existing index structures like IVF indexes and HNSW graphs, and works with FastScan for batch distance computation.

From Paper to Production: What We Built to Ship RaBitQ in Milvus

The original RaBitQ code is a single-machine research prototype. Making it work across a distributed cluster with sharding, failover, and real-time ingestion required solving four engineering problems. At Zilliz, we went beyond simply implementing the algorithm — the work spanned engine integration, hardware acceleration, index optimization, and runtime tuning to turn RaBitQ into an industrial-grade capability inside Milvus. You can find more details in this blog as well: Bring Vector Compression to the Extreme: How Milvus Serves 3× More Queries with RaBitQ

Making RaBitQ Distributed-Ready

We integrated RaBitQ directly into Knowhere, Milvus’s core search engine — not as a plugin, but as a native index type with unified interfaces. It works with Milvus’s full distributed architecture: sharding, partitioning, dynamic scaling, and collection management.

The key challenge: making the quantization codebook (rotation matrix, centroid vectors, scaling parameters) segment-aware, so that each shard builds and stores its own quantization state. Index builds, compactions, and load-balancing all understand the new index type natively.

Squeezing Every Cycle Out of Popcount

RaBitQ’s speed comes from popcount — counting set bits in binary vectors. The algorithm is inherently fast, but how much throughput you get depends on how well you use the hardware. We built dedicated SIMD code paths for both dominant server architectures:

x86 (Intel IceLake+ / AMD Zen 4+): AVX-512’s VPOPCNTDQ instruction computes popcount across multiple 512-bit registers in parallel. Knowhere’s inner loops are restructured to batch binary distance computations into SIMD-width chunks, maximizing throughput.
ARM (Graviton, Ampere): SVE (Scalable Vector Extension) instructions for the same parallel popcount pattern — critical since ARM instances are increasingly common in cost-optimized cloud deployments.

Eliminating Runtime Overhead

RaBitQ needs auxiliary floating-point parameters at query time: the dataset centroid, per-vector norms, and the inner product between each quantized vector and its original (used by the distance estimator). Computing these per query adds latency. Storing the full original vectors defeats the purpose of compression.

Our solution: pre-compute and persist these parameters during index build, caching them alongside the binary codes. The memory overhead is small (a few floats per vector), but it eliminates per-query computation and keeps latency stable under high concurrency.

IVF_RABITQ: The Index You Actually Deploy

Starting with Milvus 2.6, we ship IVF_RABITQ — Inverted File Index + RaBitQ quantization. The search works in two stages:

Coarse search (IVF). K-means partitions the vector space into clusters. At query time, only the nprobe closest clusters are scanned.
Fine scoring (RaBitQ). Within each cluster, distances are estimated using 1-bit codes and the unbiased estimator. Popcount does the heavy lifting.
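The two stages can be sketched as follows. This is a toy under stated assumptions: centroids are given rather than learned by k-means, and the fine-scoring function is exact squared distance standing in for the 1-bit estimator, which would be passed in through the hypothetical score parameter.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign every vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe, k, score=sq_dist):
    # Coarse stage: pick the nprobe clusters closest to the query.
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    # Fine stage: score only the vectors inside those clusters.
    candidates = [vid for c in probe for vid in lists[c]]
    return sorted(candidates, key=lambda vid: score(query, vectors[vid]))[:k]


centroids = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
vectors = [(0.5, 0.2), (9.0, 9.5), (1.0, 9.0), (0.1, 0.9), (10.5, 10.0)]
lists = build_ivf(vectors, centroids)

top = ivf_search((0.2, 0.5), vectors, centroids, lists, nprobe=1, k=2)
print(top)  # [3, 0]
```

Raising nprobe widens the candidate set, trading latency for recall; the quantized scorer only affects how cheaply each candidate is evaluated.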

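The results on a 768-dimensional, 10-million vector dataset: IVF_RABITQ reaches 94.7% recall at 3.6x the throughput of IVF_FLAT, with memory usage at roughly 1/32 of the original vectors.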

For workloads that can’t tolerate even a 0.5% recall gap, the refine_type parameter adds a second scoring pass: SQ6, SQ8, FP16, BF16, or FP32. SQ8 is the common choice — it restores recall to IVF_FLAT levels at roughly 1/4 the original memory. You can also apply scalar quantization to the query side (SQ1–SQ8) independently, giving you two knobs to tune the latency-recall-cost tradeoff per workload.
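As a rough sketch, the index and search settings might be assembled like this. Only the index type, refine_type, its allowed values, and nprobe come from the text above; the other key names (nlist, refine, metric_type, rbq_query_bits, refine_k) and all numeric values are assumptions that should be checked against the IVF_RABITQ documentation before use. The helper function is hypothetical and only builds plain dictionaries.

```python
# Refinement types listed in the text above.
ALLOWED_REFINE_TYPES = {"SQ6", "SQ8", "FP16", "BF16", "FP32"}


def ivf_rabitq_index_params(nlist, refine_type=None, metric_type="L2"):
    """Build an index-parameter dict (key names are assumptions, see lead-in)."""
    params = {"nlist": nlist}
    if refine_type is not None:
        if refine_type not in ALLOWED_REFINE_TYPES:
            raise ValueError(f"unsupported refine_type: {refine_type!r}")
        params["refine"] = True
        params["refine_type"] = refine_type
    return {"index_type": "IVF_RABITQ", "metric_type": metric_type, "params": params}


index_params = ivf_rabitq_index_params(nlist=1024, refine_type="SQ8")

# Query-side knobs: clusters to probe, optional query quantization bits,
# and how many extra candidates the refine pass rescores (assumed names).
search_params = {"params": {"nprobe": 128, "rbq_query_bits": 0, "refine_k": 1}}

print(index_params)
```

Keeping the refine settings on the index side and the query-bit setting on the search side mirrors the two independent knobs described above.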

How Milvus Optimizes Memory Beyond Quantization

RaBitQ is the most dramatic compression lever, but it’s one layer in a broader memory optimization stack:

Full-Stack Quantization

RaBitQ’s 1-bit extreme compression isn’t the right fit for every workload. Milvus offers a complete quantization matrix: SQ8 and product quantization (PQ) for workloads that need a balanced precision-cost tradeoff, RaBitQ for maximum compression on ultra-large datasets, and hybrid configurations that combine multiple methods for fine-grained control.

Index Structure Optimization

Beyond quantization, Milvus continuously optimizes memory overhead in its core index structures. For HNSW, we reduced adjacency list redundancy to lower per-graph memory usage. DiskANN pushes both vector data and index structures to SSD, dramatically reducing DRAM dependency for large datasets. We also optimized intermediate memory allocation during index building to prevent OOM failures when building indexes over datasets that approach node memory limits.

Smart Memory Loading

Milvus’s mmap (memory-mapped I/O) support maps vector data to disk files, relying on the OS page cache for on-demand loading — no need to load all data into memory at startup. Combined with lazy loading and segmented loading strategies that prevent sudden memory spikes, this enables smooth operation with TB-scale vector datasets at a fraction of the memory cost.
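The general mechanism can be shown with the Python standard library. This is an illustration of memory-mapped access in general, not Milvus’s storage format: the file layout (fixed-size little-endian float32 records) and the helper names are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

DIM = 4
VEC_BYTES = DIM * 4  # float32 per dimension


def write_vectors(path, vectors):
    """Store vectors back to back as little-endian float32 records."""
    with open(path, "wb") as f:
        for v in vectors:
            f.write(struct.pack(f"<{DIM}f", *v))


def read_vector(mm, index):
    """Decode one record; only the pages it touches need to be resident."""
    start = index * VEC_BYTES
    return struct.unpack(f"<{DIM}f", mm[start:start + VEC_BYTES])


path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
write_vectors(path, [[float(i)] * DIM for i in range(1000)])

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    vec = read_vector(mm, 742)

print(vec)  # (742.0, 742.0, 742.0, 742.0)
```

Nothing is read eagerly when the mapping is created; the operating system pages data in on first access and can evict it again under memory pressure.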

Tiered Storage

Milvus’s three-tier storage architecture spans memory, SSD, and object storage: hot data stays in memory for low latency, warm data is cached on SSD for a balance of performance and cost, and cold data sinks to object storage to minimize overhead. The system handles data scheduling automatically — no application-layer changes required.

Cloud-Native Scaling

Under Milvus’s distributed architecture, data sharding and load balancing prevent single-node memory overload. Memory pooling reduces fragmentation and improves utilization. Zilliz Cloud (fully managed Milvus) takes this further with elastic scheduling for on-demand memory scaling — in Serverless mode, idle resources are automatically released, further reducing total cost of ownership.

How These Layers Compound

These optimizations aren’t alternatives — they stack. RaBitQ shrinks the vectors. DiskANN keeps the index on SSD. Mmap avoids loading cold data into memory. Tiered storage pushes archival data to object storage. The result: a deployment serving billions of vectors doesn’t need billions-of-vectors worth of RAM.

Get Started

As AI data volumes continue to grow, vector database efficiency and cost will directly determine how far AI applications can scale. We’ll continue investing in high-performance, low-cost vector infrastructure — so that more AI applications can move from prototype to production.

Milvus is open source. To try IVF_RABITQ:

Check the IVF_RABITQ documentation for configuration and tuning guidance.
Read the full RaBitQ integration blog post for deeper benchmarks and implementation details.
Join the Milvus Slack community to ask questions and learn from other developers.
Book a free Milvus Office Hours session to walk through your use case.

If you’d rather skip infrastructure setup, Zilliz Cloud (fully managed Milvus) offers a free tier with IVF_RABITQ support.

We’re running an upcoming interview with Professor Cheng Long (NTU, VectorDB@NTU) and Dr. Jianyang Gao (ETH Zurich), the first author of RaBitQ, where we’ll go deeper into vector quantization theory and what’s next. Drop your questions in the comments.

Frequently Asked Questions

What are TurboQuant and RaBitQ?

TurboQuant (Google, ICLR 2026) and RaBitQ (Gao & Long, SIGMOD 2024) are both vector quantization methods that use random rotation to compress high-dimensional vectors. TurboQuant targets KV cache compression in LLM inference, while RaBitQ targets persistent vector indexes in databases. Both have contributed to the current wave of interest in vector quantization, though they solve different problems for different systems.

How does RaBitQ achieve 1-bit quantization without destroying recall?

RaBitQ exploits measure concentration in high-dimensional spaces: the angles between vectors are more stable than individual coordinate values as dimensionality increases. It normalizes vectors relative to the dataset centroid, then projects each one onto the nearest vertex of a hypercube (reducing each dimension to a single bit). An unbiased distance estimator with a provable error bound keeps search accurate despite the compression.

What is IVF_RABITQ and when should I use it?

IVF_RABITQ is a vector index type in Milvus (available since version 2.6) that combines inverted file clustering with RaBitQ 1-bit quantization. It achieves 94.7% recall at 3.6x the throughput of IVF_FLAT, with memory usage at roughly 1/32 of the original vectors. Use it when you need to serve large-scale vector search (millions to billions of vectors) and memory cost is a primary concern — common in RAG, recommendation, and multimodal search workloads.

How does vector quantization relate to KV cache compression in LLMs?

Both problems involve compressing high-dimensional floating-point vectors. KV cache stores key-value pairs from the Transformer attention mechanism; at long context lengths, it can exceed the model weights in memory usage. Vector quantization techniques like RaBitQ reduce these vectors to lower-bit representations. The same mathematical principles — measure concentration, random rotation, unbiased distance estimation — apply whether you’re compressing vectors in a database index or in an inference engine’s KV cache.

Is MCP Dead? What We Learned Building with MCP, CLI, and Agent Skills

Milvus — Thu, 18 Jun 2026 10:24:46 GMT

When Perplexity’s CTO Denis Yarats said at ASK 2026 that the company was deprioritizing MCP internally, it set off the usual cycle. YC CEO Garry Tan piled on — MCP eats too much context window, auth is broken, he built a CLI replacement in 30 minutes. Hacker News ran strongly anti-MCP.

A year ago, this level of public skepticism would have been unusual. Model Context Protocol (MCP) was positioned as the definitive standard for AI agent tool integration. Server counts were doubling weekly. The pattern since then has followed a familiar arc: rapid hype, broad adoption, then production disillusionment.

The industry is responding fast. ByteDance’s Lark/Feishu open-sourced their official CLI, with 200+ commands across 11 business domains and 19 built-in Agent Skills. Google shipped gws for Google Workspace. The CLI + Skills pattern is quickly becoming the default for enterprise agent tooling, not a niche alternative.

At Zilliz, we’ve released Zilliz CLI, which lets you operate and manage Milvus and Zilliz Cloud (fully managed Milvus) directly from your terminal without leaving your coding environment. On top of that, we built Milvus Skills and Zilliz Skills so that AI coding agents like Claude Code and Codex can manage your vector database through natural language.

We also built an MCP server for Milvus and Zilliz Cloud one year ago. That experience taught us exactly where MCP breaks down — and where it still fits. Three architectural limitations pushed us toward CLI and Skills: context window bloat, passive tool design, and the inability to reuse the agent’s own LLM.

In this post, we’ll walk through each problem, show what we’re building instead, and lay out a practical framework for choosing between MCP, CLI, and Agent Skills.

MCP Eats 72% of Your Context Window at Startup

A standard MCP setup can consume around 72% of your available context window before the agent takes a single action. Connect three servers — GitHub, Playwright, and an IDE integration — on a 200K-token model, and tool definitions alone occupy roughly 143K tokens. The agent hasn’t done anything yet. It’s already three-quarters full.

The cost isn’t just tokens. The more unrelated content packed into context, the weaker the model’s focus on what actually matters. A hundred tool schemas sitting in context means the agent wades through all of them on every decision. Researchers have documented what they call context rot — degraded reasoning quality from context overload. In measured tests, tool selection accuracy dropped from 43% to below 14% as tool count increased. More tools, paradoxically, means worse tool use.

The root cause is architectural. MCP loads all tool descriptions in full at session start, regardless of whether the current conversation will ever use them. That’s a protocol-level design choice, not a bug — but the cost scales with every tool you add.

Agent skills take a different approach: progressive disclosure. At session start, an agent reads only each Skill’s metadata — name, one-line description, trigger condition. A few dozen tokens total. The full Skill content loads only when the agent determines it’s relevant. Think of it this way: MCP lines up every tool at the door and makes you choose; Skills gives you an index first, full content on demand.
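Progressive disclosure can be sketched in a few lines. The Skill class, its fields, and the keyword-based trigger are hypothetical stand-ins for illustration; in a real agent the model itself decides relevance from the metadata rather than a substring match.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str  # one-line summary, always visible
    trigger: str      # simplified relevance signal (keyword match here)
    body: str         # full procedure, loaded only when needed


def startup_context(skills):
    """What enters the context window at session start: metadata only."""
    return "\n".join(f"{s.name}: {s.description}" for s in skills)


def load_relevant(skills, user_message):
    """Pull in full Skill bodies only for Skills the message triggers."""
    msg = user_message.lower()
    return [s.body for s in skills if s.trigger in msg]


skills = [
    Skill("backup", "Back up a collection", "back up",
          "STEP 1: list collections. STEP 2: create a backup job. STEP 3: verify."),
    Skill("search", "Run a vector search", "search",
          "STEP 1: load the collection. STEP 2: embed the query. STEP 3: query top-k."),
]

print(startup_context(skills))
print(load_relevant(skills, "Please back up my collection"))
```

The startup cost grows with the number of one-line descriptions, not with the length of every procedure, which is the contrast with loading full tool schemas up front.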

CLI tools offer a similar advantage. An agent runs git --help or docker --help to discover capabilities on demand, without preloading every parameter definition. Context cost is pay-as-you-go, not upfront.

At a small scale, the difference is negligible. At production scale, it’s the difference between an agent that works and one that drowns in its own tool definitions.

MCP’s Passive Architecture Limits Agent Workflows

MCP is a tool-calling protocol: how to discover tools, call them, and receive results. Clean design for simple use cases. But that cleanness is also a constraint.

Flat Tool Space with No Hierarchy

An MCP tool is a flat function signature. No subcommands, no awareness of session lifecycle, no sense of where the agent is in a multi-step workflow. It waits to be called. That’s all it does.

A CLI works differently. git commit, git push, and git log are completely different execution paths sharing a single interface. An agent runs --help, explores the available surface incrementally, and expands only what it needs, without front-loading all the parameter documentation into context.

Skills Encode Workflow Logic — MCP Can’t

An Agent Skill is a Markdown file containing a standard operating procedure: what to do first, what to do next, how to handle failures, and when to surface something to the user. The agent receives not just a tool but an entire workflow. Skills actively shape how an agent behaves during a conversation — what triggers them, what they prepare in advance, and how they recover from errors. MCP tools can only wait.

MCP Can’t Access the Agent’s LLM

This is the limitation that actually stopped us.

When we built claude-context — an MCP plugin that adds semantic search to Claude Code and other AI coding agents, giving them deep context from an entire codebase — we wanted to retrieve relevant historical conversation snippets from Milvus and surface them as context. The vector search retrieval worked. The problem was what to do with the results.

Retrieve the top 10 results, and maybe 3 are useful. The other 7 are noise. Hand all 10 to the outer agent, and the noise interferes with the answer. In testing, we saw responses get distracted by irrelevant historical records. We needed to filter before passing results up.

We tried several approaches. Adding a reranking step inside the MCP server using a small model: not accurate enough, and the relevance threshold needed per-use-case tuning. Using a large model for reranking: technically sound, but an MCP server runs as a separate process with no access to the outer agent’s LLM. We’d have to configure a separate LLM client, manage a separate API key, and handle a separate call path.

What we wanted was simple: let the outer agent’s LLM participate directly in the filtering decision. Retrieve the top 10, let the agent itself judge what’s worth keeping, and return only the relevant results. No second model. No extra API keys.

MCP can’t do this. The process boundary between server and agent is also an intelligence boundary. The server can’t use the agent’s LLM; the agent can’t govern what happens inside the server. Fine for simple CRUD tools. The moment a tool needs to make a judgment call, that isolation becomes a real constraint.

An Agent Skill solves this directly. A retrieval Skill can call vector search for the top 10, have the agent’s own LLM assess relevance, and return only what passes. No additional model. The agent does the filtering itself.
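The shape of that retrieve-then-judge flow, as a minimal sketch. The search and judge callables are hypothetical placeholders: in practice the first would be a vector search against Milvus and the second would be the agent’s own LLM deciding relevance, not the keyword check used here to keep the example self-contained.

```python
def retrieve_and_filter(query, search, judge, top_k=10):
    """Broad retrieval first, then keep only what the judge accepts."""
    hits = search(query, top_k)
    return [hit for hit in hits if judge(query, hit)]


# Stand-in data and callables, for illustration only.
memory = [
    "fixed the Milvus connection timeout",
    "lunch order for Friday",
    "Milvus index build ran out of memory",
    "renamed the CI job",
]


def fake_search(query, top_k):
    return memory[:top_k]


def fake_judge(query, hit):
    return "milvus" in hit.lower()


kept = retrieve_and_filter("Milvus problems we hit before", fake_search, fake_judge)
print(kept)
```

Because the judge runs inside the agent loop, no second model client or API key is needed; only the surviving hits reach the main context.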

What We Built Instead with CLI and Skills

We see CLI + Skills as the direction for agent-tool interaction — not just for memory retrieval, but across the stack. This conviction drives everything we’re building.

memsearch: A Skills-Based Memory Layer for AI Agents

We built memsearch, an open-source memory layer for Claude Code and other AI agents. The Skill runs inside a subagent with three stages: Milvus handles the initial vector search for broad discovery, the agent’s own LLM evaluates relevance and expands context for promising hits, and a final drill-down accesses original conversations only when needed. Noise gets discarded at each stage — intermediate retrieval junk never reaches the primary context window.

The key insight: the agent’s intelligence is part of the tool’s execution. The LLM already in the loop does the filtering — no second model, no extra API key, no brittle threshold tuning. This is a specific use case — conversation-context retrieval for coding agents — but the architecture generalizes to any scenario where a tool needs judgment, not just execution.

Zilliz CLI, Skills, and Plugin for Vector Database Operations

Milvus is the world’s most widely adopted open-source vector database with 43K+ stars on GitHub. Zilliz Cloud is the fully managed service of Milvus with advanced enterprise features and is much faster than Milvus.

The same layered architecture mentioned above drives our developer tools:

Zilliz CLI is the infrastructure layer. Cluster management, collection operations, vector search, RBAC, backups, billing — everything you’d do in the Zilliz Cloud console, available from the terminal. Humans and agents use the same commands. Zilliz CLI also serves as the foundation for Milvus Skills and Zilliz Skills.
Milvus Skill is the knowledge layer for open-source Milvus. It teaches AI coding agents (Claude Code, Cursor, Codex, GitHub Copilot) to operate any Milvus deployment — Milvus Lite, Standalone, or Distributed — through pymilvus Python code: connections, schema design, CRUD, hybrid search, full-text search, RAG pipelines.
Zilliz Skill does the same for Zilliz Cloud, teaching agents to manage cloud infrastructure through Zilliz CLI.
Zilliz Plugin is the developer experience layer for Claude Code — wraps CLI + Skill into a guided experience with slash commands like /zilliz:quickstart and /zilliz:status.

CLI handles execution, Skills encode knowledge and workflow logic, Plugin delivers the UX. No MCP server in the loop.

Is MCP Actually Dying?

A lot of developers and companies, including us here at Zilliz, are turning to CLI and Skills. But is MCP really dying?

The short answer: no — but its scope is narrowing to where it actually fits.

MCP has been donated to the Linux Foundation. Active servers number over 10,000. SDK monthly downloads sit at 97 million. An ecosystem that size doesn’t disappear because of a conference comment.

A Hacker News thread — “When does MCP make sense vs CLI?” — drew responses that mostly favored CLI: “CLI tools are like precision instruments,” “CLIs also feel snappier than MCPs.” Some developers hold a more balanced view: Skills are a detailed recipe that helps you solve a problem better; MCP is the tool that helps you solve the problem. Both have their place.

That’s fair — but it raises a practical question. If the recipe itself can direct the agent on which tools to use and how, is a separate tool-distribution protocol still necessary?

It depends on the use case.

MCP over stdio — the version most developers run locally — is where the problems accumulate: unstable inter-process communication, messy environment isolation, high token overhead. In that context, better alternatives exist for almost every use case.

MCP over HTTP is a different story. Enterprise internal tooling platforms need centralized permission management, unified OAuth, standardized telemetry and logging. Fragmented CLI tools genuinely struggle to provide these. MCP’s centralized architecture has real value in that context.

What Perplexity actually dropped was primarily the stdio use case. Denis Yarats specified “internally” and didn’t call for industry-wide adoption of that choice. That nuance got lost in transmission — “Perplexity abandons MCP” spreads considerably faster than “Perplexity deprioritizes MCP over stdio for internal tool integration.”

MCP emerged because it solved a real problem: before it, every AI application wrote its own tool-calling logic, with no shared standard. MCP provided a unified interface at the right moment, and the ecosystem built quickly. Production experience then surfaced the limitations. That’s a normal arc for infrastructure tooling — not a death sentence.

When to Use MCP, CLI, or Skills

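For teams choosing their agentic AI tooling stack, here’s how the layers fit: CLI tools for operational tasks, Skills for agent workflow logic, and MCP over HTTP for enterprise tool platforms that need centralized OAuth, permissions, and audit logging.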

Get Started

Everything we’ve discussed in this article is available today:

memsearch — the Skills-based memory layer for AI agents. Drop it into Claude Code or any agent that supports Skills.
Zilliz CLI — manage Milvus and Zilliz Cloud from your terminal. Install it and explore the subcommands your agents can use.
Milvus Skill and Zilliz Skill — give your AI coding agent native Milvus and Zilliz Cloud knowledge.

Have questions about vector search, agent architecture, or building with CLI and Skills? Join the Milvus Discord community or book a free Office Hours session to talk through your use case.

Ready to build? Sign up for Zilliz Cloud — new accounts with a work email get $100 in free credits. Already have an account? Sign in here.

Frequently Asked Questions

What is wrong with MCP for AI agents?

MCP has three main architectural limitations in production. First, it loads all tool schemas into the context window at session start — connecting just three MCP servers on a 200K-token model can consume over 70% of available context before the agent does anything. Second, MCP tools are passive: they wait to be called and can’t encode multi-step workflows, error-handling logic, or standard operating procedures. Third, MCP servers run as separate processes with no access to the agent’s LLM, so any tool that needs judgment (like filtering search results for relevance) requires configuring a separate model with its own API key. These problems are most acute with MCP over stdio; MCP over HTTP mitigates some of them.

What is the difference between MCP and Agent Skills?

MCP is a tool-calling protocol that defines how an agent discovers and invokes external tools. An Agent Skill is a Markdown file containing a full standard operating procedure — triggers, step-by-step instructions, error handling, and escalation rules. The key architectural difference: Skills run inside the agent’s process, so they can leverage the agent’s own LLM for judgment calls like relevance filtering or result reranking. MCP tools run in a separate process and can’t access the agent’s intelligence. Skills also use progressive disclosure — only lightweight metadata loads at startup, with full content loading on demand — keeping context window usage minimal compared to MCP’s upfront schema loading.

When should I still use MCP instead of CLI or Skills?

MCP over HTTP still makes sense for enterprise tooling platforms where you need centralized OAuth, role-based access control, standardized telemetry, and audit logging across many internal tools. Fragmented CLI tools struggle to provide these enterprise requirements consistently. For local development workflows — where agents interact with tools on your machine — CLI + Skills typically offers better performance, lower context overhead, and more flexible workflow logic than MCP over stdio.

How do CLI tools and Agent Skills work together?

CLI provides the execution layer (the actual commands), while Skills provide the knowledge layer (when to run which commands, in what order, and how to handle failures). For example, Zilliz CLI handles infrastructure operations like cluster management, collection CRUD, and vector search. Milvus Skill teaches the agent the right pymilvus patterns for schema design, hybrid search, and RAG pipelines. The CLI does the work; the Skill knows the workflow. This layered pattern — CLI for execution, Skills for knowledge, a plugin for UX — is how we’ve structured all of our developer tooling at Zilliz.

MCP vs Skills vs CLI: when should I use each?

CLI tools like git, docker, or zilliz-cli are best for operational tasks — they expose hierarchical subcommands and load on demand. Skills like milvus-skill are best for agent workflow logic — they carry operating procedures, error recovery, and can access the agent’s LLM. MCP over HTTP still fits enterprise tool platforms needing centralized OAuth, permissions, and audit logging. MCP over stdio — the local version — is being replaced by CLI + Skills in most production setups.

Harness Engineering: The Execution Layer AI Agents Actually Need

Milvus — Wed, 17 Jun 2026 09:22:05 GMT

Mitchell Hashimoto built HashiCorp and co-created Terraform. In February 2026, he published a blog post describing a habit he’d developed while working with AI agents: every time an agent made a mistake, he engineered a permanent fix into the agent’s environment. He called it “engineering the harness.” Within weeks, OpenAI and Anthropic published engineering articles expanding on the idea. The term Harness Engineering had arrived.

It resonated because it names a problem every engineer building AI agents has already hit. Prompt engineering gets you better single-turn outputs. Context engineering manages what the model sees. But neither addresses what happens when an agent runs autonomously for hours, making hundreds of decisions without supervision. That’s the gap Harness Engineering fills, and it almost always depends on hybrid search (combined full-text and semantic search) to work.

What Is Harness Engineering?

Harness Engineering is the discipline of designing the execution environment around an autonomous AI agent. It defines which tools the agent can call, where it gets information, how it validates its own decisions, and when it should stop.

To understand why it matters, consider three layers of AI agent development:

Prompt Engineering optimizes the quality of a single exchange — phrasing, structure, examples. One conversation, one output.

Context Engineering manages how much information the model can see at once — which documents to retrieve, how to compress history, what fits in the context window and what gets dropped.

Harness Engineering builds the world the agent operates in. Tools, knowledge sources, validation logic, architectural constraints — everything that determines whether an agent can run reliably across hundreds of decisions without human supervision.

Three layers of AI agent development: Prompt Engineering optimizes what you say, Context Engineering manages what the model sees, and Harness Engineering designs the execution environment

The first two layers shape the quality of a single turn. The third shapes whether an agent can operate for hours without you watching.

These aren’t competing approaches. They’re a progression. As agent capability grows, the same team moves through all three — often within a single project.

How OpenAI Used Harness Engineering to Build a Million-Line Codebase and Lessons They Learnt

OpenAI ran an internal experiment that puts Harness Engineering in concrete terms. They described it in their engineering blog post, “Harness Engineering: Leveraging Codex in an Agent-First World”. A three-person team started with an empty repository in late August 2025. For five months, they wrote no code themselves — every line was generated by Codex, OpenAI’s AI-powered coding agent. The result: one million lines of production code and 1,500 merged pull requests.

The interesting part isn’t the output. It’s the four problems they hit and the harness-layer solutions they built.

Problem 1: No Shared Understanding of the Codebase

What abstraction layer should the agent use? What are the naming conventions? Where did last week’s architecture discussion land? Without answers, the agent guessed — and guessed wrong — repeatedly.

The first instinct was a single AGENTS.md file containing every convention, rule, and historical decision. It failed for four reasons. Context is scarce, and a bloated instruction file crowded out the actual task. When everything is marked important, nothing is. Documentation rots — rules from week two become wrong by week eight. And a flat document can’t be mechanically verified.

The fix: shrink AGENTS.md to 100 lines. Not rules — a map. It points to a structured docs/ directory containing design decisions, execution plans, product specs, and reference docs. Linters and CI verify that cross-links stay intact. The agent navigates to exactly what it needs.
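The "linters and CI verify that cross-links stay intact" step can be sketched in a few lines. This is a minimal illustration, not OpenAI's tooling: the function name `find_broken_links` and the regex are assumptions, and it only checks inline Markdown links that point at files in the repository.

```python
import re
from pathlib import Path

# Matches inline Markdown links like [text](target). A simplification:
# reference-style links are not handled.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_links(root: Path) -> list[tuple[str, str]]:
    """Return (file, target) pairs for relative links that point nowhere."""
    broken = []
    for md_file in sorted(root.rglob("*.md")):
        for target in LINK_RE.findall(md_file.read_text(encoding="utf-8")):
            # Skip external URLs and in-page anchors; only check repo paths.
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue
            path_part = target.split("#", 1)[0]
            if not (md_file.parent / path_part).exists():
                broken.append((str(md_file.relative_to(root)), target))
    return broken
```

Run in CI, a non-empty result would fail the build, so a map-style AGENTS.md cannot silently point at documents that have moved or been deleted.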

The underlying principle: if something isn’t in context at runtime, it doesn’t exist for the agent.

Problem 2: Human QA Couldn’t Keep Pace with Agent Output

The team plugged Chrome DevTools Protocol into Codex. The agent could screenshot UI paths, observe runtime events, and query logs with LogQL and metrics with PromQL. They set a concrete threshold: a service had to start in under 800 milliseconds before a task was considered complete. Codex tasks ran for over six hours at a stretch — typically while engineers slept.

Problem 3: Architectural Drift Without Constraints

Without guardrails, the agent reproduced whatever patterns it found in the repo — including bad ones.

The fix: strict layered architecture with a single enforced dependency direction — Types → Config → Repo → Service → Runtime → UI. Custom linters enforced these rules mechanically, with error messages that included the fix instruction inline.

Strict layered architecture with one-way dependency validation: Types at the base, UI at the top, custom linters enforce rules with inline fix suggestions

In a human team, this constraint usually arrives when a company scales to hundreds of engineers. For a coding agent, it’s a prerequisite from day one. The faster an agent moves without constraints, the worse the architectural drift.
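A one-way dependency rule like this is simple to check mechanically. The sketch below is hypothetical: it assumes each module is tagged with its layer name and that a layer may import only from itself or from layers earlier in the chain (Types at the base, UI at the top). The error message carries the fix hint inline, as the article describes.

```python
from typing import Optional

# Layer order taken from the article. Assumption: a module may depend only
# on its own layer or on layers that come earlier in this list.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}


def check_dependency(importer: str, imported: str) -> Optional[str]:
    """Return an error message with an inline fix hint, or None if allowed."""
    if RANK[imported] <= RANK[importer]:
        return None
    return (
        f"'{importer}' must not import '{imported}': dependencies flow "
        f"{' -> '.join(LAYERS)}. Fix: move the shared code into "
        f"'{importer}' or a lower layer, or call it from '{imported}' instead."
    )
```

For example, `check_dependency("service", "repo")` returns `None`, while `check_dependency("types", "ui")` returns a message the agent can act on without further lookup.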

Problem 4: Silent Technical Debt

The solution: encode the project’s core principles into the repository, then run background Codex tasks on a schedule to scan for deviations and submit refactoring PRs. Most merged automatically within a minute — small continuous payments rather than periodic reckoning.

Why AI Agents Can’t Grade Their Own Work

OpenAI’s experiment proved Harness Engineering works. But separate research exposed a failure mode inside it: agents are systematically bad at evaluating their own output.

The problem appears in two forms.

Context anxiety. As the context window fills, agents begin wrapping up tasks prematurely — not because the work is done, but because they sense the window limit approaching. Cognition, the team behind the AI coding agent Devin, documented this behavior while rebuilding Devin for Claude Sonnet 4.5: the model became aware of its own context window and started taking shortcuts well before actually running out of room.

Their fix was pure harness engineering. They enabled the 1M-token context beta but capped actual usage at 200K tokens — tricking the model into believing it had ample runway. The anxiety vanished. No model change required; just a smarter environment.

The most common general mitigation is compaction: summarize history and let the same agent continue with compressed context. This preserves continuity but doesn’t eliminate the underlying behavior. An alternative is context reset: clear the window, spin up a fresh instance, and hand off state through a structured artifact. This removes the anxiety trigger entirely but demands a complete handoff document — gaps in the artifact mean gaps in the new agent’s understanding.
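The difference between the two mitigations is easier to see side by side. The following is a rough sketch under stated assumptions: `Handoff`, `compact`, and `reset` are illustrative names, and `summarize` stands in for whatever model call produces the summary.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Structured artifact passed to a fresh agent after a context reset."""
    goal: str
    completed: list[str] = field(default_factory=list)
    remaining: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)


def compact(history: list[str], keep_last: int, summarize) -> list[str]:
    """Compaction: replace older turns with one summary, keep recent turns."""
    if len(history) <= keep_last:
        return list(history)
    split = len(history) - keep_last
    older, recent = history[:split], history[split:]
    return [summarize(older)] + recent


def reset(handoff: Handoff) -> list[str]:
    """Context reset: a new window seeded only from the handoff artifact."""
    return [
        f"Goal: {handoff.goal}",
        "Done: " + "; ".join(handoff.completed),
        "Remaining: " + "; ".join(handoff.remaining),
        "Notes: " + "; ".join(handoff.notes),
    ]
```

With compaction the same agent continues and its recent turns survive verbatim. With a reset, anything missing from the `Handoff` fields is lost to the new agent, which is why the artifact has to be complete.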

Self-evaluation bias. When agents assess their own output, they score it high. Even on tasks with objective pass/fail criteria, the agent spots a problem, talks itself into believing it’s not serious, and approves work that should fail.

The fix borrows from GANs (Generative Adversarial Networks): separate the generator from the evaluator completely. In a GAN, two neural networks compete — one generates, one judges — and that adversarial tension forces quality up. The same dynamic applies to multi-agent systems.

Anthropic tested this with a three-agent harness — Planner, Generator, Evaluator — against a solo agent on the task of building a 2D retro game engine. They describe the full experiment in “Harness Design for Long-Running Application Development” (Anthropic, 2026). The Planner expands a short prompt into a full product spec, deliberately leaving implementation details unspecified — early over-specification cascades into downstream errors. The Generator implements features in sprints, but before writing code, it signs a sprint contract with the Evaluator: a shared definition of “done.” The Evaluator uses Playwright (Microsoft’s open-source browser automation framework) to click through the application like a real user, testing UI, API, and database behavior. If anything fails, the sprint fails.
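The Generator/Evaluator separation can be sketched as a small control loop. This is not Anthropic's implementation: `run_sprint`, the callable signatures, and the retry limit are assumptions made for illustration. The one property it preserves from the description is that the Evaluator, not the Generator, decides whether the sprint contract is met.

```python
from typing import Callable


def run_sprint(
    contract: list[str],
    generate: Callable[[list[str], list[str]], str],
    evaluate: Callable[[str, list[str]], list[str]],
    max_attempts: int = 3,  # hypothetical retry budget
) -> tuple[bool, str]:
    """Run one sprint: generate, evaluate against the contract, retry on failure."""
    feedback: list[str] = []
    build = ""
    for _ in range(max_attempts):
        build = generate(contract, feedback)
        # The evaluator returns the contract items that failed; empty means pass.
        feedback = evaluate(build, contract)
        if not feedback:
            return True, build
    return False, build
```

In a real harness, `generate` and `evaluate` would be separate agent instances with separate contexts, and `evaluate` would drive the running application rather than read the code.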

The solo agent produced a game that technically launched, but entity-to-runtime connections were broken at the code level — discoverable only by reading the source. The three-agent harness produced a playable game with AI-assisted level generation, sprite animation, and sound effects.

Comparison of solo agent versus three-agent harness: solo agent ran 20 minutes at nine dollars with broken core functionality, while the full harness ran 6 hours at two hundred dollars producing a fully functional game with AI-assisted features

The three-agent architecture cost roughly 20x more. The output crossed from unusable to usable. That’s the core trade Harness Engineering makes: structural overhead in exchange for reliability.

The Retrieval Problem Inside Every Agent Harness

Both patterns — the structured docs/ system and the Generator/Evaluator sprint cycle — share a silent dependency: the agent must find the right information from a live, evolving knowledge base when it needs it.

This is harder than it looks. Take a concrete example: the Generator is executing Sprint 3, implementing user authentication. Before writing code, it needs two kinds of information.

First, a semantic search query: what are this product’s design principles around user sessions? The relevant document might use “session management” or “access control” — not “user authentication.” Without semantic understanding, retrieval misses it.

Second, an exact-match query: which documents reference the validateToken function? A function name is an arbitrary string with no semantic meaning. Embedding-based retrieval can’t reliably find it. Only keyword matching works.

These two queries happen simultaneously. They can’t be separated into sequential steps.

Pure vector search fails on exact match. Traditional BM25 fails on semantic queries and can’t predict which vocabulary a document will use. Before Milvus 2.5, the only option was two parallel retrieval systems — a vector index and a full-text index — running concurrently at query time with custom result-fusion logic. For a live docs/ repository with continuous updates, both indexes had to stay in sync: every document change triggered reindexing in two places, with the constant risk of inconsistency.

How Milvus 2.6 Solves Agent Retrieval with a Single Hybrid Pipeline

Milvus is an open-source vector database designed for AI workloads. Milvus 2.6’s Sparse-BM25 collapses the dual-pipeline retrieval problem into a single system.

At ingest, Milvus generates two representations simultaneously: a dense embedding for semantic retrieval and a TF-encoded sparse vector for BM25 scoring. Global IDF statistics update automatically as documents are added or removed — no manual reindex triggers. At query time, a natural-language input generates both query vector types internally. Reciprocal Rank Fusion (RRF) merges the ranked results, and the caller receives a single unified result set.
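To make the fusion step concrete, here is a standalone sketch of Reciprocal Rank Fusion in plain Python. It illustrates the scoring rule only and is not Milvus code. The document IDs are made up, and the constant k = 60 is a commonly used value assumed here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for the sprint example: one semantic, one keyword list.
dense_hits = ["session-design", "auth-api", "access-control"]
bm25_hits = ["auth-api", "token-utils"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Because RRF uses only rank positions, it never has to reconcile cosine similarities with BM25 scores, which live on different scales. A document that appears in both lists, such as `auth-api` here, rises to the top.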

Before and after: two separate systems with manual sync, fragmented results, and custom fusion logic versus Milvus 2.6 single pipeline with dense embedding, Sparse BM25, RRF fusion, and automatic IDF maintenance producing unified results

One interface. One index to maintain.

On the BEIR benchmark — a standard evaluation suite covering 18 heterogeneous retrieval datasets — Milvus achieves 3–4x higher throughput than Elasticsearch at equivalent recall, with up to 7x QPS improvement on specific workloads. For the sprint scenario, a single query finds both the session design principle (semantic path) and every document mentioning validateToken (exact path). The docs/ repository updates continuously; BM25 IDF maintenance means a newly written document participates in the next query’s scoring without any batch rebuild.

This is the retrieval layer built for exactly this class of problem. When an agent harness needs to search a living knowledge base — code documentation, design decisions, sprint history — single-pipeline hybrid search isn’t a nice-to-have. It’s what makes the rest of the harness work.

The Best Harness Components Are Designed to Be Deleted

Every component in a harness encodes an assumption about model limitations. Sprint decomposition was necessary when models lost coherence on long tasks. Context reset was necessary when models experienced anxiety near the window limit. Evaluator agents became necessary when self-evaluation bias was unmanageable.

These assumptions expire. Cognition’s context-window trick may become unnecessary as models develop genuine long-context stamina. As models continue to improve, other components will become unnecessary overhead that slows agents down without adding reliability.

Harness Engineering isn’t a fixed architecture. It’s a system recalibrated with every new model release. The first question after any major upgrade isn’t “what can I add?” It’s “what can I remove?”

The same logic applies to retrieval. As models handle longer contexts more reliably, chunking strategies and retrieval timing will shift. Information that needs careful fragmentation today may be ingestible as full pages tomorrow. The retrieval infrastructure adapts alongside the model.

Every component in a well-built harness is waiting to be made redundant by a smarter model. That’s not a problem. That’s the goal.

Get Started with Milvus

If you’re building agent infrastructure that needs hybrid retrieval — semantic and keyword search in one pipeline — here’s where to start:

Read the Milvus 2.6 release notes for full details on Sparse-BM25, automatic IDF maintenance, and performance benchmarks.
Join the Milvus community to ask questions and share what you’re building.
Book a free Milvus Office Hours session to walk through your use case with a vector database expert.
If you’d rather skip infrastructure setup, Zilliz Cloud (fully managed Milvus) offers a free tier to get started, with $100 in free credits when you register with a work email.
Star us on GitHub: milvus-io/milvus — 43k+ stars and growing.

Frequently Asked Questions

What is harness engineering and how is it different from prompt engineering?

Prompt engineering optimizes what you say to a model in a single exchange — phrasing, structure, examples. Harness Engineering builds the execution environment around an autonomous AI agent: the tools it can call, the knowledge it can access, the validation logic that checks its work, and the constraints that prevent architectural drift. Prompt engineering shapes one conversation turn. Harness Engineering shapes whether an agent can operate reliably for hours across hundreds of decisions without human supervision.

Why do AI agents need both vector search and BM25 at the same time?

Agents must answer two fundamentally different retrieval queries simultaneously. Semantic queries — what are our design principles around user sessions? — require dense vector embeddings to match conceptually related content regardless of vocabulary. Exact-match queries — which documents reference the validateToken function? — require BM25 keyword scoring, because function names are arbitrary strings with no semantic meaning. A retrieval system that handles only one mode will systematically miss queries of the other type.

How does Milvus Sparse-BM25 work for agent knowledge retrieval?

At ingest, Milvus generates a dense embedding and a TF-encoded sparse vector for each document simultaneously. Global IDF statistics update in real time as the knowledge base changes — no manual reindexing required. At query time, both vector types are generated internally, Reciprocal Rank Fusion merges the ranked results, and the agent receives a single unified result set. The entire pipeline runs through one interface and one index — critical for continuously updated knowledge bases like a code documentation repository.

When should I add an evaluator agent to my agent harness?

Add a separate Evaluator when your Generator’s output quality cannot be verified by automated tests alone, or when self-evaluation bias has caused missed defects. The key principle: the Evaluator must be architecturally separate from the Generator — shared context reintroduces the same bias you’re trying to eliminate. The Evaluator should have access to runtime tools (browser automation, API calls, database queries) to test behavior, not just review code. Anthropic’s research found that this GAN-inspired separation moved output quality from “technically launches but broken” to “fully functional with features the solo agent never attempted.”

Interview with RaBitQ Authors: The TurboQuant Dispute and Why the Storage Selloff Was a False Alarm

Milvus — Wed, 17 Jun 2026 09:18:44 GMT

Google’s TurboQuant paper claimed 6x compression, 8x speedup, and near-zero accuracy loss for vector representations. After it was released, memory and storage stocks fell sharply, and major tech outlets quickly turned it into a headline story.

The market reaction was only the start. Researchers soon began asking whether the paper’s claims were overstated and whether it treated prior work — especially RaBitQ — fairly. The dispute pushed vector quantization back into the spotlight, partly because the same underlying ideas now matter in two critical parts of the AI stack: vector search systems and KV-cache compression for large models.

To understand both the technical debate and what it means for production systems, we spoke with Cheng Long, Associate Professor at NTU Singapore and head of VectorDB@NTU; Jianyang Gao, first author of RaBitQ; and Li Liu, Director of Engineering at Zilliz. The conversation covered vector quantization itself, the questions raised around TurboQuant, and why this matters for large-scale vector retrieval and for systems like Milvus, one of the most popular open-source vector databases.

Related reading: If you want the engineering side rather than the interview, see our companion article on how vector quantization affects AI infrastructure costs.

Why did vector quantization suddenly become such a big topic?

Zilliz: Before we get into the controversy, could you start by explaining what vector quantization is and why it has become so important in AI?

Cheng Long: Vector quantization is a technique for data compression and approximate representation. It originally came from signal processing, where it was used for image and audio compression. In modern AI systems, its role has changed because vectors have become one of the basic units of computation.

Today, its importance is clearest in two places.

One is real-time search over collections with billions or even tens of billions of vectors. In semantic retrieval systems, the core task is similarity search over high-dimensional vectors. But raw vectors are large, and floating-point computation is expensive. At scale, that makes it difficult to deliver millisecond-level latency. Vector quantization helps by compressing vectors into low-bit representations and speeding up distance computation. That is why it matters for practical workloads such as single-vector search, multi-vector search, and index design in Milvus search architecture.

The other is KV cache compression for large models. KV cache reduces redundant computation during generation, but the memory cost grows quickly as context gets longer. So the problem becomes how to compress those vectors without hurting output quality too much. At its core, that is also a vector quantization problem.

Zilliz: If vector quantization becomes more widely used — and if TurboQuant’s results hold up — does that mean storage demand drops sharply?

Jianyang Gao: Under the same model and the same workload, compression can reduce storage demand. But that does not justify the broader conclusion people jumped to.

When TurboQuant talks about 6x compression and 8x speedup, it is comparing against a basic 16-bit/32-bit baseline. That is not the same as comparing against other methods in the same category. So the real effect still needs to be evaluated more carefully.

Zilliz: Then from that perspective, if the market reaction were really about the technology itself, should it have happened much earlier, when similar ideas had already appeared?

Cheng Long: From a technical point of view, you could say that similar theoretical territory had already been reached before. But markets do not move in sync with research. There is usually a lag between academic results, engineering adoption, and financial interpretation.

And over a longer horizon, the effect may not even be linear. Compression can make it possible to run large models on smaller devices, which can create new demand rather than simply reduce it. The relationship between technology and markets is more complicated than a straight-line extrapolation.

How did RaBitQ emerge, and what did it contribute?

Zilliz: How did you first arrive at the idea for RaBitQ?

Jianyang Gao: We started from a gap we saw in vector databases. Traditional methods such as Product Quantization worked well empirically, but they offered very little in the way of theoretical guarantees.

At the time, I was studying high-dimensional probability at NTU Singapore, and that led me to ask whether we could build a method that was not only practical, but also came with a clear theoretical guarantee. That was the starting point for RaBitQ.

Zilliz: What do you see as RaBitQ’s core originality?

Jianyang Gao: Its key idea was to use a random rotation (a.k.a. a Johnson-Lindenstrauss transformation) to make the distribution of vector coordinates more uniform and more predictable.

Once you have that, you can derive an optimal quantization estimator on top of it. We then gave a strict proof that it reaches the theoretical lower bound.

Earlier work had also tried to introduce random rotation. But from our perspective, those methods did not achieve the effect we were looking for because of practical issues in algorithm design.

Zilliz: From an engineering perspective, what stood out most to you about RaBitQ?

Li Liu: We had worked with many quantization algorithms, from scalar quantization methods to PQ and other variants. What stood out about RaBitQ was that it changed how people approached the problem.

Before that, much of the field was still fairly empirical. You could say a method seemed to work, but it was harder to explain clearly why. RaBitQ approached the problem in a much more mathematical way. The method felt elegant and, in a sense, simple. That way of thinking influenced a lot of later work.

Zilliz: Put simply, how much can it save in memory and cost?

Li Liu: At the same recall level, moving from 4-bit compression to 2-bit compression cuts memory use by half.

And it is not just about compression. Its performance compares favorably with earlier approaches, and that matters in production environments where teams care about both memory efficiency and retrieval quality. That is why it matters for systems that need to balance dense vector storage, throughput, and recall.
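As a back-of-the-envelope illustration of that halving, the sketch below computes the size of the quantized codes alone. The workload numbers are hypothetical, and real indexes add overhead (IDs, centroids, any refinement data) that is ignored here.

```python
def index_bytes(num_vectors: int, dim: int, bits_per_dim: float) -> float:
    """Approximate bytes for the quantized codes alone (no index overhead)."""
    return num_vectors * dim * bits_per_dim / 8


n, d = 100_000_000, 768  # hypothetical workload: 100M vectors, 768 dimensions
for bits in (32, 4, 2):
    print(f"{bits:>2}-bit: {index_bytes(n, d, bits) / 2**30:.1f} GiB")
```

Code size scales linearly with bits per dimension, so going from 4 bits to 2 bits halves it regardless of the collection size or dimensionality.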

Zilliz: Beyond Milvus, where do you see RaBitQ being used today?

Cheng Long: First, I want to thank the Milvus team, because they were among the earliest to adopt RaBitQ. We also had a lot of discussions and some collaborative research along the way.

RaBitQ has also been adopted in other systems, including Meta’s FAISS, VSAG, VectorChord, Volcengine OpenSearch, CockroachDB, Elasticsearch, Lucene, and turbopuffer. What stands out on the Milvus side is that the team shipped IVF_RABITQ as a real index option in Milvus 2.6, alongside broader work on collection management, IVF-based indexing, and HNSW-based indexing.

How should we evaluate TurboQuant?

Zilliz: In your public response, you said TurboQuant had some serious issues. What, in your view, were the main ones?

Jianyang Gao: We see three main problems.

One is the way the paper describes prior work and discusses overlap. The TurboQuant paper misrepresents the methodology of RaBitQ, ignoring the most similar parts, such as the Johnson-Lindenstrauss transformation. Another is the way the paper characterizes the theoretical result. It describes RaBitQ as suboptimal without providing any explanation or evidence, when RaBitQ is in fact optimal. The third is the fairness of the experimental comparison. They used a single-core CPU to evaluate RaBitQ while using an A100 GPU to evaluate TurboQuant.

Zilliz: Let’s take the benchmark issue first. Why do you think the comparison was not fair?

Jianyang Gao: Benchmark claims only mean something if the setup is comparable. If one system is tested under a very different hardware or software environment, then the result may reflect the setup more than the algorithm itself.

In our view, differences in processor choice, implementation language, and optimization level can make a major difference. That is why benchmark methodology needs to be interpreted very carefully, especially by teams building production retrieval systems.

Cheng Long: The paper also made some other claims that do not hold.

For example, the paper says that RaBitQ cannot be vectorized. But RaBitQ had already open-sourced code with SIMD-based vectorized computation when the 2024 paper was published. So from our perspective, that statement was factually incorrect.

It is also worth mentioning that we began working with NVIDIA last year and completed a GPU implementation of RaBitQ. The related code is under review for inclusion in NVIDIA’s cuVS library.

Zilliz: Milvus evaluated TurboQuant in the second half of 2025 but did not adopt it. What did your team see in testing?

Li Liu: It does contain one useful idea. In our view, it makes a small optimization in how the quantization grid is allocated. But the most important step in the method — using random rotation for quantization — was first introduced by RaBitQ.

And when it comes to unbiased estimation, RaBitQ’s approach is cleaner and its theoretical derivation is stronger.

That said, because this was a result from Google, we tested it in 2025. In our lab, under a standardized CPU environment, TurboQuant did not outperform our internal RaBitQ version in most of the cases we evaluated. So when the market reacted so strongly, we were genuinely surprised.

Zilliz: For readers who have not looked closely at either paper, could you walk through where RaBitQ and TurboQuant overlap in plain language?

Li Liu: At a high level, both methods begin with random rotation. Mathematically, that means multiplying the vector by a random orthogonal matrix. You can think of it as changing your viewing angle in a high-dimensional space. It does not change the relative positions of the data points, but it distributes the information across dimensions more evenly.

After that comes quantization. You divide the continuous real-valued space into 2^k grid cells, where k is the number of quantization bits, and then map each vector element to a nearby grid point. TurboQuant makes a small adjustment here by allocating the grid according to the data distribution instead of distributing it evenly.

The last step is error estimation, and this is where RaBitQ’s main contribution lies. Traditional methods calculate directly from the quantized values, which makes the error harder to control. RaBitQ estimates the quantization error more precisely, and that is where its mathematical optimality comes from. TurboQuant’s solution is more complicated, and in our setting the tradeoff did not look as attractive.
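The first two steps of that walkthrough, rotate and then snap to a 2^k grid, can be sketched in plain Python. This is a generic illustration only. It uses an evenly spaced grid over an assumed range of [-1, 1] and a naive reconstruction, so it implements neither RaBitQ's estimator nor TurboQuant's distribution-aware grid. All function names are hypothetical.

```python
import math
import random


def random_rotation(dim: int, rng: random.Random) -> list[list[float]]:
    """Build a random orthogonal matrix by Gram-Schmidt on Gaussian rows."""
    rows: list[list[float]] = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove components along rows already chosen
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def rotate(matrix: list[list[float]], vec: list[float]) -> list[float]:
    """Multiply the vector by the rotation matrix."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def quantize(vec: list[float], bits: int, lo: float = -1.0, hi: float = 1.0) -> list[int]:
    """Map each coordinate to the nearest of 2**bits evenly spaced grid points."""
    step = (hi - lo) / (2 ** bits - 1)
    return [round((min(max(x, lo), hi) - lo) / step) for x in vec]


def dequantize(codes: list[int], bits: int, lo: float = -1.0, hi: float = 1.0) -> list[float]:
    """Naive reconstruction: return the grid point each code refers to."""
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + c * step for c in codes]
```

The rotation preserves lengths and inner products, which is why it can be applied before quantization without changing which vectors are near each other. The quality differences discussed above come from how distances are then estimated from the codes, a step this sketch leaves out.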

Why is attribution so hard to resolve in practice?

Zilliz: After you published your public statement, how did Google and ICLR respond?

Cheng Long: ICLR did not take action. We emailed them during the review period in September last year, but did not receive a response. We wrote again in March this year and were told to post comments on OpenReview, but beyond that there was no action.

As for Google, one of the co-authors replied a few days ago. The reply said they would revise the arXiv version to correct its inaccurate description of RaBitQ’s optimality.

Zilliz: Earlier the discussion was framed around academic misconduct. Now it also sounds like a question of imbalance and who gets to shape the story. Why is it so hard to defend your work?

Cheng Long: One problem is scale. AI conferences are now so large that a single cycle can bring in tens of thousands of papers. Organizers simply do not have the capacity to handle every dispute of this kind.

The other problem is imbalance. Large companies have a much stronger public voice. Independent researchers or smaller teams do not have the same communication power.

Jianyang Gao: For individuals, the cost is extremely high. Professor Long and I have barely been able to work normally in recent weeks.

The process itself has also been frustrating. We were firmly rejected when we contacted the authors, and we received no response from the conference organizers. In practice, many researchers look at situations like this and decide to let them go. But that is also how many original contributions disappear from the public narrative.

Zilliz: It sounds like this is not the first time your team has run into this kind of problem.

Cheng Long: No, it is not.

We have seen cases before where companies take RaBitQ, make a few engineering modifications, give it a new name, and then describe it only as something inspired by RaBitQ.

That is why I appreciate the way some industry teams handle this, including Milvus. When they use RaBitQ, they describe it objectively. And when they add optimizations beyond the original version, they explain those clearly as their own engineering contribution. That gives proper credit to the original work while also showing the company’s technical strength.

Zilliz: When large companies build on academic work, do they usually provide any financial sharing or benefit allocation?

Jianyang Gao: In most cases, no.

That said, large companies still have a strong incentive to present a technical advance as something they created themselves rather than something they adopted from others. Everyone wants customers and investors to see the most advanced work as the result of their own team’s innovation.

What comes next for vector quantization?

Zilliz: What research directions are you working on now?

Cheng Long: A large part of our work will remain focused on vector retrieval.

One direction is to combine RaBitQ with different vector retrieval indexes, such as IVF and HNSW, so the system can support larger-scale data with lower latency, higher concurrency, and lower cost. I am also paying attention to KV cache compression.

Jianyang Gao: KV cache in large models and vector retrieval share many of the same properties, both mathematically and at the systems level, because both deal with high-dimensional vectors.

Going forward, I want to think more about how to apply mathematical tools, including ideas from high-dimensional probability, to accelerate inference and training.

Zilliz: Where is the ceiling for vector quantization as a field? How much room is left for improvement?

Cheng Long: From a theoretical point of view, the ceiling is largely in sight. RaBitQ is already asymptotically optimal.

But there is still a great deal of room on the engineering side. You still have to deal with hardware characteristics, data distribution, latency constraints, and many other practical factors. That is exactly why production systems still need careful work in areas such as distributed vector database architecture, sparse vector support, reranking pipelines, and metric selection in Milvus distance metrics.

Keep Reading

If you want to dig deeper into the engineering side of RaBitQ and how it fits into Milvus, these are the most relevant resources:

IVF_RABITQ documentation — configuration details and tuning guidance.
RaBitQ integration deep dive — how Milvus turned RaBitQ into a production index.
How vector quantization affects AI infrastructure costs — our broader analysis of the TurboQuant-RaBitQ discussion.
Milvus 2.6 release post — where IVF_RABITQ shipped as a real Milvus index option.
Milvus index explained — how IVF_RABITQ fits with other index choices.
IVF_FLAT indexing and HNSW indexing — useful baselines if you are comparing index tradeoffs.
Schema design in Milvus and filtered search — useful if you are evaluating RaBitQ in a real application rather than in isolation.
Milvus quickstart and RAG system design — helpful if you want to try this in a retrieval pipeline.

Join the Milvus Slack community or book Milvus Office Hours if you want to talk through your workload.

If you’d rather skip infrastructure setup, you can sign up for Zilliz Cloud (fully managed Milvus).

We Built Graph RAG Without the Graph Database

Milvus — Tue, 16 Jun 2026 03:24:26 GMT

TL;DR: Do you actually need a graph database for Graph RAG? No. Put entities, relations, and passages into Milvus. Use subgraph expansion instead of graph traversal, and one LLM rerank instead of multi-round agent loops. That’s Vector Graph RAG, and it’s what we built. This approach hits 87.8% average Recall@5 on three multi-hop QA benchmarks and beats HippoRAG 2 on a single Milvus instance.

Multi-hop questions are the wall that most RAG pipelines hit eventually. The answer is in your corpus, but it spans multiple passages connected by entities the question never names. The common fix is to add a graph database, which means running two systems instead of one.

We kept hitting this wall ourselves and didn’t want to run two databases just to handle it. So we built and open-sourced Vector Graph RAG, a Python library that brings multi-hop reasoning to RAG using only Milvus, the most widely adopted open-source vector database. It provides the same multi-hop capability with one database instead of two.

Why Multi-Hop Questions Break Standard RAG

Multi-hop questions break standard RAG because the answer depends on entity relationships that vector search can’t see. The bridge entity connecting the question to the answer often isn’t in the question itself.

Simple questions work fine. You chunk documents, embed them, retrieve the closest matches, and feed them to an LLM. “What indexes does Milvus support?” lives in one passage, and vector search finds it.

Multi-hop questions don’t fit that pattern. Take a question like “What side effects should I watch for with first-line diabetes drugs?” in a medical knowledge base.

Answering it takes two reasoning steps. First, the system has to know that metformin is the first-line drug for diabetes. Only then can it look up metformin’s side effects: kidney function monitoring, GI discomfort, vitamin B12 deficiency.

“Metformin” is the bridge entity. It connects the question to the answer, but the question never mentions it.

That’s where vector similarity search stops. It retrieves passages that look like the question, such as diabetes treatment guides and drug side effect lists, but it can’t follow the entity relationships that link those passages together. Facts like “metformin is the first-line drug for diabetes” live in those relationships, not in the text of any single passage.

Why Graph Databases and Agentic RAG Aren’t the Answer

The standard ways to solve multi-hop RAG are graph databases and iterative agent loops. Both work. Both cost more than most teams want to pay for a single feature.

Take the graph-database route first. You extract triples from your documents, store them in a graph database, and traverse edges to find multi-hop connections. That means running a second system alongside your vector database, learning Cypher or Gremlin, and keeping the graph and vector stores in sync.

Iterative agent loops are the other approach. The LLM retrieves a batch, reasons over it, decides whether it has enough context, and retrieves again if not. IRCoT (Trivedi et al., 2023) makes 3–5 LLM calls per query. Agentic RAG can exceed 10 because the agent decides when to stop. Cost per query becomes unpredictable, and P99 latency spikes whenever the agent runs extra rounds.

Neither fits teams that want multi-hop reasoning without rebuilding their stack. So we tried something else.

What Is Vector Graph RAG? A Graph Structure Inside a Vector Database

Vector Graph RAG is an open-source Python library that brings multi-hop reasoning to RAG using only Milvus. It stores graph structure as ID references across three Milvus collections. Traversal becomes a chain of primary-key lookups in Milvus instead of Cypher queries against a graph database. One Milvus does both jobs.

It works because relationships in a knowledge graph are just text. The triple (metformin, is the first-line drug for, type 2 diabetes) is a directed edge in a graph database. It’s also a sentence: “Metformin is the first-line drug for type 2 diabetes.” You can embed that sentence as a vector and store it in Milvus, the same as any other text.

Answering a multi-hop query means following connections from what the query mentions (like “diabetes”) to what it doesn’t (like “metformin”). That only works if the storage preserves those connections: which entity connects to which through which relation. Plain text is searchable, but not followable.

To keep connections followable in Milvus, we give each entity and each relation a unique ID, then store them in separate collections that reference each other by ID. Three collections in total: entities (the nodes), relations (the edges), and passages (the source text, which the LLM needs for answer generation). Every row has a vector embedding, so we can semantic-search any of the three.

Entities store deduplicated entities. Each has a unique ID, a vector embedding for semantic search, and a list of relation IDs it participates in.

Relations store knowledge triples. Each records its subject and object entity IDs, the passage IDs it came from, and an embedding of the full relationship text.

Passages store original document chunks, with references to the entities and relations extracted from them.

The three collections point at each other through ID fields: entities carry the IDs of their relations, relations carry the IDs of their subject and object entities and source passages, and passages carry the IDs of everything extracted from them. That network of ID references is the graph.

Traversal is just a chain of ID lookups. You fetch entity e01 to get its relation_ids, fetch relations r01 and r02 by those IDs, read r01’s object_id to discover entity e02, and keep going. Each hop is a standard Milvus primary-key query. No Cypher required.
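To make the ID-reference idea concrete, here is a minimal sketch that uses plain Python dictionaries as stand-ins for the entities and relations collections. The field names relation_ids and object_id come from the description above; subject_id, passage_ids, and the sample rows are assumptions for illustration, not the library's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    id: str
    name: str
    relation_ids: list[str] = field(default_factory=list)


@dataclass
class Relation:
    id: str
    text: str
    subject_id: str
    object_id: str
    passage_ids: list[str] = field(default_factory=list)


# In-memory stand-ins for two collections, keyed by primary key (hypothetical rows).
entities = {
    "e01": Entity("e01", "metformin", ["r01", "r02"]),
    "e02": Entity("e02", "type 2 diabetes", ["r01"]),
    "e03": Entity("e03", "kidney function", ["r02"]),
}
relations = {
    "r01": Relation("r01", "Metformin is the first-line drug for type 2 diabetes",
                    subject_id="e01", object_id="e02", passage_ids=["p01"]),
    "r02": Relation("r02", "Patients on metformin should have kidney function monitored",
                    subject_id="e01", object_id="e03", passage_ids=["p02"]),
}


def neighbors(entity_id: str) -> set[str]:
    """Follow relation_ids from one entity to the entities on the other end."""
    found: set[str] = set()
    for rid in entities[entity_id].relation_ids:   # lookup 1: entity by ID
        rel = relations[rid]                       # lookup 2: relation by ID
        found.update({rel.subject_id, rel.object_id} - {entity_id})
    return found


print(sorted(neighbors("e01")))  # ['e02', 'e03']
print(sorted(neighbors("e02")))  # ['e01']
```

In a real deployment, each dictionary lookup would be a primary-key fetch against Milvus (for example via MilvusClient.get with a list of IDs) rather than an in-process read.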

You might wonder whether the extra round trips to Milvus add up. They don’t. Subgraph expansion costs 2–3 ID-based queries totaling 20–30ms. The LLM call takes 1–3 seconds, which makes the ID lookups invisible next to it.

How Vector Graph RAG Answers a Multi-Hop Query

The retrieval flow takes a multi-hop query to a grounded answer in four steps: seed retrieval → subgraph expansion → LLM rerank → answer generation.

We’ll walk through the diabetes question: “What side effects should I watch for with first-line diabetes drugs?”

Step 1: Seed Retrieval

An LLM extracts key entities from the question: “diabetes,” “side effects,” “first-line drug.” Vector search in Milvus finds the most relevant entities and relations directly.

But metformin isn’t among them. The question doesn’t mention it, so vector search can’t find it.

Step 2: Subgraph Expansion

This is where Vector Graph RAG diverges from standard RAG.

The system follows ID references from the seed entities one hop out. It gets the seed entity IDs, finds all relations containing those IDs, and pulls the new entity IDs into the subgraph. Default: one hop.

Metformin, the bridge entity, enters the subgraph.

“Diabetes” has a relation: “Metformin is the first-line drug for type 2 diabetes.” Following that edge brings metformin in. Once metformin is in the subgraph, its own relations come with it: “Patients on metformin should have kidney function monitored,” “Metformin may cause gastrointestinal discomfort,” “Long-term metformin use may lead to vitamin B12 deficiency.”

Two facts that lived in separate passages are now connected through one hop of graph expansion. The bridge entity the question never mentioned is now discoverable.
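The expansion step can be sketched as a small breadth-first walk over ID references. This is an illustration under assumptions, not the library's implementation: the two lookup tables are plain dictionaries standing in for Milvus queries, and the sample IDs are made up.

```python
def expand_subgraph(seed_entity_ids, entity_relations, relation_endpoints, hops=1):
    """Grow a subgraph outward from the seeds by following ID references.

    entity_relations:   entity ID -> list of relation IDs it participates in
    relation_endpoints: relation ID -> (subject entity ID, object entity ID)
    """
    entity_ids = set(seed_entity_ids)
    frontier = set(seed_entity_ids)
    for _ in range(hops):
        reached = set()
        for eid in frontier:
            for rid in entity_relations.get(eid, []):
                reached.update(relation_endpoints[rid])
        frontier = reached - entity_ids   # only entities we have not seen yet
        entity_ids |= frontier
    # Every entity in the subgraph brings its own relations as rerank candidates.
    relation_ids = {rid for eid in entity_ids for rid in entity_relations.get(eid, [])}
    return entity_ids, relation_ids


entity_relations = {
    "diabetes": ["r01", "r04"],
    "metformin": ["r01", "r02", "r03"],
    "kidney function": ["r02"],
    "GI discomfort": ["r03"],
    "eye exams": ["r04"],
}
relation_endpoints = {
    "r01": ("metformin", "diabetes"),
    "r02": ("metformin", "kidney function"),
    "r03": ("metformin", "GI discomfort"),
    "r04": ("diabetes", "eye exams"),
}

ents, rels = expand_subgraph(["diabetes"], entity_relations, relation_endpoints, hops=1)
print("metformin" in ents)  # True
print(sorted(rels))         # ['r01', 'r02', 'r03', 'r04']
```

With hops=0 the seed never reaches metformin, so r02 and r03 stay out of the candidate pool. That is the gap standard RAG falls into.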

Step 3: LLM Rerank

Expansion leaves you with dozens of candidate relations. Most are noise.

Expanded candidate pool (example):
r01: Metformin is the first-line drug for type 2 diabetes          ← Key
r02: Patients on metformin should have kidney function monitored   ← Key
r03: Metformin may cause gastrointestinal discomfort               ← Key
r04: Type 2 diabetes patients should have regular eye exams        ✗ Noise
r05: Insulin injection sites should be rotated                     ✗ Noise
r06: Diabetes is linked to cardiovascular disease risk             ✗ Noise
r07: Metformin is contraindicated in severe liver dysfunction      ✗ Noise (contraindication, not side effect)
r08: HbA1c is a monitoring indicator for diabetes                  ✗ Noise
r09: Sulfonylureas are second-line treatment for type 2 diabetes   ✗ Noise (second-line, not first-line)
r10: Long-term metformin use may lead to vitamin B12 deficiency    ← Key
...(more)

The system sends these candidates and the original question to an LLM: “Which relate to side effects of first-line diabetes drugs?” It’s one call with no iteration.

After LLM filtering:
✓ r01: Metformin is the first-line drug for type 2 diabetes          → Establishes the bridge: first-line drug = metformin
✓ r02: Patients on metformin should have kidney function monitored   → Side effect: kidney impact
✓ r03: Metformin may cause gastrointestinal discomfort               → Side effect: GI issues
✓ r10: Long-term metformin use may lead to vitamin B12 deficiency    → Side effect: nutrient deficiency

The selected relations cover the full chain: diabetes → metformin → kidney monitoring / GI discomfort / B12 deficiency.
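A single-call rerank could look roughly like the sketch below. The prompt wording, the JSON reply format, and the llm callable are all assumptions made for illustration; the library's real prompt and parsing may differ.

```python
import json


def build_rerank_prompt(question, candidates):
    """Format the question and every candidate relation into one prompt."""
    lines = [f"{rid}: {text}" for rid, text in candidates.items()]
    return (
        "Select the relations needed to answer the question.\n"
        f"Question: {question}\n"
        "Candidate relations:\n" + "\n".join(lines) + "\n"
        'Reply with a JSON list of relation IDs, for example ["r01", "r02"].'
    )


def rerank(question, candidates, llm):
    """One LLM call: return the selected candidate IDs in the order the model gave them."""
    reply = llm(build_rerank_prompt(question, candidates))
    selected = json.loads(reply)
    # Drop any ID the model invented that is not in the candidate pool.
    return [rid for rid in selected if rid in candidates]


candidates = {
    "r01": "Metformin is the first-line drug for type 2 diabetes",
    "r02": "Patients on metformin should have kidney function monitored",
    "r03": "Metformin may cause gastrointestinal discomfort",
    "r04": "Type 2 diabetes patients should have regular eye exams",
    "r10": "Long-term metformin use may lead to vitamin B12 deficiency",
}


def fake_llm(prompt):
    # Stand-in for a real model call so the sketch runs offline.
    return '["r01", "r02", "r03", "r10", "r99"]'


question = "What side effects should I watch for with first-line diabetes drugs?"
print(rerank(question, candidates, fake_llm))  # ['r01', 'r02', 'r03', 'r10']
```

Filtering the reply against the candidate pool keeps a hallucinated ID such as r99 from reaching the generation step.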

Step 4: Answer Generation

The system retrieves the original passages for the selected relations and sends them to the LLM.

The LLM generates from full passage text, not the trimmed triples. Triples are compressed summaries. They lack the context, caveats, and specifics the LLM needs to produce a grounded answer.
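Going from selected relations back to full passages is another ID lookup. The sketch below assumes each relation carries a list of passage IDs, as described earlier; the passage texts, function names, and prompt template are made up for illustration.

```python
def passages_for(selected_relation_ids, relation_passages, passages):
    """Return the source passages behind the selected relations, deduplicated in first-seen order."""
    ordered_ids = []
    for rid in selected_relation_ids:
        for pid in relation_passages.get(rid, []):
            if pid not in ordered_ids:
                ordered_ids.append(pid)
    return [passages[pid] for pid in ordered_ids]


def build_answer_prompt(question, passage_texts):
    """Assemble the generation prompt from full passage text, not from triples."""
    context = "\n\n".join(passage_texts)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Hypothetical data: two relations were extracted from the same passage (p02).
relation_passages = {"r01": ["p01"], "r02": ["p02"], "r03": ["p02"], "r10": ["p03"]}
passages = {
    "p01": "Metformin is the first-line drug for type 2 diabetes.",
    "p02": "Patients on metformin should have kidney function monitored; GI discomfort is common.",
    "p03": "Long-term metformin use may lead to vitamin B12 deficiency.",
}

texts = passages_for(["r01", "r02", "r03", "r10"], relation_passages, passages)
print(len(texts))  # 3, because r02 and r03 share passage p02
```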

See Vector Graph RAG in action

We also built an interactive frontend that visualizes each step. Click through the steps panel on the left and the graph updates in real time: orange for seed nodes, blue for expanded nodes, green for selected relations. It makes the retrieval flow concrete instead of abstract.

Why One Rerank Beats Multiple Iterations

Our pipeline makes two LLM calls per query: one for rerank, one for generation. Iterative systems like IRCoT and Agentic RAG run 3 to 10+ calls because they loop: retrieve, reason, retrieve again. We skip the loop because vector search and subgraph expansion cover both semantic similarity and structural connections in one pass, giving the LLM enough candidates to finish in one rerank.

In production, that’s roughly 60% lower API cost, 2–3x faster responses, and predictable latency. No surprise spikes when an agent decides to run extra rounds.

Benchmark Results

Vector Graph RAG averages 87.8% Recall@5 across three standard multi-hop QA benchmarks, the highest average among the methods we tested, including HippoRAG 2, with just Milvus and 2 LLM calls per query.

We evaluated on MuSiQue (2–4 hop, the hardest), HotpotQA (2 hop, the most widely used), and 2WikiMultiHopQA (2 hop, cross-document reasoning). The metric is Recall@5: whether the correct supporting passages appear in the top 5 retrieved results.
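For readers who want to reproduce the metric, one common way to compute Recall@k per question is the fraction of gold supporting passages that appear in the top k results. This is a sketch of that definition, which we assume here; a specific benchmark harness may score it slightly differently.

```python
def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of gold supporting passages found in the top-k retrieved results."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & gold) / len(gold)


# Hypothetical IDs: p2 is in the top 5, p3 only shows up at rank 6.
retrieved = ["p7", "p2", "p9", "p4", "p1", "p3"]
print(recall_at_k(retrieved, ["p2", "p3"], k=5))  # 0.5
print(recall_at_k(retrieved, ["p2", "p3"], k=6))  # 1.0
```

The benchmark score is then the mean of this value over all questions.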

We used the exact same pre-extracted triples from the HippoRAG repository for a fair comparison. No re-extraction, no custom preprocessing. The comparison isolates the retrieval algorithm itself.

Vector Graph RAG vs Standard (Naive) RAG

Vector Graph RAG lifts average Recall@5 from 73.4% to 87.8%, a gain of 14.4 percentage points (a 19.6% relative improvement). The per-benchmark gains below are relative improvements over standard RAG.

MuSiQue: largest gain (+31.4% relative). 2–4 hop benchmark, the hardest multi-hop questions, and exactly where subgraph expansion has the biggest impact.
2WikiMultiHopQA: sharp improvement (+27.7% relative). Cross-document reasoning, another sweet spot for subgraph expansion.
HotpotQA: smaller gain (+6.1% relative, from 90.8% to 96.3%), because standard RAG already scores 90.8% on this dataset. There is little headroom left.
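The gain figures are easy to check from the scores quoted in this post. The short calculation below separates the absolute gain (percentage points) from the relative improvement, using the average scores and the HotpotQA scores (90.8% for standard RAG, 96.3% for Vector Graph RAG).

```python
def gains(baseline, improved):
    """Return (absolute gain in percentage points, relative improvement in percent)."""
    absolute_pp = improved - baseline
    relative_pct = absolute_pp / baseline * 100
    return round(absolute_pp, 1), round(relative_pct, 1)


print(gains(73.4, 87.8))  # average:  (14.4, 19.6)
print(gains(90.8, 96.3))  # HotpotQA: (5.5, 6.1)
```

The 19.6 and 6.1 figures line up with relative improvement, not with percentage points.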

Vector Graph RAG vs State-of-the-Art Methods (SOTA)

Vector Graph RAG takes the highest average score at 87.8% against HippoRAG 2, IRCoT, and NV-Embed-v2.

Benchmark by benchmark:

HotpotQA: ties HippoRAG 2 (both 96.3%)
2WikiMultiHopQA: leads by 3.7 points (94.1% vs 90.4%)
MuSiQue (the hardest): trails by 1.7 points (73.0% vs 74.7%)

Vector Graph RAG achieves these numbers with only 2 LLM calls per query, no graph database, and no ColBERTv2. It runs on the simplest infrastructure in the comparison and still takes the highest average.

How Vector Graph RAG Compares to Other Graph RAG Approaches

Different Graph RAG approaches optimize for different problems. Vector Graph RAG is built for production multi-hop QA with predictable cost and simple infrastructure.

Microsoft GraphRAG uses hierarchical community clustering to answer global summarization questions like “What are the main themes across this corpus?” That’s a different problem than multi-hop QA.

HippoRAG 2 (Gutierrez et al., 2025) uses cognitive-inspired retrieval with ColBERTv2 token-level matching. Loading the full graph into memory limits scalability.

Iterative approaches like IRCoT keep the infrastructure simple but pay for it in LLM cost and unpredictable latency.

Vector Graph RAG targets production multi-hop QA: teams that want predictable cost and latency without adding a graph database.

When to Use Vector Graph RAG and Key Use Cases

Vector Graph RAG is built for four kinds of workloads:

Get Started with Vector Graph RAG

pip install vector-graph-rag

from vector_graph_rag import VectorGraphRAG
rag = VectorGraphRAG()  # defaults to Milvus Lite, no server needed
rag.add_texts([
    "Metformin is the first-line drug for type 2 diabetes.",
    "Patients taking metformin should have their kidney function monitored regularly.",
])
result = rag.query("What side effects should I watch for with first-line diabetes drugs?")
print(result.answer)

VectorGraphRAG() with no arguments defaults to Milvus Lite. It creates a local .db file, like SQLite. No server to start, nothing to configure.

add_texts() calls an LLM to extract triples from your text, vectorizes them, and stores everything in Milvus. query() runs the full four-step retrieval flow: seed, expand, rerank, generate.

For production, swap one URI parameter. The rest of the code stays the same:

# Local development
rag = VectorGraphRAG()

# Self-hosted Milvus
rag = VectorGraphRAG(uri="http://your-milvus-server:19530")

# Zilliz Cloud (managed Milvus, free tier available)
rag = VectorGraphRAG(uri="your-zilliz-endpoint", token="your-api-key")

To import PDFs, web pages, or Word files:

from vector_graph_rag.loaders import DocumentImporter

importer = DocumentImporter(chunk_size=1000, chunk_overlap=200)
result = importer.import_sources([
    "https://en.wikipedia.org/wiki/Metformin",
    "/path/to/clinical-guidelines.pdf",
])
rag.add_documents(result.documents, extract_triplets=True)

Conclusion

Graph RAG doesn’t need a graph database. Vector Graph RAG stores graph structure as ID references across three Milvus collections, which turns graph traversal into primary-key lookups and keeps every multi-hop query at a fixed two LLM calls.

At a glance:

Open-source Python library. Multi-hop reasoning on Milvus alone.
Three collections linked by ID. Entities (nodes), relations (edges), passages (source text). Subgraph expansion follows the IDs to discover bridge entities the query doesn’t mention.
Two LLM calls per query. One rerank, one generation. No iteration.
87.8% average Recall@5 across MuSiQue, HotpotQA, and 2WikiMultiHopQA, matching or beating HippoRAG 2 on two of three.

Try it:

GitHub: zilliztech/vector-graph-rag for the code
Docs for the full API and examples
Join the Milvus community on Discord to ask questions and share feedback
Book a Milvus Office Hours session to walk through your use case
Zilliz Cloud offers a free tier with managed Milvus if you’d rather skip infrastructure setup

FAQ

Can I do Graph RAG with just a vector database?

Yes. Vector Graph RAG stores knowledge graph structure (entities, relations, and their connections) inside three Milvus collections linked by ID cross-references. Instead of traversing edges in a graph database, it chains primary-key lookups in Milvus to expand a subgraph around seed entities. This achieves 87.8% average Recall@5 on three standard multi-hop benchmarks without any graph database infrastructure.

How does Vector Graph RAG compare to Microsoft GraphRAG?

They solve different problems. Microsoft GraphRAG uses hierarchical community clustering for global corpus summarization (“What are the main themes across these documents?”). Vector Graph RAG focuses on multi-hop question answering, where the goal is to chain specific facts across passages. Vector Graph RAG needs only Milvus and two LLM calls per query. Microsoft GraphRAG builds community summaries at indexing time, which carries higher indexing costs.

What types of questions benefit from multi-hop RAG?

Multi-hop RAG helps with questions where the answer depends on connecting information scattered across multiple passages, especially when a key entity never appears in the question. Examples include “What side effects does the first-line diabetes drug have?” (requires discovering metformin as the bridge), cross-reference lookups in legal or regulatory text, and dependency chain tracing in technical documentation. Standard RAG handles single-fact lookups well. Multi-hop RAG adds value when the reasoning path is two to four steps long.

Do I need to extract knowledge graph triples manually?

No. add_texts() and add_documents() automatically call an LLM to extract entities and relationships, vectorize them, and store them in Milvus. You can import documents from URLs, PDFs, and DOCX files using the built-in DocumentImporter. For benchmarking or migration, the library supports importing pre-extracted triples from other frameworks like HippoRAG.

How to Fix Hermes Agent’s Learning Loop with Milvus 2.6 Hybrid Search

Milvus — Tue, 16 Jun 2026 03:13:51 GMT

Hermes Agent has been everywhere lately. Built by Nous Research, Hermes is a self-hosted personal AI agent that runs on your own hardware (a $5 VPS works) and talks to you through existing chat channels like Telegram.

Its biggest highlight is a built-in learning loop: the loop creates Skills from experience, improves them during use, and searches past conversations to find reusable patterns. Other agent frameworks hand-code Skills before deployment. Hermes’s Skills grow from use, and repeated workflows become reusable with zero code change.

The catch is that Hermes’s retrieval is keyword-only. It matches exact words, but not the meaning users are after. When users use different wording across different sessions, the loop can’t connect them, and no new Skill gets written. When there are only a few hundred documents, the gap is tolerable. Past that, the loop stops learning because it can’t find its own history.

The fix is Milvus 2.6. Its hybrid search covers both meaning and exact keywords in a single query, so the loop can finally connect rephrased information across sessions. It’s light enough to fit on a small cloud server (a $5/month VPS runs it). Swapping it in doesn’t require changing Hermes — Milvus slots behind the retrieval layer, so the Learning Loop stays intact. Hermes still picks which Skill to run, and Milvus handles what to retrieve.

But the deeper payoff goes beyond better recall: once retrieval works, the Learning Loop can store the retrieval strategy itself as a Skill — not just the content it retrieves. That’s how the agent’s knowledge work compounds across sessions.

Hermes Agent Architecture: How Four-Layer Memory Powers the Skill Learning Loop

Hermes has four memory layers, and L4 Skills is the one that sets it apart.

L1 — session context, cleared when the session closes
L2 — persisted facts: project stack, team conventions, resolved decisions
L3 — SQLite FTS5 keyword search over local files
L4 — stores workflows as Markdown files. Unlike LangChain tools or AutoGPT plugins, which developers author in code before deployment, L4 Skills are self-written: they grow from what the agent actually runs, with zero developer authoring.

Why Hermes’s FTS5 Keyword Retrieval Breaks the Learning Loop

Hermes needs retrieval to trigger cross-session workflows in the first place. But its built-in L3 layer uses SQLite FTS5, which only matches literal tokens, not meaning.

When users phrase the same intent differently across sessions, FTS5 misses the match. The Learning Loop doesn’t fire. No new Skill gets written, and next time the intent comes around, the user is back to routing by hand.

Example: the knowledge base stores “asyncio event loop, async task scheduling, non-blocking I/O.” A user searches “Python concurrency.” FTS5 returns zero hits — no literal word overlap, and FTS5 has no way to see that they’re the same question.
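You can reproduce this miss with Python’s built-in sqlite3 module, assuming your SQLite build includes the FTS5 extension (most recent CPython distributions do). The table name docs and column name body are made up for this sketch; they are not Hermes’s actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.execute(
    "INSERT INTO docs (body) VALUES (?)",
    ("asyncio event loop, async task scheduling, non-blocking I/O",),
)


def fts5_hits(query):
    """Return rows whose text contains every token in the query."""
    return conn.execute("SELECT body FROM docs WHERE docs MATCH ?", (query,)).fetchall()


print(len(fts5_hits("Python concurrency")))  # 0: neither token appears in the stored text
print(len(fts5_hits("asyncio")))             # 1: literal token match
```

The stored row is about Python concurrency, but FTS5 only sees tokens, so the first query comes back empty.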

Under a couple hundred documents, the gap is tolerable. Past that, documentation uses one vocabulary, and users ask in another, and FTS5 has no bridge between them. Unretrievable content might as well not be in the knowledge base, and the Learning Loop has nothing to learn from.

How Milvus 2.6 Fixes the Retrieval Gap with Hybrid Search and Tiered Storage

Milvus 2.6 brings two upgrades that fit Hermes’s failure points. Hybrid search unblocks the Learning Loop by covering both semantic and keyword retrieval in one call. Tiered storage keeps the whole retrieval backend small enough to run on the same $5/month VPS Hermes was built for.

What Hybrid Search Solves: Finding Relevant Information

Milvus 2.6 supports running both vector retrieval (semantic) and BM25 full-text search (keyword) in a single query, then merging the two ranked lists with Reciprocal Rank Fusion (RRF).

For example: ask “what is the principle of asyncio”, and vector retrieval hits semantically related content. Ask “where is the find_similar_task function defined”, and BM25 precisely matches the function name in code. For questions that mix both, such as asking about a named function inside a particular type of task, hybrid search returns the right result in one call, with no hand-written routing logic.

For Hermes, this is what unblocks the Learning Loop. When a second session rephrases the intent, vector retrieval catches the semantic match FTS5 missed. The loop fires, and a new Skill gets written.
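Reciprocal Rank Fusion itself is simple enough to sketch in a few lines. The version below uses the standard formula, where each document scores the sum of 1 / (k + rank) over the lists it appears in, with ranks starting at 1. It illustrates the technique and is not Milvus’s internal code; the document IDs are made up.

```python
def rrf_fuse(ranked_lists, k=60, limit=5):
    """Merge several ranked lists of IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:limit]


dense_hits = ["a", "b", "c"]  # semantic ranking
bm25_hits = ["b", "d", "a"]   # keyword ranking
print(rrf_fuse([dense_hits, bm25_hits]))  # ['b', 'a', 'd', 'c']
```

Documents that rank well in both lists (b and a) rise above documents that appear in only one, with no need to put the dense and BM25 scores on the same scale.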

What Tiered Storage Solves: Cost

A naive vector database would want the full embedding index in RAM, which pushes personal deployments toward bigger, more expensive infrastructure. Milvus 2.6 avoids that with three-tier storage, moving entries between tiers based on access frequency:

Hot — in memory
Warm — on SSD
Cold — on object storage

Only hot data stays resident. A 500-document knowledge base fits under 2 GB of RAM. The whole retrieval stack runs on the same $5/month VPS Hermes targets, with no infrastructure upgrade needed.

Hermes + Milvus: System Architecture

Hermes picks which Skill to run. Milvus handles what to retrieve. The two systems stay separate, and Hermes’s interface doesn’t change.

The flow:

Hermes identifies the user’s intent and routes to a Skill.
The Skill calls a retrieval script through the terminal tool.
The script hits Milvus, runs hybrid search, and returns ranked chunks with source metadata.
Hermes composes the answer. Memory records the workflow.
When the same pattern repeats across sessions, the Learning Loop writes a new Skill.

How to Install Hermes and Milvus 2.6

Install Hermes and Milvus 2.6 Standalone, then create a collection with dense and BM25 fields. That’s the full setup before the Learning Loop can fire.

Install Hermes

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

Run hermes to enter the interactive init wizard:

LLM provider — OpenAI, Anthropic, OpenRouter (OpenRouter has free models)
Channel — this walkthrough uses a Lark bot

Run Milvus 2.6 Standalone

Single-node standalone is enough for a personal agent:

curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh \
-o standalone_embed.sh
bash standalone_embed.sh start
# Verify service status
docker ps | grep milvus

Create the Collection

Schema design caps what retrieval can do. This schema runs dense vectors and BM25 sparse vectors side by side:

from pymilvus import MilvusClient, DataType, Function, FunctionType
client = MilvusClient(
    uri="http://192.168.x.x:19530",
)
schema = client.create_schema(auto_id=True, enable_dynamic_field=True)
schema.add_field("id", DataType.INT64, is_primary=True)
# Raw text (for BM25 full-text search)
schema.add_field(
    "text",
    DataType.VARCHAR,
    max_length=8192,
    enable_analyzer=True,
    enable_match=True
)
# Dense vector (semantic search)
schema.add_field("dense_vector", DataType.FLOAT_VECTOR, dim=1536)
# Sparse vector (BM25 auto-generated, Milvus 2.6 feature)
schema.add_field("sparse_vector", DataType.SPARSE_FLOAT_VECTOR)
schema.add_field("source", DataType.VARCHAR, max_length=512)
schema.add_field("chunk_index", DataType.INT32)
# Tell Milvus to auto-convert text to sparse_vector via BM25
bm25_function = Function(
    name="text_bm25",
    function_type=FunctionType.BM25,
    input_field_names=["text"],
    output_field_names=["sparse_vector"],
)
schema.add_function(bm25_function)
index_params = client.prepare_index_params()
# HNSW graph index (dense vector)
index_params.add_index(
    field_name="dense_vector",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 256}
)
# BM25 inverted index (sparse vector)
index_params.add_index(
    field_name="sparse_vector",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="BM25"
)
client.create_collection(
    collection_name="hermes_milvus",
    schema=schema,
    index_params=index_params
)
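With the collection in place, documents still need to be chunked, embedded, and inserted. The sketch below is one way to do that under assumptions: chunk_text, build_rows, and ingest are hypothetical helpers, the chunk sizes are arbitrary, and embed is any callable that returns a 1536-dimensional vector. Rows leave out id (auto-generated) and sparse_vector (filled in by the BM25 function).

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def build_rows(text, source, embed, size=1000, overlap=200):
    """Build insert-ready rows that match the collection schema."""
    rows = []
    for index, chunk in enumerate(chunk_text(text, size, overlap)):
        rows.append({
            "text": chunk,                 # BM25 input; Milvus derives sparse_vector from it
            "dense_vector": embed(chunk),  # must be 1536-dimensional to match the schema
            "source": source,
            "chunk_index": index,
        })
    return rows


def ingest(client, text, source, embed):
    """Insert one document; `client` is a connected pymilvus MilvusClient."""
    rows = build_rows(text, source, embed)
    return client.insert(collection_name="hermes_milvus", data=rows)


def fake_embed(chunk):
    # Stand-in embedding so the sketch runs offline; use a real model in practice.
    return [0.0] * 1536


rows = build_rows("x" * 2500, "notes/asyncio.md", fake_embed)
print(len(rows))  # 4 chunks, starting at offsets 0, 800, 1600, 2400
```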

Hybrid Search Script

import sys, json
from openai import OpenAI
from pymilvus import MilvusClient, AnnSearchRequest, RRFRanker
client = MilvusClient("http://192.168.x.x:19530")
oai    = OpenAI()
COLLECTION = "hermes_milvus"
def hybrid_search(query: str, top_k: int = 5) -> list[dict]:
    # 1. Vectorize query
    dense_vec = oai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    # 2. Dense vector retrieval (semantic relevance)
    dense_req = AnnSearchRequest(
        data=[dense_vec],
        anns_field="dense_vector",
        param={"metric_type": "COSINE", "params": {"ef": 128}},
        limit=top_k * 2       # Widen candidate set, let RRF do final ranking
    )
    # 3. BM25 sparse vector retrieval (exact term matching)
    bm25_req = AnnSearchRequest(
        data=[query],
        anns_field="sparse_vector",
        param={"metric_type": "BM25"},
        limit=top_k * 2
    )
    # 4. RRF fusion ranking
    results = client.hybrid_search(
        collection_name=COLLECTION,
        reqs=[dense_req, bm25_req],
        ranker=RRFRanker(k=60),
        limit=top_k,
        output_fields=["text", "source", "doc_type"]
    )
    return [
        {
            "text":     r.entity.get("text"),
            "source":   r.entity.get("source"),
            "doc_type": r.entity.get("doc_type"),
            "score":    round(r.distance, 4)
        }
        for r in results[0]
    ]
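if __name__ == "__main__":
    # Expect the search query as the first argument and an optional top_k as the second
    if len(sys.argv) < 2 or not sys.argv[1].strip():
        sys.exit("usage: pass the search query as the first argument, optional top_k as the second")
    query  = sys.argv[1]
    top_k  = int(sys.argv[2]) if len(sys.argv) > 2 else 5
    output = hybrid_search(query, top_k)
    print(json.dumps(output, ensure_ascii=False, indent=2))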

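Both requests fetch top_k * 2 candidates so RRF has a wider pool to rank from. text-embedding-3-small is a low-cost OpenAI embedding model whose 1536-dimensional output matches the schema’s dim=1536. If the budget allows, swap in text-embedding-3-large, and update dim in the schema to match, since that model returns 3072 dimensions by default.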

With the environment and knowledge base ready, the next section puts the Learning Loop to the test.

Hermes Skill Auto-Generation in Practice

Two sessions show the Learning Loop in action. In the first, the user names the script by hand. In the second, a new session asks the same question without naming the script. Hermes picks up the pattern and writes three Skills.

Session 1: Call the Script by Hand

Open Hermes in Lark. Give it the script path and the retrieval target. Hermes invokes the terminal tool, runs the script, and returns the answer with source attribution. No Skill exists yet. This is a plain tool call.

Session 2: Ask Without Naming the Script

Clear the conversation. Start fresh. Ask the same category of question without mentioning the script or path.

Memory Writes First, Skill Follows

The Learning Loop records the workflow (script, arguments, return shape) and returns the answer. Memory holds the trace; no Skill exists yet.

The second session’s match tells the loop the pattern is worth keeping. When it fires, three Skills get written:

From this point on, users stop naming Skills. Hermes infers intent, routes to the Skill, pulls the relevant chunks from Memory, and writes the answer. There’s no Skill selector in the prompt.

Most RAG (retrieval-augmented generation) systems solve the storing-and-fetching problem, but the fetch logic itself is hard-coded in application code. Ask in a different way or in a new scenario, and retrieval breaks. Hermes stores the fetch strategy as a Skill, which means the fetch path becomes a document you can read, edit, and version. The line 💾 Memory updated · Skill 'hybrid-search-doc-qa' created isn’t a setup-complete marker. It’s the Agent committing a behavior pattern to long-term memory.

Hermes vs. OpenClaw: Accumulation vs. Orchestration

Hermes and OpenClaw answer different problems. Hermes is built for a single agent that accumulates memory and skills across sessions. OpenClaw is built for breaking a complex task into pieces and handing each piece to a specialized agent.

OpenClaw’s strength is orchestration. It optimizes for how much of a task gets done automatically. Hermes’s strength is accumulation: a single agent that remembers across sessions, with skills that grow from use. Hermes optimizes for long-term context and domain experience.

The two frameworks stack. Hermes ships a one-step migration path that pulls ~/.openclaw memory and skills into Hermes’s memory layers. An orchestration stack can sit on top, with an accumulation agent underneath.

For the OpenClaw side of the split, see What Is OpenClaw? Complete Guide to the Open-Source AI Agent on the Milvus blog.

Conclusion

Hermes’s Learning Loop turns repeated workflows into reusable Skills, but only if retrieval can connect them across sessions. FTS5 keyword search can’t. Milvus 2.6 hybrid search can: dense vectors handle meaning, BM25 handles exact keywords, RRF merges both, and tiered storage keeps the whole stack on a $5/month VPS.

The bigger point is that once retrieval works, the agent doesn’t just store better answers; it stores better retrieval strategies as Skills. The fetch path becomes a versionable document that improves with use. That’s what separates an agent that accumulates domain expertise from one that starts fresh every session. For a comparison of how other agents handle (or fail to handle) this problem, see Claude Code’s Memory System Explained.

Get Started

Try the tools in this article:

Hermes Agent on GitHub — install script, provider setup, and channel configuration used above.
Milvus 2.6 Standalone Quickstart — single-node Docker deploy for the knowledge-base backend.
Milvus Hybrid Search Tutorial — full dense + BM25 + RRF example matching the script in this post.

Got questions about Hermes + Milvus hybrid search?

Join the Milvus Discord to ask about hybrid search, tiered storage, or Skill-routing patterns — other developers are building similar stacks.
Book a Milvus Office Hours session to walk through your own agent + knowledge-base setup with the Milvus team.

Want to skip the self-host?

Sign up or sign in to Zilliz Cloud — managed Milvus with hybrid search and tiered storage out of the box. New work-email accounts get $100 in free credits.

Frequently Asked Questions

How does Hermes Agent’s Skill Learning Loop actually work?

Hermes records every workflow it runs — the script called, arguments passed, and return shape — as a memory trace. When the same pattern appears across two or more sessions, the Learning Loop fires and writes a reusable Skill: a Markdown file that captures the workflow as a repeatable procedure. From that point on, Hermes routes to the Skill by intent alone, without the user naming it. The critical dependency is retrieval — the loop only fires if it can find the earlier session’s trace, which is why keyword-only search becomes a bottleneck at scale.

What’s the difference between hybrid search and vector-only search for agent memory?

Vector-only search handles meaning well but misses exact matches. If a developer pastes an error string like ConnectionResetError or a function name like find_similar_task, a pure vector search might return semantically related but wrong results. Hybrid search combines dense vectors (semantic) with BM25 (keyword) and merges the two result sets with Reciprocal Rank Fusion. For agent memory — where queries range from vague intent (“Python concurrency”) to exact symbols — hybrid search covers both ends in a single call without routing logic in your application layer.
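
To make the merge step concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is not Milvus's implementation; the document IDs and both result lists are made up for illustration, and k=60 is simply a commonly used default.

```python
from collections import defaultdict

def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists for one query.
dense_hits = ["async_guide", "threading_faq", "conn_reset_fix"]   # semantic
bm25_hits = ["conn_reset_fix", "async_guide", "socket_errors"]    # keyword

fused = rrf_merge([dense_hits, bm25_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is why an exact-symbol match and a semantic match can both survive the merge.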

Can I use Milvus hybrid search with AI agents other than Hermes?

Yes. The integration pattern is generic: the agent calls a retrieval script, the script queries Milvus, and results return as ranked chunks with source metadata. Any agent framework that supports tool calls or shell execution can use the same approach. Hermes happens to be a strong fit because its Learning Loop specifically depends on cross-session retrieval to fire, but the Milvus side is agent-agnostic — it doesn’t know or care which agent is calling it.

How much does a self-hosted Milvus + Hermes setup cost per month?

A single-node Milvus 2.6 Standalone on a 2-core / 4 GB VPS with tiered storage runs about $5/month. OpenAI text-embedding-3-small costs $0.02 per 1M tokens — a few cents per month for a personal knowledge base. LLM inference dominates total cost and scales with usage, not with the retrieval stack.
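
As a rough back-of-the-envelope check, the arithmetic looks like this. The two prices are the ones quoted above; the workload (500 notes of about 800 tokens, re-embedded five times a month) is a hypothetical example, not a measurement.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, embedding price quoted above
VPS_MONTHLY = 5.00               # USD, VPS price quoted above

def embedding_cost(tokens: int) -> float:
    """Embedding spend in USD for a given token count."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# Hypothetical workload: 500 notes x ~800 tokens, re-embedded 5 times a month.
monthly_tokens = 500 * 800 * 5
cost = embedding_cost(monthly_tokens)
total = VPS_MONTHLY + cost

print(f"{monthly_tokens:,} tokens -> ${cost:.2f} embeddings, ${total:.2f} retrieval stack")
```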

Claude Context: Reduce Claude Code Token Usage with Milvus-Powered Code Retrieval

Milvus — Mon, 15 Jun 2026 02:43:05 GMT

Large context windows make AI coding agents feel limitless, right up until they start reading half your repository to answer one question. For many Claude Code users, the expensive part is not just model reasoning. It is the retrieval loop: search a keyword, read a file, search again, read more files, and keep paying for irrelevant context.

Claude Context is an open-source code retrieval MCP server that gives Claude Code and other AI coding agents a better way to find relevant code. It indexes your repository, stores searchable code chunks in a vector database, and uses hybrid retrieval so the agent can pull in the code it actually needs instead of flooding the prompt with grep results.

In our benchmarks, Claude Context reduced token consumption by 39.4% on average and cut tool calls by 36.1% while preserving retrieval quality. This post explains why grep-style retrieval wastes context, how Claude Context works under the hood, and how it compares with a baseline workflow on real debugging tasks.

Claude Context GitHub repository trending and passing 10,000 stars

Why grep-style code retrieval burns tokens in AI coding agents

An AI coding agent can only write useful code if it understands the codebase around the task: function call paths, naming conventions, related tests, data models, and historical implementation patterns. A large context window helps, but it does not solve the retrieval problem. If the wrong files enter the context, the model still wastes tokens and may reason from irrelevant code.

Code retrieval usually falls into two broad patterns:

This is the same distinction developers see in RAG application design: literal matching is useful, but it is rarely enough when meaning matters. A function named compute_final_cost() may be relevant to a query about calculate_total_price() even if the exact words do not match. That is where semantic search helps.

In one debugging run, Claude Code repeatedly searched and read files before locating the right area. After several minutes, only a small fraction of the code it had consumed was relevant.

Claude Code grep-style search spending time on irrelevant file reads

That pattern is common enough that developers complain about it publicly: the agent can be smart, but the context retrieval loop still feels expensive and imprecise.

Developer comment about Claude Code context and token usage

Grep-style retrieval fails in three predictable ways:

Information overload: large repositories produce many literal matches, and most are not useful for the current task.
Semantic blindness: grep matches strings, not intent, behavior, or equivalent implementation patterns.
Context loss: line-level matches do not automatically include the surrounding class, dependencies, tests, or call graph.

A better code retrieval layer needs to combine keyword precision with semantic understanding, then return complete enough chunks for the model to reason about the code.

What is Claude Context?

Claude Context is an open-source Model Context Protocol server for code retrieval. It connects AI coding tools to a Milvus-backed code index, so an agent can search a repository by meaning instead of relying only on literal text search.

The goal is simple: when the agent asks for context, return the smallest useful set of code chunks. Claude Context does this by parsing the codebase, generating embeddings, storing chunks in the Milvus vector database, and exposing retrieval through MCP-compatible tools.

Because Claude Context is exposed through MCP, it can work with Claude Code, Gemini CLI, Cursor-style MCP hosts, and other MCP-compatible environments. The same core retrieval layer can support multiple agent interfaces.

How Claude Context works under the hood

Claude Context has two main layers: a reusable core module and integration modules. The core handles parsing, chunking, indexing, search, and incremental sync. The upper layer exposes those capabilities through MCP and editor integrations.

Claude Context architecture showing MCP integrations, core module, embedding provider, and vector database

How does MCP connect Claude Context to coding agents?

MCP provides the interface between the LLM host and external tools. By exposing Claude Context as an MCP server, the retrieval layer stays independent from any one IDE or coding assistant. The agent calls a search tool; Claude Context handles the code index and returns relevant chunks.

If you want to understand the broader pattern, the MCP + Milvus guide shows how MCP can connect AI tools to vector database operations.

Why use Milvus for code retrieval?

Code retrieval needs fast vector search, metadata filtering, and enough scale to handle large repositories. Milvus is designed for high-performance vector search and can support dense vectors, sparse vectors, and reranking workflows. For teams building retrieval-heavy agent systems, the multi-vector hybrid search docs and PyMilvus hybrid_search API show the same underlying retrieval pattern used in production systems.

Claude Context can use Zilliz Cloud as the managed Milvus backend, which avoids running and scaling the vector database yourself. The same architecture can also be adapted to self-managed Milvus deployments.

Which embedding providers does Claude Context support?

Claude Context supports multiple embedding options:

For related Milvus workflows, see the Milvus embedding overview, OpenAI embedding integration, Voyage embedding integration, and examples of running Ollama with Milvus.

Why is the core library written in TypeScript?

Claude Context is written in TypeScript because many coding-agent integrations, editor plugins, and MCP hosts are already TypeScript-heavy. Keeping the retrieval core in TypeScript makes it easier to integrate with application-layer tooling while still exposing a clean API.

The core module abstracts the vector database and embedding provider into a composable Context object:

import { Context, MilvusVectorDatabase, OpenAIEmbedding } from '@zilliz/claude-context-core';
// Initialize embedding provider
const embedding = new OpenAIEmbedding(...);
// Initialize vector database
const vectorDatabase = new MilvusVectorDatabase(...);
// Create context instance
const context = new Context({embedding, vectorDatabase});
// Index your codebase with progress tracking
const stats = await context.indexCodebase('./your-project');
// Perform semantic search
const results = await context.semanticSearch('./your-project', 'vector database operations');

How Claude Context chunks code and keeps indexes fresh

Chunking and incremental updates determine whether a code retrieval system is usable in practice. If chunks are too small, the model loses context. If chunks are too large, the retrieval system returns noise. If indexing is too slow, developers stop using it.

Claude Context handles this with AST-based chunking, a fallback text splitter, and Merkle tree-based change detection.

How does AST-based code chunking preserve context?

AST chunking is the primary strategy. Instead of splitting files by line count or character count, Claude Context parses code structure and chunks around semantic units such as functions, classes, and methods.
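
The idea can be sketched in a few lines. This example uses Python's standard ast module as a stand-in and only handles top-level functions and classes in Python source; it is an illustration of the technique, not the parser or chunk schema Claude Context actually uses.

```python
import ast

def chunk_python_source(source: str):
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+).
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": text,
            })
    return chunks

sample = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return area(self.r)
'''

chunks = chunk_python_source(sample)
for c in chunks:
    print(c["kind"], c["name"], c["start_line"], c["end_line"])
```

Each chunk is a whole function or class with its line range attached, rather than an arbitrary window of lines.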

That gives each chunk three useful properties:

What happens when AST parsing fails?

For languages or files that AST parsing cannot handle, Claude Context falls back to LangChain’s RecursiveCharacterTextSplitter. It is less precise than AST chunking, but it prevents indexing from failing on unsupported input.

// Use recursive character splitting to preserve code structure
const splitter = RecursiveCharacterTextSplitter.fromLanguage(language, {
    chunkSize: 1000,
    chunkOverlap: 200,
});

How does Claude Context avoid re-indexing the whole repository?

Re-indexing an entire repository after every change is too expensive. Claude Context uses a Merkle tree to detect exactly what changed.

A Merkle tree assigns each file a hash, derives each directory hash from its children, and rolls the whole repository into a root hash. If the root hash is unchanged, Claude Context can skip indexing. If the root changes, it walks down the tree to find the changed files and re-embeds only those files.
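
Here is a minimal in-memory sketch of that idea. It assumes the repository is given as a dict of path to file bytes, and it simplifies the "walk down the tree" step into a comparison of the two file hash tables once the root hashes differ. The function names are hypothetical and this is not Claude Context's actual code.

```python
import hashlib

def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_merkle(files: dict):
    """files maps 'dir/sub/name.py' -> bytes. Returns (root_hash, file_hashes)."""
    file_hashes = {path: sha(content) for path, content in files.items()}

    def dir_hash(prefix: str) -> str:
        # Collect the immediate children of the directory named by prefix.
        children = {}
        for path, h in file_hashes.items():
            if not path.startswith(prefix):
                continue
            name, sep, _ = path[len(prefix):].partition("/")
            if sep:
                children[name + "/"] = None  # subdirectory, hashed recursively
            else:
                children[name] = h           # file, hash already known
        parts = []
        for name in sorted(children):
            h = children[name] if children[name] is not None else dir_hash(prefix + name)
            parts.append(f"{name}:{h}")
        # A directory's hash is derived from its children's names and hashes.
        return sha("\n".join(parts).encode())

    return dir_hash(""), file_hashes

def changed_files(old, new):
    """Compare two (root_hash, file_hashes) snapshots."""
    (old_root, old_hashes), (new_root, new_hashes) = old, new
    if old_root == new_root:
        # Unchanged root: nothing to re-index.
        return {"added": [], "modified": [], "removed": []}
    return {
        "added": sorted(p for p in new_hashes if p not in old_hashes),
        "modified": sorted(p for p in new_hashes
                           if p in old_hashes and new_hashes[p] != old_hashes[p]),
        "removed": sorted(p for p in old_hashes if p not in new_hashes),
    }

before = {"README.md": b"hi\n", "src/a.py": b"print(1)\n", "src/util/b.py": b"x = 1\n"}
after = {**before, "src/util/b.py": b"x = 2\n", "src/c.py": b"y = 0\n"}

old, new = build_merkle(before), build_merkle(after)
diff = changed_files(old, new)
print(diff)
```

Only the files listed under added and modified would need re-embedding; everything else keeps its existing vectors.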

Merkle tree change detection comparing unchanged and changed file hashes

Sync runs in three stages:

Local sync state is stored under ~/.context/merkle/, so Claude Context can restore the file hash table and serialized Merkle tree after a restart.

What happens when Claude Code uses Claude Context?

Setup is a single command before launching Claude Code:

claude mcp add claude-context -e OPENAI_API_KEY=your-openai-api-key -e MILVUS_TOKEN=your-zilliz-cloud-api-key -- npx @zilliz/claude-context-mcp@latest

After indexing the repository, Claude Code can call Claude Context when it needs codebase context. In the same bug-finding scenario that previously burned time on grep and file reads, Claude Context found the exact file and line number with a full explanation.

The tool is not limited to bug hunting. It also helps with refactoring, duplicate code detection, issue resolution, test generation, and any task where the agent needs accurate repository context.

At equivalent recall, Claude Context reduced token consumption by 39.4% and reduced tool calls by 36.1% in our benchmark. That matters because tool calls and irrelevant file reads often dominate the cost of coding-agent workflows.

Benchmark chart showing Claude Context reducing token usage and tool calls versus baseline

The project now has more than 10,000 GitHub stars, and the repository includes the full benchmark details and package links.

Claude Context GitHub star history showing rapid growth

How does Claude Context compare with grep on real bugs?

The benchmark compares pure text search with Milvus-backed code retrieval on real debugging tasks. The difference is not just fewer tokens. Claude Context changes the agent’s search path: it starts closer to the implementation that needs to change.

Case 1: Django YearLookup bug

Problem description: In the Django framework, the YearLookup query optimization breaks __iso_year filtering. When using the __iso_year filter, the YearLookup class incorrectly applies the standard BETWEEN optimization — valid for calendar years, but not for ISO week-numbering years.

# This should use EXTRACT('isoyear' FROM ...) but incorrectly uses BETWEEN
DTModel.objects.filter(start_date__iso_year=2020)
# Generated: WHERE "start_date" BETWEEN 2020-01-01 AND 2020-12-31
# Should be: WHERE EXTRACT('isoyear' FROM "start_date") = 2020

Baseline (grep):

🔧 directory_tree()
⚙️ Result: Retrieved 3000+ lines of directory structure (~50k tokens)
   Problem: Massive information overload, no direct relevance
🔧 search_text('ExtractIsoYear')
⚙️ Result: Found 21 matches across multiple files:
   - django/db/models/functions/__init__.py:5 (import statement)
   - django/db/models/functions/__init__.py:31 (export list)  
   - django/db/models/functions/datetime.py:93 (ExtractIsoYear class)
   Problem: Most are unrelated imports and registrations
🔧 edit_file('django/db/models/functions/datetime.py')
⚙️ Modified multiple registration statements, but this is the wrong direction for a fix

The text search focused on ExtractIsoYear registration instead of the optimization logic in YearLookup.

Claude Context:

🔧 search_code('YearLookup')
⚙️ Found 10 results for query: "YearLookup" across the codebase
   1. Code snippet (python) [repo__django__django]
      Location: django/db/models/lookups.py:568-577
      Context: YearExact class and get_bound_params method
   2. Code snippet (python) [repo__django__django]  
      Location: django/db/models/lookups.py:538-569
      Context: YearLookup base class and year_lookup_bounds method
🔧 edit_file(django/db/models/lookups.py)
⚙️ Successfully modified the core optimization logic, adding ISO year handling

Semantic search understood YearLookup as the core concept and went straight to the right class.

Django YearLookup benchmark table showing 93 percent fewer tokens with Claude Context

Result: 93% fewer tokens.

Case 2: Xarray swap_dims bug

Problem description: The Xarray library’s .swap_dims() method unexpectedly mutates the original object, violating the expectation of immutability.

import numpy as np
import xarray as xr
nz = 11
ds = xr.Dataset({
    "y": ("z", np.random.rand(nz)),
    "lev": ("z", np.arange(nz) * 10),
})
# This shouldn't modify ds, but it does
ds2 = ds.swap_dims(z="lev").rename_dims(lev="z").reset_index("lev").reset_coords()
ds2.swap_dims(z='lev')  # this unexpectedly modifies ds2['lev']
# ds2['lev'] now has dimension 'lev' instead of 'z'

Baseline (grep):

🔧 list_directory('repo__pydata__xarray')
⚙️ Result: Retrieved directory structure across multiple subdirectories
   Problem: Basic directory navigation, no targeted search
🔧 search_text('swap_dims')
⚙️ Result: Found 21 matches across core files:
   - xarray/core/common.py:600 (Dataset.swap_dims reference)
   - xarray/core/dataarray.py:753 (".rename() or .swap_dims() instead.")
   - xarray/core/dataarray.py:2026 (DataArray.swap_dims reference)
   - xarray/core/dataarray.py:2044 (def swap_dims method)
   Problem: Many matches scattered across different contexts
🔧 read_file('xarray/core/dataarray.py', offset=2000, limit=40)
⚙️ Result: Read lines 2001-2040 showing rename method documentation
🔧 read_file('xarray/core/dataset.py', offset=3600, limit=40) 
⚙️ Result: Read lines 3601-3640 showing rename_dims method

The baseline spent time navigating directories and reading nearby code before locating the actual implementation path.

Claude Context:

🔧 search_code('.swap_dims()')
⚙️ Found 10 results for query: ".swap_dims()" across the codebase
   1. Code snippet (python) [repo__pydata__xarray]
      Location: xarray/tests/test_formatting_html.py:67-76
   2. Code snippet (python) [repo__pydata__xarray]
      Location: xarray/tests/test_dataset.py:2926-2979
🔧 edit_file('xarray/core/dataset.py')
⚙️ Successfully modified the file, added dimension handling logic
🔧 edit_file('xarray/core/dataarray.py')
⚙️ Successfully modified the file, ensuring the original DataArray is not mutated

Semantic search located the relevant swap_dims() implementation and related context faster.

Xarray swap_dims benchmark table showing 62 percent fewer tokens with Claude Context

Result: 62% fewer tokens.

Get started with Claude Context

If you want to try the exact tool from this post, start with the Claude Context GitHub repository and the Claude Context MCP package. The repository includes setup instructions, benchmarks, and the core TypeScript packages.

If you want to understand or customize the retrieval layer, these resources are useful next steps:

Learn the vector database basics with the Milvus Quickstart.
Explore Milvus full text search and the LangChain full-text search tutorial if you want to combine BM25-style search with dense vectors.
Review open-source vector search engines if you are comparing infrastructure options.
Try the Zilliz Cloud Plugin for Claude Code if you want vector database operations directly inside the Claude Code workflow.

For help with Milvus or code retrieval architecture, join the Milvus community or book Milvus Office Hours for one-on-one guidance. If you would rather skip infrastructure setup, sign up for Zilliz Cloud or sign in to Zilliz Cloud and use managed Milvus as the backend.

Frequently Asked Questions

Why does Claude Code use so many tokens on some coding tasks?

Claude Code can use many tokens when a task requires repeated search and file-reading loops across a large repository. If the agent searches by keyword, reads irrelevant files, and then searches again, every file read adds tokens even when the code is not useful for the task.

How does Claude Context reduce Claude Code token usage?

Claude Context reduces token usage by searching a Milvus-backed code index before the agent reads files. It retrieves relevant code chunks with hybrid search, so Claude Code can inspect fewer files and spend more of its context window on code that actually matters.

Is Claude Context only for Claude Code?

No. Claude Context is exposed as an MCP server, so it can work with any coding tool that supports MCP. Claude Code is the main example in this post, but the same retrieval layer can support other MCP-compatible IDEs and agent workflows.

Do I need Zilliz Cloud to use Claude Context?

Claude Context can use Zilliz Cloud as a managed Milvus backend, which is the easiest path if you do not want to operate vector database infrastructure. The same retrieval architecture is based on Milvus concepts, so teams can also adapt it to self-managed Milvus deployments.

How to Add Long-Term Memory to Anthropic’s Managed Agents with Milvus

Milvus — Mon, 15 Jun 2026 02:42:41 GMT

Anthropic’s Managed Agents make agent infrastructure resilient. A 200-step task now survives a harness crash, a sandbox timeout, or a mid-flight infrastructure change without human intervention, and Anthropic reports p50 time-to-first-token dropped roughly 60% and p95 dropped over 90% after the decoupling.

What reliability doesn’t solve is memory. A 200-step code migration that hits a new dependency conflict on step 201 can’t efficiently look back at how it handled the last one. An agent running vulnerability scans for one customer has no idea that another agent already solved the same case an hour ago. Every session starts on a blank page, and parallel brains have no access to what the others have already worked out.

The fix is to pair the Milvus vector database with Anthropic’s Managed Agents: semantic recall within a session, and a shared vector memory layer across sessions. The session contract stays untouched, the harness gets one new layer, and long-horizon agent tasks get qualitatively different capabilities.

What Managed Agents Solved (and What They Didn’t)

Managed Agents solved reliability by decoupling the agent into three independent modules. What it didn’t solve is memory, either as semantic recall inside a single session or as shared experience across parallel sessions. Here’s what got decoupled, and where the memory gap sits inside that decoupled design.

The reframe that makes this design work is stated explicitly in Anthropic’s post:

“The session is not Claude’s context window.”

The context window is ephemeral: bounded in tokens, reconstructed per model call, and discarded when the call returns. The session is durable, stored outside the harness, and represents the system of record for the entire task.

When a harness crashes, the platform starts a fresh one with wake(sessionId). The new harness reads the event log via getSession(id), and the task picks up from the last recorded step, with no custom recovery logic to write and no session-level babysitting to operate.
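
The recovery flow can be sketched with an in-memory stand-in for the session log. Everything here is hypothetical: the SessionStore class, the event shape, and the Python-style wake and get_session names only mirror the calls named above and are not Anthropic's API.

```python
class SessionStore:
    """In-memory stand-in for a durable, append-only session log."""

    def __init__(self):
        self._logs = {}

    def append(self, session_id, event):
        self._logs.setdefault(session_id, []).append(event)

    def get_session(self, session_id):
        # Return a copy so callers cannot mutate the log.
        return list(self._logs.get(session_id, []))

def wake(store, session_id):
    """Start a fresh harness: read the log and return the next step to run."""
    events = store.get_session(session_id)
    last_step = max((e["step"] for e in events), default=0)
    return last_step + 1

store = SessionStore()
for step in (1, 2, 3):
    store.append("task-42", {"step": step, "type": "tool_result"})

# The harness crashes here. Nothing in the harness process survives,
# but the log lives outside it, so a new harness can pick up at step 4.
next_step = wake(store, "task-42")
print(next_step)
```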

What the Managed Agents post doesn’t address, and doesn’t claim to, is what the agent does when it needs to remember anything. Two gaps show up the moment you push real workloads through the architecture. One lives inside a single session; the other lives across sessions.

Problem 1: Why Linear Session Logs Fail Past a Few Hundred Steps

Linear session logs fail past a few hundred steps because sequential reads and semantic search are fundamentally different workloads, and the getEvents() API serves only the first one. Slicing by position or seeking to a timestamp is enough to answer “where did this session leave off.” It is not enough to answer the question an agent will predictably need on any long task: have we seen this kind of problem before, and what did we do about it?

Consider a code migration at step 200 that hits a new dependency conflict. The natural move is to look back. Did the agent run into something similar earlier in this same task? What approach was tried? Did it hold, or did it regress something else downstream?

With getEvents() there are two ways to answer that, and both are bad:

The session is good for recovery and audit, but it was not built with an index that supports “have I seen this before.” Long-horizon tasks are where that question stops being optional.

Solution 1: How to Add Semantic Memory to a Managed Agent’s Session

Add a Milvus collection alongside the session log and dual-write from emitEvent. The session contract stays untouched, and the harness gains semantic query over its own past.

Anthropic’s design leaves headroom for exactly this. Their post states that “any fetched events can also be transformed in the harness before being passed to Claude’s context window. These transformations can be whatever the harness encodes, including context organization… and context engineering.” Context engineering lives in the harness; the session only has to guarantee durability and queryability.

The pattern: every time emitEvent fires, the harness also computes a vector embedding for events worth indexing and inserts them into a Milvus collection.

from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="http://localhost:19530")

# Only index high-signal events. Tool retries and intermediate states are noise.
INDEXABLE_EVENT_TYPES = {"decision", "strategy", "resolution", "error_handling"}

async def emit_event(session_id: str, event: dict):
    # Original path: append to the session event stream.
    await session_store.append(session_id, event)

    # Extended path: embed the event content and insert into Milvus.
    if event["type"] in INDEXABLE_EVENT_TYPES:
        embedding = await embed(event["content"])
        milvus_client.insert(
            collection_name="agent_memory",
            data=[{
                "vector":     embedding,
                "session_id": session_id,
                "step":       event["step"],
                "event_type": event["type"],
                "content":    event["content"],
            }]
        )

When the agent hits step 200 and needs to recall prior decisions, the query is a vector search scoped to that session:

async def recall_similar(query: str, session_id: str, top_k: int = 5):
    query_vector = await embed(query)
    results = milvus_client.search(
        collection_name="agent_memory",
        data=[query_vector],
        filter=f'session_id == "{session_id}"',
        limit=top_k,
        output_fields=["step", "event_type", "content"]
    )
    return results[0]  # top_k most relevant past events

Three production details matter before this ships:

Pick what to index. Not every event deserves an embedding. Tool-call intermediate states, retry logs, and repetitive status events pollute retrieval quality faster than they improve it. The INDEXABLE_EVENT_TYPES policy is task-dependent, not global.
Define the consistency boundary. If the harness crashes between the session append and the Milvus insert, one layer is briefly ahead of the other. The window is small but real. Pick a reconciliation path (retry on restart, write-ahead log, or eventual reconciliation) rather than hoping.
Control embedding spend. A 200-step session that calls an external embedding API synchronously on every step produces an invoice nobody planned for. Queue embeddings and send them asynchronously in batches.
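
For the consistency boundary, one possible reconciliation path is to treat the session log as the source of truth and re-index whatever is missing on restart. The sketch below uses plain Python data in place of the session store and Milvus; the find_unindexed helper and the sample events are hypothetical.

```python
def find_unindexed(session_events, indexed_steps, indexable_types):
    """Return events present in the session log but missing from the vector index."""
    return [
        e for e in session_events
        if e["type"] in indexable_types and e["step"] not in indexed_steps
    ]

INDEXABLE_EVENT_TYPES = {"decision", "strategy", "resolution", "error_handling"}

# What the durable session log contains after a restart.
session_events = [
    {"step": 1, "type": "decision", "content": "pin numpy<2"},
    {"step": 2, "type": "tool_retry", "content": "retry pip install"},
    {"step": 3, "type": "resolution", "content": "upgrade scipy instead"},
]
# Steps already in the vector index: step 3 was appended to the log,
# then the harness crashed before the insert.
indexed_steps = {1}

missing = find_unindexed(session_events, indexed_steps, INDEXABLE_EVENT_TYPES)
print([e["step"] for e in missing])
```

In a real harness, indexed_steps would come from querying the collection for the session's step values, and each missing event would be embedded and inserted before the task resumes.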

With those in place, recall takes milliseconds for the vector search plus under 100ms for the embedding call. The top-five most relevant past events land in context before the agent notices friction. The session keeps its original job as the durable log; the harness gains the ability to query its own past semantically rather than sequentially. That’s a modest change at the API surface and a structural change in what the agent can do on long-horizon tasks.

Problem 2: Why Parallel Claude Agents Can’t Share Experience

Parallel Claude agents can’t share experience because Managed Agents sessions are isolated by design. The same isolation that makes horizontal scaling clean also prevents every brain from learning from every other brain.

In a decoupled harness, brains are stateless and independent. That isolation unlocks the latency wins Anthropic reports, and it also keeps every session running in the dark about every other session.

Agent A spends 40 minutes diagnosing a tricky SQL injection vector for one customer. An hour later, Agent B picks up the same case for a different customer and spends its own 40 minutes walking the same dead ends, running the same tool calls, and arriving at the same answer.

For a single user running the occasional agent, that is wasted compute. For a platform running dozens of concurrent AI agents across code review, vulnerability scans, and documentation generation for different customers every day, the cost compounds structurally.

If the experience every session produces evaporates the moment the session ends, the intelligence is disposable. A platform built this way scales linearly but, unlike human engineers, doesn’t get better at anything over time.

Solution 2: How to Build a Shared Agent Memory Pool with Milvus

Build one vector collection that every harness reads from at startup and writes to at shutdown, partitioned by tenant so experience pools across sessions without leaking across customers.

When a session ends, the key decisions, problems encountered, and approaches that worked are pushed into the shared Milvus collection. When a new brain initializes, the harness runs a semantic query as part of setup and injects the top-matching past experiences into the context window. Step one of the new agent inherits the lessons of every prior agent.

Two engineering decisions carry this from prototype to production.

Isolating Tenants with the Milvus Partition Key

Partition by tenant_id, and Customer A’s agent experiences physically don’t live in the same partition as Customer B’s. That’s isolation at the data layer rather than a query convention.

Brain A’s work on Company A’s codebase should never be retrievable by Company B’s agents. Milvus’s partition key handles this on a single collection, with no second collection per tenant and no sharding logic in application code.

from pymilvus import DataType, MilvusClient

# Declare partition key at schema creation.
schema = MilvusClient.create_schema()
schema.add_field(
    field_name="tenant_id",
    datatype=DataType.VARCHAR,
    max_length=64,
    is_partition_key=True   # Automatic per-tenant partitioning.
)

# Every query filters by tenant. Isolation is automatic.
results = milvus_client.search(
    collection_name="shared_agent_memory",
    data=[query_vector],
    filter=f'tenant_id == "{current_tenant}"',
    limit=5,
    output_fields=["content", "step", "session_id"]
)

Customer A’s agent experiences never surface in Customer B’s queries, not because the query filter is written correctly (though it has to be), but because the data physically does not live in the same partition as Customer B’s. One collection to operate, logical isolation enforced at the query layer, physical isolation enforced at the partition layer.

See the multi-tenancy strategies docs for when partition key fits versus when separate collections or databases do, and the multi-tenancy RAG patterns guide for production deployment notes.

Why Agent Memory Quality Needs Ongoing Work

Memory quality erodes over time: flawed workarounds that happened to succeed once get replayed and reinforced, and stale entries tied to deprecated dependencies keep misleading agents that inherit them. The defenses are operational programs, not database features.

An agent stumbles on a flawed workaround that happens to succeed once. It gets written to the shared pool. The next agent retrieves it, replays it, and reinforces the bad pattern with a second “successful” usage record.

Stale entries follow a slower version of the same path. A fix pinned to a dependency version that was deprecated six months ago keeps getting retrieved, and keeps misleading agents that inherit it. The older and more heavily used the pool, the more of this accumulates.

Three operational programs defend against this:

Confidence score. Track how often a memory has been successfully applied in downstream sessions. Decay entries that fail in replay. Promote entries that succeed repeatedly.
Time weighting. Prefer recent experiences. Retire entries past a known staleness threshold, often tied to major dependency version bumps.
Human spot checks. Entries with high retrieval frequency are high-leverage. When one of them is wrong, it is wrong many times, which is where human review pays back fastest.
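
The first two programs can be folded into a rerank step applied to search results before they reach the context window. This is a sketch under stated assumptions: the smoothed success rate, the 90-day half-life, and the field names are illustrative choices, not Milvus features or recommended values.

```python
def memory_score(similarity, successes, failures, age_days, half_life_days=90.0):
    """Combine vector similarity with confidence and recency weights."""
    # Smoothed success rate: an entry with no replay history starts at 0.5.
    confidence = (successes + 1) / (successes + failures + 2)
    # Exponential decay: the weight halves every half_life_days.
    recency = 0.5 ** (age_days / half_life_days)
    return similarity * confidence * recency

def rerank(candidates):
    """Order retrieved memories by the combined score, best first."""
    return sorted(
        candidates,
        key=lambda c: memory_score(
            c["similarity"], c["successes"], c["failures"], c["age_days"]
        ),
        reverse=True,
    )

# Hypothetical search results with replay statistics attached.
candidates = [
    {"id": "old_workaround", "similarity": 0.92, "successes": 1, "failures": 4, "age_days": 180},
    {"id": "recent_fix", "similarity": 0.85, "successes": 6, "failures": 0, "age_days": 10},
]

ranked = [c["id"] for c in rerank(candidates)]
print(ranked)
```

The stale workaround is the closer vector match, but its failure record and age push it below the recent fix.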

Milvus alone doesn’t solve this, and neither does Mem0, Zep, or any other memory product. Enforcing one pool with many tenants and zero cross-tenant leakage is something you engineer once. Keeping that pool accurate, fresh, and useful is continuous operational work that no database ships pre-configured.

Takeaways: What Milvus Adds to Anthropic’s Managed Agents

Milvus turns Managed Agents from a reliable-but-forgetful platform into one that compounds experience over time by adding semantic recall inside a session and shared memory across agents.

Managed Agents answered the reliability question cleanly: both brains and hands are cattle, and any one can die without taking the task with it. That’s the infrastructure problem, and Anthropic solved it well.

What stayed open was growth. Human engineers compound over time; years of work turn into pattern recognition, and they don’t reason from first principles on every task. Today’s managed agents don’t, because every session starts on a blank page.

Wiring the session to Milvus for semantic recall inside a task and pooling experience across brains in a shared vector collection is what gives agents a past they can actually use. Plugging in Milvus is the infrastructure piece; pruning wrong memories, retiring stale ones, and enforcing tenant boundaries is the operational piece. Once both are in place, the shape of memory stops being a liability and starts being compounding capital.

Get Started

Try it locally: spin up an embedded Milvus instance with Milvus Lite. No Docker, no cluster, just pip install pymilvus. Production workloads graduate to Milvus Standalone or Distributed when you need them.
Read the design rationale: Anthropic’s Managed Agents engineering post walks through the session, harness, and sandbox decoupling in depth.
Got questions? Join the Milvus Discord community for agent memory design discussions, or book a Milvus Office Hours session to walk through your workload.
Prefer managed? Sign up for Zilliz Cloud (or sign in) for hosted Milvus with partition keys, scaling, and multi-tenancy built in. New accounts get free credits on a work email.

Frequently Asked Questions

Q: What’s the difference between a session and a context window in Anthropic’s Managed Agents?

The context window is the ephemeral set of tokens a single Claude call sees. It’s bounded and resets per model invocation. The session is the durable, append-only event log of everything that happened across the whole task, stored outside the harness. When a harness crashes, wake(sessionId) spawns a new harness that reads the session log and resumes. The session is the system of record; the context window is working memory.

Q: How do I persist agent memory across Claude sessions?

The session itself is already persistent; that’s what getSession(id) retrieves. What’s typically missing is queryable long-term memory. The pattern is to embed high-signal events (decisions, resolutions, strategies) into a vector database like Milvus during emitEvent, then query by semantic similarity at retrieval time. This gives you both the durable session log Anthropic provides and a semantic recall layer for looking back across hundreds of steps.

Q: Can multiple Claude agents share memory?

Not out of the box. Each Managed Agents session is isolated by design, which is what lets them scale horizontally. To share memory across agents, add a shared vector collection (for example in Milvus) that each harness reads from at startup and writes to at shutdown. Use Milvus’s partition key feature to isolate tenants so Customer A’s agent memories never leak into Customer B’s sessions.
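
The isolation rule can be illustrated with a small in-memory stand-in. This is a sketch, not Milvus code: `SharedMemoryPool`, `tenant_id`, and the two-dimensional vectors are hypothetical, and in Milvus the same idea would be expressed with a partition key field on the collection plus a tenant filter on each search.

```python
from dataclasses import dataclass, field


@dataclass
class SharedMemoryPool:
    """In-memory stand-in for a shared vector collection (illustration only)."""
    rows: list = field(default_factory=list)

    def write(self, tenant_id, vector, text):
        # Every row carries the tenant it belongs to.
        self.rows.append({"tenant_id": tenant_id, "vector": vector, "text": text})

    def search(self, tenant_id, query_vector, limit=3):
        # Filter by tenant first, so other tenants' rows are never candidates.
        candidates = [r for r in self.rows if r["tenant_id"] == tenant_id]
        # Rank the remaining rows by inner product with the query vector.
        ranked = sorted(
            candidates,
            key=lambda r: sum(q * v for q, v in zip(query_vector, r["vector"])),
            reverse=True,
        )
        return [r["text"] for r in ranked[:limit]]


pool = SharedMemoryPool()
pool.write("customer_a", [1.0, 0.0], "A: retry webhook with backoff")
pool.write("customer_a", [0.0, 1.0], "A: rotate API keys monthly")
pool.write("customer_b", [1.0, 0.0], "B: internal billing workaround")

results = pool.search("customer_a", [0.9, 0.1], limit=3)
print(results)
```

Even though Customer B's entry is as close to the query as Customer A's best match, it never appears in Customer A's results because the tenant filter runs before ranking.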

Q: What’s the best vector database for AI agent memory?

The honest answer depends on scale and deployment shape. For prototypes and small workloads, a local embedded option like Milvus Lite runs in-process with no infrastructure. For production agents across many tenants, you want a database with mature multi-tenancy (partition keys, filtered search), hybrid search (vector + scalar + keyword), and millisecond latency at millions of vectors. Milvus is purpose-built for vector workloads at that scale, which is why it appears in production agent memory systems built on LangChain, Google ADK, Deep Agents, and OpenAgents.

DeepSeek V4 vs GPT-5.5 vs Qwen3.6: Which Model Should You Use?

Milvus — Thu, 11 Jun 2026 10:40:05 GMT

New model releases are moving faster than production teams can evaluate them. DeepSeek V4, GPT-5.5, and Qwen3.6-35B-A3B all look strong on paper, but the harder question for AI application developers is practical: which model should you use for retrieval-heavy systems, coding tasks, long-context analysis, and RAG pipelines?

This article compares the three models in practical tests: live information retrieval, concurrency-bug debugging, and long-context marker retrieval. Then it shows how to connect DeepSeek V4 to the Milvus vector database, so retrieved context comes from a searchable knowledge base instead of the model’s parameters alone.

What Are DeepSeek V4, GPT-5.5, and Qwen3.6-35B-A3B?

DeepSeek V4, GPT-5.5, and Qwen3.6-35B-A3B are different AI models that target different parts of the model stack. DeepSeek V4 focuses on open-weight long-context inference. GPT-5.5 focuses on frontier-hosted performance, coding, online research, and tool-heavy tasks. Qwen3.6-35B-A3B focuses on open-weight multimodal deployment with a much smaller active-parameter footprint.

The comparison matters because a production vector search system rarely depends on the model alone. Model capability, context length, deployment control, retrieval quality, and serving cost all affect the final user experience.

DeepSeek V4: An Open-Weight MoE Model for Long-Context Cost Control

DeepSeek V4 is an open-weight MoE model family released by DeepSeek on April 24, 2026. The official release lists two variants: DeepSeek V4-Pro and DeepSeek V4-Flash. V4-Pro has 1.6T total parameters with 49B activated per token, while V4-Flash has 284B total parameters with 13B activated per token. Both support a 1M-token context window.

The DeepSeek V4-Pro model card also lists the model as MIT-licensed and available through Hugging Face and ModelScope. For teams building long-context document workflows, the main appeal is cost control and deployment flexibility compared with fully closed frontier APIs.

GPT-5.5: A Hosted Frontier Model for Coding, Research, and Tool Use

GPT-5.5 is a closed frontier model released by OpenAI on April 23, 2026. OpenAI positions it for coding, online research, data analysis, document work, spreadsheet work, software operation, and tool-based tasks. The official model docs list gpt-5.5 with a 1M-token API context window, while Codex and ChatGPT product limits may differ.

OpenAI reports strong coding benchmark results: 82.7% on Terminal-Bench 2.0, 73.1% on Expert-SWE, and 58.6% on SWE-Bench Pro. The tradeoff is price: the official API pricing lists GPT-5.5 at $5 per 1M input tokens and $30 per 1M output tokens, before any product-specific or long-context pricing details.
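
To make the pricing concrete, here is a back-of-the-envelope cost calculation using the two list prices quoted above. The token counts are hypothetical, and the sketch ignores any product-specific or long-context pricing details.

```python
# List prices quoted above, in USD per 1M tokens.
INPUT_PRICE_PER_M = 5.00
OUTPUT_PRICE_PER_M = 30.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the quoted list prices."""
    return (input_tokens / 1_000_000 * INPUT_PRICE_PER_M
            + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M)


# Hypothetical workloads: stuffing a long document into the prompt
# versus sending a few retrieved chunks.
full_document = request_cost(input_tokens=800_000, output_tokens=1_000)
retrieved_chunks = request_cost(input_tokens=4_000, output_tokens=1_000)

print(f"Full document in prompt: ${full_document:.2f} per request")
print(f"Retrieved chunks only:   ${retrieved_chunks:.2f} per request")
```

At these assumed sizes, input tokens dominate the long-prompt request, which is why trimming the prompt with retrieval changes the per-request cost so much.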

Qwen3.6-35B-A3B: A Smaller Active-Parameter Model for Local and Multimodal Workloads

Qwen3.6-35B-A3B is an open-weight MoE model from Alibaba’s Qwen team. Its model card lists 35B total parameters, 3B activated parameters, a vision encoder, and Apache-2.0 licensing. It supports a native 262,144-token context window and can extend to about 1,010,000 tokens with YaRN scaling.

That makes Qwen3.6-35B-A3B attractive when local deployment, private serving, image-text input, or Chinese-language workloads matter more than managed frontier-model convenience.

DeepSeek V4 vs GPT-5.5 vs Qwen3.6: Model Specs Compared

How We Tested DeepSeek V4, GPT-5.5, and Qwen3.6

These tests are not a replacement for full benchmark suites. They are practical checks that mirror common developer questions: can the model retrieve current information, reason about subtle code bugs, and locate facts inside a very long document?

Which Model Handles Real-Time Information Retrieval Best?

We asked each model three time-sensitive questions using web search where available. The instruction was simple: return only the answer and include the source URL.

OpenAI’s image pricing page uses the label “medium” rather than “standard” for the $0.053 1024×1024 result, so the question is normalized here to match the current API wording.

Real-Time Retrieval Results: GPT-5.5 Had the Clearest Advantage

DeepSeek V4-Pro

DeepSeek V4-Pro answered the first question incorrectly. It could not answer the second and third questions through live web search in this setup.

The second answer included the correct Billboard URL but did not retrieve the current No. 1 song. The third answer used the wrong source, so we counted it as incorrect.

GPT-5.5

GPT-5.5 handled this test much better. Its answers were short, accurate, sourced, and fast. When a task depends on current information and the model has live retrieval available, GPT-5.5 had the clear advantage in this setup.

Qwen3.6-35B-A3B

Qwen3.6-35B-A3B produced a result similar to DeepSeek V4-Pro. It did not have live web access in this setup, so it could not complete the real-time retrieval task.

Which Model Is Better at Debugging Concurrency Bugs?

The second test used a Python bank-transfer example with three layers of concurrency problems. The task was not just to find the obvious race condition, but also to explain why the total balance breaks and provide corrected code.

The prompt and code are as shown below:

The following Python code simulates two bank accounts transferring
money to each other. The total balance should always equal 2000,
but it often doesn't after running.

Please:
1. Find ALL concurrency bugs in this code (not just the obvious one)
2. Explain why Total ≠ 2000 with a concrete thread execution example
3. Provide the corrected code

import threading

class BankAccount:
    def __init__(self, balance):
        self.balance = balance

    def transfer(self, target, amount):
        if self.balance >= amount:
            self.balance -= amount
            target.balance += amount
            return True
        return False

def stress_test():
    account_a = BankAccount(1000)
    account_b = BankAccount(1000)

    def transfer_a_to_b():
        for _ in range(1000):
            account_a.transfer(account_b, 1)

    def transfer_b_to_a():
        for _ in range(1000):
            account_b.transfer(account_a, 1)

    threads = [threading.Thread(target=transfer_a_to_b) for _ in range(10)]
    threads += [threading.Thread(target=transfer_b_to_a) for _ in range(10)]

    for t in threads: t.start()
    for t in threads: t.join()

    print(f"Total: {account_a.balance + account_b.balance}")
    print(f"A: {account_a.balance}, B: {account_b.balance}")

stress_test()

Code Debugging Results: GPT-5.5 Gave the Most Complete Answer

DeepSeek V4-Pro

DeepSeek V4-Pro gave a concise analysis and went straight to the ordered-lock solution, which is the standard way to avoid ABBA deadlock. Its answer demonstrated the right fix, but it did not spend much time explaining why the naive lock-based fix could introduce a new failure mode.

GPT-5.5

GPT-5.5 performed best on this test. It found the core issues, anticipated the deadlock risk, explained why the original code could fail, and provided a complete corrected implementation.

Qwen3.6-35B-A3B

Qwen3.6-35B-A3B identified the bugs accurately, and its example execution sequence was clear. The weaker part was the fix: it chose a global class-level lock, which makes every account share the same lock. That works for a small simulation, but it is a poor tradeoff for a real banking system because unrelated account transfers must still wait on the same lock.

In short: GPT-5.5 not only solved the current bug, but also warned about the next bug a developer might introduce. DeepSeek V4-Pro gave the cleanest non-GPT fix. Qwen3.6 found the issues and produced working code, but did not call out the scalability compromise.
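
For reference, the ordered-lock approach mentioned above looks roughly like the sketch below. This is our own minimal version, not the verbatim output of any of the three models. Ordering locks by `id()` and rejecting self-transfers are choices made for the example; a real system would more likely order by a stable account ID.

```python
import threading


class BankAccount:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()  # one lock per account, not one global lock

    def transfer(self, target, amount):
        if target is self:
            return False  # avoid acquiring the same non-reentrant lock twice
        # Always take the two locks in the same global order (here: by id()).
        # A->B and B->A transfers then contend on the same first lock instead
        # of each holding one lock and waiting for the other (ABBA deadlock).
        first, second = sorted((self, target), key=id)
        with first.lock, second.lock:
            # The check and both updates happen atomically under both locks.
            if self.balance >= amount:
                self.balance -= amount
                target.balance += amount
                return True
            return False


def stress_test():
    account_a = BankAccount(1000)
    account_b = BankAccount(1000)

    def move(source, target):
        for _ in range(1000):
            source.transfer(target, 1)

    threads = [threading.Thread(target=move, args=(account_a, account_b)) for _ in range(10)]
    threads += [threading.Thread(target=move, args=(account_b, account_a)) for _ in range(10)]

    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return account_a.balance, account_b.balance


final_a, final_b = stress_test()
print(f"Total: {final_a + final_b}")
print(f"A: {final_a}, B: {final_b}")
```

Unlike a single class-level lock, per-account locks let transfers between unrelated accounts proceed in parallel, which is the scalability point raised in the Qwen3.6 result above.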

Which Model Handles Long-Context Retrieval Best?

For the long-context test, we used the full text of Dream of the Red Chamber, roughly 850,000 Chinese characters. We inserted a hidden marker around the 500,000-character position:

【Milvus test verification code: ZK-7749-ALPHA】

Then we uploaded the file to each model and asked it to find both the marker content and its position.
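
A test file like this can be prepared with a few lines of Python. The sketch below is an illustration under stated assumptions: the filler text stands in for the novel, `insert_marker` and `locate_marker` are hypothetical helper names, and the exact insertion offset of 500,000 is assumed because the test only specifies "around" that position.

```python
MARKER = "【Milvus test verification code: ZK-7749-ALPHA】"


def insert_marker(text: str, marker: str, position: int) -> str:
    """Insert marker so its first character lands at zero-based `position`."""
    return text[:position] + marker + text[position:]


def locate_marker(text: str, marker: str):
    """Return the marker position in both counting conventions, or None."""
    index = text.find(marker)
    if index == -1:
        return None
    return {"zero_indexed": index, "one_indexed": index + 1}


# Stand-in for the roughly 850,000-character novel text.
filler = "字" * 850_000
document = insert_marker(filler, MARKER, 500_000)

found = locate_marker(document, MARKER)
print(found)
```

Keeping both conventions in the ground truth matters when grading answers: a model that counts from one and a model that counts from zero can both be correct while reporting positions that differ by one.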

Long-Context Retrieval Results: GPT-5.5 Found the Marker Most Precisely

DeepSeek V4-Pro

DeepSeek V4-Pro found the hidden marker, but it did not find the correct character position. It also gave the wrong surrounding context. In this test, it seemed to locate the marker semantically but lose track of the exact position while reasoning over the document.

GPT-5.5

GPT-5.5 found the marker content, the position, and the surrounding context correctly. It reported the position as 500,002 and even distinguished between zero-indexed and one-indexed counting. The surrounding context also matched the text used when inserting the marker.

Qwen3.6-35B-A3B

Qwen3.6-35B-A3B found the marker content and nearby context correctly, but its position estimate was wrong.

What Do These Tests Say About Model Selection?

The three tests point to a practical selection pattern: GPT-5.5 is the capability pick, DeepSeek V4-Pro is the long-context cost-performance pick, and Qwen3.6-35B-A3B is the local-control pick.

Use GPT-5.5 when you need the strongest result and cost is secondary. Use DeepSeek V4-Pro when you need long context, lower serving cost, and API-friendly deployment. Use Qwen3.6-35B-A3B when open weights, private deployment, multimodal support, or serving-stack control matter most.

For retrieval-heavy applications, though, model choice is only half the story. Even a strong long-context model performs better when the context is retrieved, filtered, and grounded by a dedicated semantic search system.

Why RAG Still Matters for Long-Context Models

A long context window does not remove the need for retrieval. It changes the retrieval strategy.

In a RAG application, the model should not scan every document on every request. A vector database stores embeddings, searches for semantically relevant chunks, applies metadata filters, and returns a compact context set to the model. That gives the model better input while reducing cost and latency.

Milvus fits this role because it handles collection schemas, vector indexing, scalar metadata, and retrieval operations in one system. You can start locally with Milvus Lite, move to Milvus Standalone, deploy with Docker or Docker Compose, and scale out on Kubernetes when the workload grows.

How to Build a RAG Pipeline with Milvus and DeepSeek V4

The following walkthrough builds a small RAG pipeline using DeepSeek V4-Pro for generation and Milvus for retrieval. The same structure applies to other LLMs: create embeddings, store them in a collection, search for relevant context, and pass that context into the model.

For a broader walkthrough, see the official Milvus RAG tutorial. This example keeps the pipeline small so the retrieval flow is easy to inspect.

Prepare the Environment

Install the Dependencies

! pip install --upgrade "pymilvus[model]" openai requests tqdm

If you are using Google Colab, you may need to restart the runtime after installing dependencies. Click the Runtime menu, then select Restart session.

DeepSeek V4-Pro supports an OpenAI-style API. Log in to the official DeepSeek website, create an API key, and set it as the DEEPSEEK_API_KEY environment variable.

import os
os.environ["DEEPSEEK_API_KEY"] = "sk-*****************"

Prepare the Milvus Documentation Dataset

We use the FAQ pages from the Milvus 2.4.x documentation archive as the private knowledge source. This is a simple starter dataset for a small RAG demo.

First, download the ZIP file and extract the documentation into the milvus_docs folder.

! wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
! unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs

We load all Markdown files from the milvus_docs/en/faq folder. For each document, we split the file content by #, which roughly separates major Markdown sections.

from glob import glob
text_lines = []
for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as file:
        file_text = file.read()
    text_lines += file_text.split("# ")
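
To see what splitting on "# " actually produces, here is a small self-contained check on a made-up Markdown string. The sample text and the `clean_chunks` helper are hypothetical and not part of the tutorial pipeline.

```python
sample = (
    "# FAQ\n\nIntro text.\n\n"
    "## Where is data stored?\n\nIn object storage.\n\n"
    "## How is data flushed?\n\nCall flush().\n"
)

# Same operation as the loader above: split on "# " (hash followed by a space).
raw_chunks = sample.split("# ")
for chunk in raw_chunks:
    print(repr(chunk))


def clean_chunks(chunks):
    """Trim whitespace and leftover '#' characters, then drop empty chunks."""
    cleaned = (chunk.strip().strip("#").strip() for chunk in chunks)
    return [chunk for chunk in cleaned if chunk]


print(clean_chunks(raw_chunks))
```

The split leaves an empty first chunk and a dangling "#" at the end of each chunk that precedes a deeper heading, which is why retrieved text can end in stray hash marks. A cleanup step like this would change the number of chunks, so counts would no longer match the run shown in this tutorial.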

Set Up DeepSeek V4 and the Embedding Model

from openai import OpenAI
deepseek_client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

Next, choose an embedding model. This example uses DefaultEmbeddingFunction from the PyMilvus model module. See the Milvus docs for more on embedding functions.

from pymilvus import model as milvus_model
embedding_model = milvus_model.DefaultEmbeddingFunction()

Generate a test vector, then print the vector dimension and the first few elements. The returned dimension is used when creating the Milvus collection.

test_embedding = embedding_model.encode_queries(["This is a test"])[0]
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
768
[-0.04836066  0.07163023 -0.01130064 -0.03789345 -0.03320649 -0.01318448
 -0.03041712 -0.02269499 -0.02317863 -0.00426028]

Load Data into Milvus

Create a Milvus Collection

A Milvus collection stores vector fields, scalar fields, and optional dynamic metadata. The quick setup below uses the high-level MilvusClient API; for production schemas, review the docs on collection management and creating collections.

from pymilvus import MilvusClient
milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"

A few notes about MilvusClient:

Setting uri to a local file, such as ./milvus_demo.db, is the easiest option because it automatically uses Milvus Lite and stores all data in that file.
If you have a large dataset, you can set up a higher-performance Milvus server on Docker or Kubernetes. In that setup, use the server URI, such as http://localhost:19530, as your uri.
If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, set uri and token to the public endpoint and API key from Zilliz Cloud.

Check whether the collection already exists. If it does, delete it.

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

Create a new collection with the specified parameters. If we do not specify field information, Milvus automatically creates a default id field as the primary key and a vector field to store vector data. A reserved JSON field stores scalar data that is not defined in the schema.

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)

The IP metric means inner product similarity. Milvus also supports other metric types and index choices depending on the vector type and workload; see the guides on metric types and index selection. The Strong setting is one of the available consistency levels.
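
As a quick illustration of what the inner product measures, the sketch below compares it with cosine similarity on hand-picked three-dimensional vectors. The vectors are made up for the example; whether a given embedding model returns unit-length vectors is something to check for that model.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


query = [1.0, 2.0, 2.0]
same_direction = [1.0, 2.0, 2.0]   # identical direction to the query
longer_vector = [3.0, 0.0, 4.0]    # different direction, larger magnitude

# Raw inner product rewards magnitude as well as direction.
print(inner_product(query, same_direction), inner_product(query, longer_vector))

# After normalization, inner product and cosine similarity agree.
print(inner_product(normalize(query), normalize(longer_vector)),
      cosine(query, longer_vector))
```

On raw vectors, the longer vector scores higher than the one pointing in exactly the query's direction. Once vectors are normalized to unit length, the inner product ranks results the same way cosine similarity does.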

Insert the Embedded Documents

Iterate through the text data, create embeddings, and insert the data into Milvus. Here, we add a new field named text. Since it is not explicitly defined in the collection schema, it is automatically added to the reserved dynamic JSON field. For production metadata, review dynamic field support and the JSON field overview.

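from tqdm import tqdm
data = []
# Embed all chunks in one batch call; this is the expensive step.
doc_embeddings = embedding_model.encode_documents(text_lines)
# The loop only pairs each chunk with its precomputed vector, so the progress bar finishes almost instantly.
for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": doc_embeddings[i], "text": line})
# "text" is not defined in the schema, so it is stored in the dynamic JSON field.
milvus_client.insert(collection_name=collection_name, data=data)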
Creating embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 72/72 [00:00<00:00, 1222631.13it/s]
{'insert_count': 72, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'cost': 0}

For larger datasets, the same pattern can be extended with explicit schema design, vector field indexes, scalar indexes, and data lifecycle operations such as insert, upsert, and delete.

Build the RAG Retrieval Flow

Search Milvus for Relevant Context

Let’s define a common question about Milvus.

question = "How is data stored in milvus?"

Search the collection for the question and retrieve the top three semantic matches. This is a basic single-vector search. In production, you can combine it with filtered search, full-text search, multi-vector hybrid search, and reranking strategies to improve relevance.

search_res = milvus_client.search(
    collection_name=collection_name,
    data=embedding_model.encode_queries(
        [question]
    ),  # Convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

Now let’s look at the search results for the query.

import json
retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))
[
    [
        " Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###",
        0.6572665572166443
    ],
    [
        "How does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. Then Milvus' data node writes the data in the message queue to persistent storage as incremental logs. If `flush()` is called, the data node is forced to write all data in the message queue to persistent storage immediately.\n\n###",
        0.6312146186828613
    ],
    [
        "How does Milvus handle vector data types and precision?\n\nMilvus supports Binary, Float32, Float16, and BFloat16 vector types.\n\n- Binary vectors: Store binary data as sequences of 0s and 1s, used in image processing and information retrieval.\n- Float32 vectors: Default storage with a precision of about 7 decimal digits. Even Float64 values are stored with Float32 precision, leading to potential precision loss upon retrieval.\n- Float16 and BFloat16 vectors: Offer reduced precision and memory usage. Float16 is suitable for applications with limited bandwidth and storage, while BFloat16 balances range and efficiency, commonly used in deep learning to reduce computational requirements without significantly impacting accuracy.\n\n###",
        0.6115777492523193
    ]
]

Generate a RAG Answer with DeepSeek V4

Convert the retrieved documents into string format.

context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)

Define the system and user prompts for the LLM. This prompt is assembled from the documents retrieved from Milvus.

SYSTEM_PROMPT = """
You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

Use the model provided by DeepSeek V4-Pro to generate a response based on the prompt.

response = deepseek_client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)
Milvus stores data in two distinct ways depending on the type:
- **Inserted data** (vector data, scalar data, and collection-specific schema) are stored in persistent storage as incremental logs. Milvus supports multiple object storage backends, such as MinIO, AWS S3, Google Cloud Storage, Azure Blob Storage, Alibaba Cloud OSS, and Tencent Cloud Object Storage. Before reaching persistent storage, the data is initially loaded into a message queue; a data node then writes it to disk, and calling `flush()` forces an immediate write.
- **Metadata**, generated by each Milvus module, is stored in **etcd**.

At this point, the pipeline has completed the core RAG loop: embed documents, store vectors in Milvus, search for relevant context, and generate an answer with DeepSeek V4-Pro.

What Should You Improve Before Production?

The demo uses simple section splitting and top-k retrieval. That is enough to show the mechanics, but production RAG usually needs more retrieval control.

The most important production habit is to evaluate retrieval separately from generation. If the retrieved context is weak, swapping the LLM often hides the problem instead of solving it.
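
One common way to evaluate retrieval on its own is recall@k over a small labeled set. The sketch below assumes you have, for each test question, the IDs the search returned and the IDs a human marked as relevant. The `eval_set` values are made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Hypothetical labeled questions: search results vs. human-labeled relevant chunks.
eval_set = [
    {"retrieved": [12, 7, 40], "relevant": [12]},
    {"retrieved": [3, 55, 9], "relevant": [9, 21]},
    {"retrieved": [8, 1, 2], "relevant": [30]},
]

scores = [recall_at_k(row["retrieved"], row["relevant"], k=3) for row in eval_set]
mean_recall = sum(scores) / len(scores)

print(scores)
print(f"mean recall@3 = {mean_recall:.2f}")
```

Because this metric never calls the LLM, a low score points at chunking, embeddings, or search settings rather than at the generation model.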

Get Started with Milvus and DeepSeek RAG

If you want to reproduce the tutorial, start with the official Milvus documentation and the Build RAG with Milvus guide. For a managed setup, connect to Zilliz Cloud with your cluster endpoint and API key instead of running Milvus locally.

If you want help tuning chunking, indexing, filters, or hybrid retrieval, join the Milvus Slack community or book a free Milvus Office Hours session. If you would rather skip infrastructure setup, sign in to Zilliz Cloud or create a Zilliz Cloud account to run managed Milvus.

Questions Developers Ask About DeepSeek V4, Milvus, and RAG

Is DeepSeek V4 good for RAG?

DeepSeek V4-Pro is a strong fit for RAG when you need long-context processing and lower serving cost than premium closed models. You still need a retrieval layer such as Milvus to select relevant chunks, apply metadata filters, and keep the prompt focused.

Should I use GPT-5.5 or DeepSeek V4 for a RAG pipeline?

Use GPT-5.5 when answer quality, tool use, and live research matter more than cost. Use DeepSeek V4-Pro when long-context processing and cost control matter more, especially if your retrieval layer already supplies high-quality grounded context.

Can I run Qwen3.6-35B-A3B locally for private RAG?

Yes, Qwen3.6-35B-A3B is open weight and designed for more controllable deployment. It is a good candidate when privacy, local serving, multimodal input, or Chinese-language performance matters, but you still need to validate latency, memory, and retrieval quality for your hardware.

Do long-context models make vector databases unnecessary?

No. Long-context models can read more text, but they still benefit from retrieval. A vector database narrows the input to relevant chunks, supports metadata filtering, reduces token cost, and makes the application easier to update as documents change.

7 Best Open-Source Tools for Claude Code Context Management

Milvus — Thu, 11 Jun 2026 10:39:24 GMT

You can give Claude Code a 1M-token context window and still get worse answers over time. The issue is not only context size. It is context quality.

Claude Code sessions degrade when terminal logs, raw tool output, repeated file reads, verbose responses, and forgotten project history all compete for attention. In long-running agent workflows, that noise turns into a loop: the model loses the thread, you add more turns to fix the answer, and those extra turns add even more noise.

This is context defocus: the model has enough room to hold information, but the important information is buried under low-signal context. Bigger windows can make this easier to ignore because developers stop thinking carefully about what enters the prompt.

Prompt caching diagram showing how reused prefixes can still add billed context across turns

Prompt caching can reduce repeated-prefix cost, but it does not turn the context window into a junk drawer. You still pay for new tokens, and you still need the model to reason over the right information.

This article reviews seven open-source tools that attack context defocus from different layers: terminal output, tool output, codebase navigation, file reading, model verbosity, semantic code retrieval, and cross-session memory. It also explains how these ideas map to vector database design, vector similarity search, and retrieval systems such as Milvus.

What causes Claude Code context defocus?

Claude Code context defocus usually comes from five failure modes: too much raw instruction text, noisy tool output, repeated codebase exploration, long model responses, and memory gaps across sessions or agents.

Five causes of Claude Code context loss: redundant instructions, messy tool output, repeated codebase retrieval, long responses, and memory gaps

A good context-management stack should do three things: keep junk out, retrieve the right project knowledge on demand, and preserve durable decisions across sessions.

Which Claude Code context tool should you use first?

Start with the layer that creates the most noise in your workflow. If your terminal output is the problem, start with RTK. If Claude keeps wandering through a large repository, start with claude-context or code-review-graph. If your real pain is re-explaining the same decisions every day, start with memsearch.

The first five tools reduce what enters or remains in context. The last two make useful context easier to recall.

RTK compresses raw command output before Claude sees it

RTK is a CLI proxy for reducing token usage from common developer commands. Its GitHub description says it reduces LLM token consumption by 60–90% on common dev commands, and it ships as a single Rust binary.

In everyday Claude Code use, commands like git status, pytest, and directory listings often dump full environment info and status descriptions into the context window. The model usually needs only a smaller answer: which files changed, which test failed, where the PR is stuck, or what key files exist in the directory.

RTK sits between the shell and Claude. It can rewrite commands through Claude Code hooks and pass back compressed output.

Raw git status output:

On branch feat/payment-retry
Your branch is up to date with 'origin/feat/payment-retry'.

Changes not staged for commit:
  modified:   src/webhook/handler.ts
  modified:   src/queue/dlq.ts
  modified:   tests/webhook.test.ts

Untracked files:
  docs/notes.md

no changes added to commit

What actually matters:

3 modified, 1 untracked
- src/webhook/handler.ts
- src/queue/dlq.ts
- tests/webhook.test.ts
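
To make the idea concrete, here is a minimal Python sketch of this kind of compression. It is not RTK's implementation (RTK is a Rust binary and its internals are not shown here); the function name `compress_git_status` and the parsing rules are assumptions made purely for illustration, and they only handle the sections that appear in the sample above.

```python
# Illustrative only: a toy filter, not RTK's actual logic.
RAW_STATUS = """\
On branch feat/payment-retry
Your branch is up to date with 'origin/feat/payment-retry'.

Changes not staged for commit:
  modified:   src/webhook/handler.ts
  modified:   src/queue/dlq.ts
  modified:   tests/webhook.test.ts

Untracked files:
  docs/notes.md

no changes added to commit
"""


def compress_git_status(raw: str) -> str:
    """Reduce `git status` text to counts plus the modified paths."""
    modified, untracked = [], []
    in_untracked = False
    for line in raw.splitlines():
        text = line.strip()
        if text.startswith("Untracked files"):
            in_untracked = True
        elif text.startswith("modified:"):
            modified.append(text.split(":", 1)[1].strip())
        elif in_untracked and text and line[0].isspace() and not text.startswith("("):
            # Indented lines after the "Untracked files" header are paths.
            untracked.append(text)
    summary = f"{len(modified)} modified, {len(untracked)} untracked"
    return "\n".join([summary] + [f"- {path}" for path in modified])


print(compress_git_status(RAW_STATUS))
```

Run on the sample, the sketch prints the same four-line summary shown above. A real tool would also need to handle staged, deleted, and renamed files.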

Same story with pytest. The raw output is full of passing cases and environment noise:

============================= test session starts =============================
platform darwin -- Python 3.12.4, pytest-8.4.1
collected 128 items

tests/test_auth.py ....................................
tests/test_webhook.py ....F....
tests/test_queue.py ...................................

================================== FAILURES ==================================
________________ test_retry_to_dlq __________________
E   AssertionError: expected status code 202, got 500

Compressed, the signal is immediate:

128 tests collected, 1 failed
FAIL tests/test_webhook.py::test_retry_to_dlq
AssertionError: expected status code 202, got 500
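
The same reduction can be sketched for test output. The snippet below is a rough illustration under stated assumptions, not RTK's code: `compress_pytest` is a hypothetical helper, it only understands the truncated output format shown above, and it pairs a failure with a file only when exactly one file reports an `F`.

```python
import re

RAW_PYTEST = """\
============================= test session starts =============================
platform darwin -- Python 3.12.4, pytest-8.4.1
collected 128 items

tests/test_auth.py ....................................
tests/test_webhook.py ....F....
tests/test_queue.py ...................................

================================== FAILURES ==================================
________________ test_retry_to_dlq __________________
E   AssertionError: expected status code 202, got 500
"""


def compress_pytest(raw: str) -> str:
    """Keep the collected count, failing test IDs, and their error lines."""
    collected = re.search(r"collected (\d+) items?", raw)
    total = collected.group(1) if collected else "?"
    # Progress lines such as "tests/test_webhook.py ....F...." mark failing files.
    failing_files = re.findall(r"^(\S+\.py) [.sxEF]*F[.sxEF]*$", raw, flags=re.M)
    # Failure headers look like "_____ test_name _____".
    names = re.findall(r"^_+ (\S+) _+$", raw, flags=re.M)
    errors = re.findall(r"^E +(.*)$", raw, flags=re.M)

    lines = [f"{total} tests collected, {len(names)} failed"]
    for i, name in enumerate(names):
        prefix = f"{failing_files[0]}::" if len(failing_files) == 1 else ""
        lines.append(f"FAIL {prefix}{name}")
        if i < len(errors):  # naive pairing: one error line per failure
            lines.append(errors[i])
    return "\n".join(lines)


print(compress_pytest(RAW_PYTEST))
```

The design choice worth noting is that everything about passing tests is discarded; only the failing identifier and its first error line survive.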

RTK is the easiest starting point when your context bloat comes from shell commands rather than code retrieval.

Context Mode sandboxes giant tool outputs outside the main chat

Context Mode is built for the raw blocks that tools return: test logs, browser DOM snapshots, GitHub payloads, MCP tool output, and scraped pages. Its GitHub description positions it as context-window optimization for AI coding agents and reports a 98% reduction in tool output.

Context Mode GitHub repository card showing sandboxed tool output and context optimization positioning

Its approach is to isolate large tool outputs into a local sandbox and index, then pass only summaries and retrieval handles into the Claude conversation.

Context Mode flow showing large tool output moving through sandbox execution, SQLite or FTS indexes, summaries, and retrieval results

The flow is useful because a coding agent often needs the failing node, broken selector, or relevant stack trace, not the entire DOM or every passing test line. Context Mode keeps the full output available locally while preventing it from dominating the main conversation.
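
The store-then-retrieve pattern can be sketched in a few lines of Python. This is an assumption-laden illustration rather than Context Mode's code: the `OutputSandbox` class, its table layout, and the receipt fields are all hypothetical, and a plain `LIKE` query stands in for a proper full-text index.

```python
import sqlite3


class OutputSandbox:
    """Keeps full tool output in SQLite and returns only a small receipt."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS lines (handle INTEGER, lineno INTEGER, text TEXT)"
        )
        self.next_handle = 1

    def store(self, tool: str, output: str) -> dict:
        """Persist the raw output; the caller forwards only this receipt."""
        handle = self.next_handle
        self.next_handle += 1
        rows = [(handle, i, line) for i, line in enumerate(output.splitlines(), 1)]
        self.db.executemany("INSERT INTO lines VALUES (?, ?, ?)", rows)
        self.db.commit()
        return {"handle": handle, "tool": tool, "lines": len(rows), "chars": len(output)}

    def search(self, handle: int, term: str, limit: int = 5) -> list:
        """Return (line number, text) pairs whose text contains the term."""
        cursor = self.db.execute(
            "SELECT lineno, text FROM lines "
            "WHERE handle = ? AND text LIKE ? ORDER BY lineno LIMIT ?",
            (handle, f"%{term}%", limit),
        )
        return cursor.fetchall()


# A made-up 501-line log: 500 passing lines and one failure.
log = "\n".join(
    [f"PASS test_{i}" for i in range(500)]
    + ["FAIL test_retry_to_dlq: expected 202, got 500"]
)
sandbox = OutputSandbox()
receipt = sandbox.store("pytest", log)
print(receipt)
print(sandbox.search(receipt["handle"], "FAIL"))
```

Only the receipt and the single matching line would need to reach the conversation; the other 500 lines stay in the local database until something asks for them.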

This is similar to how production hybrid search systems separate storage from retrieval. You keep the raw data somewhere durable, then retrieve only the slice that matters.

code-review-graph maps code structure before Claude navigates it

code-review-graph addresses a different problem: Claude does not always need more text; it needs a better map.

In a large repository, a simple question can trigger expensive exploration:

After changing this login logic, which files and tests are affected?

Without a code graph, Claude’s typical move is:

read auth.ts
grep login
read middleware
read tests
keep guessing

code-review-graph pre-builds a structural map of the codebase. It uses Tree-sitter to parse functions, classes, imports, call relationships, inheritance, and test dependencies, then writes the graph into SQLite.

That makes it useful for code review and blast-radius analysis. Instead of asking Claude to rediscover the dependency graph through repeated reads, you let it query structure first.
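
A toy version of "query structure first" might look like the sketch below. It swaps in Python's built-in `ast` module for Tree-sitter and tracks only import edges, so it is far simpler than what code-review-graph does; the module names in `SOURCES`, the `edges` table, and the `blast_radius` helper are hypothetical.

```python
import ast
import sqlite3

# Hypothetical toy modules, keyed by module name.
SOURCES = {
    "auth": "def login(user):\n    return bool(user)\n",
    "middleware": "import auth\n\ndef guard(req):\n    return auth.login(req)\n",
    "billing": "def charge(amount):\n    return amount\n",
    "test_auth": "import auth\n\ndef test_login():\n    assert auth.login('a')\n",
    "test_middleware": "from middleware import guard\n\ndef test_guard():\n    assert guard('a')\n",
}


def build_graph(sources: dict) -> sqlite3.Connection:
    """Parse each module and record 'src imports dst' edges in SQLite."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
    for module, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            for target in targets:
                if target in sources:  # ignore third-party and stdlib imports
                    db.execute("INSERT INTO edges VALUES (?, ?)", (module, target))
    return db


def blast_radius(db: sqlite3.Connection, changed: str) -> list:
    """Every module that directly or transitively imports `changed`."""
    rows = db.execute(
        """
        WITH RECURSIVE affected(name) AS (
            SELECT src FROM edges WHERE dst = ?
            UNION
            SELECT e.src FROM edges e JOIN affected a ON e.dst = a.name
        )
        SELECT name FROM affected ORDER BY name
        """,
        (changed,),
    ).fetchall()
    return [row[0] for row in rows]


graph = build_graph(SOURCES)
print(blast_radius(graph, "auth"))
```

For a change to `auth`, the query returns `middleware`, `test_auth`, and `test_middleware` in one lookup, with no file reads by the model.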

This is adjacent to semantic search, but not identical. A structural graph answers “what depends on what?” Semantic retrieval answers “what code is conceptually related to this question?” In real code-assistant workflows, you often want both.

Token Savior gives Claude symbol summaries before full files

Token Savior’s core idea is simple: do not send the full file by default. Send an index or symbol summary first, then expand only when the task needs more detail.

Token Savior GitHub repository card showing its MCP server description and project statistics

If you ask where a payment webhook is handled, the model often does not need every line of every related file. It first needs to know whether a file or symbol is relevant.

Token Savior serves code in layers, starting from an index or symbol summary and expanding to the full file only on request.

This mirrors how developers actually read code. You scan, confirm relevance, then open the full file only when necessary. It also resembles the progressive retrieval pattern used in RAG applications: retrieve broadly enough to orient, then narrow the context before generation.
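
As a rough sketch of summary-first reading, the snippet below uses Python's `ast` module to return a one-line outline per symbol and the full body of a single symbol on demand. The `outline` and `expand` helpers and the sample source are assumptions for illustration; they are not Token Savior's API.

```python
import ast

# Hypothetical sample file contents.
SOURCE = '''\
def handle_webhook(event):
    """Route a payment webhook event."""
    if event.get("type") == "payment.failed":
        return send_to_dlq(event)
    return "ok"


def send_to_dlq(event):
    """Push a failed event to the dead letter queue."""
    return "queued"
'''

SYMBOL_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def outline(source: str) -> list:
    """Layer 1: one line per top-level symbol, no bodies."""
    entries = []
    for node in ast.parse(source).body:
        if isinstance(node, SYMBOL_TYPES):
            doc = ast.get_docstring(node) or ""
            first_line = doc.splitlines()[0] if doc else ""
            entries.append(f"{node.name} (line {node.lineno}): {first_line}")
    return entries


def expand(source: str, name: str):
    """Layer 2: full source of one symbol, only when the task needs it."""
    for node in ast.parse(source).body:
        if isinstance(node, SYMBOL_TYPES) and node.name == name:
            return ast.get_source_segment(source, node)
    return None


print("\n".join(outline(SOURCE)))
print(expand(SOURCE, "send_to_dlq"))
```

The outline costs two short lines here; the model pays for a function body only after it has decided that the symbol is relevant.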

Caveman reduces Claude’s own response bloat

Most context tools focus on what enters the model. Caveman targets what Claude outputs.

Caveman is a Claude Code skill/plugin that strips filler, pleasantries, wrapper sentences, over-explanation, and repetitive structures. The goal is not to remove knowledge; it is to make the answer denser.

Without Caveman:

The reason your React component is re-rendering is likely because…

With Caveman:

New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo.

This matters because Claude’s own answers become future context. If every answer includes a long explanation, the next turn starts with more text than it needs. Shorter answers can improve the next turn as much as they improve the current one.

For teams thinking about context engineering for AI agents, Caveman is a reminder that output policy is part of context policy.

claude-context adds semantic code search through MCP

claude-context solves the repeated-codebase-exploration problem with semantic retrieval. It indexes a repository, stores code chunks in a vector database, and exposes search through the Model Context Protocol.

Claude Context repository shown on GitHub Trending

In a big codebase, you constantly ask Claude questions like:

Help me figure out which parts of the code might be related to this bug.

Without a retrieval layer, Claude’s default approach is often:

list the directory
grep around
read a bunch of files
keep guessing

claude-context moves that work into a retrieval layer. It chunks the repository, generates embeddings, stores them in a Milvus-backed code index, and retrieves relevant code chunks before the model starts reading files blindly.

claude-context flow showing codebase chunking, embeddings, vector database and hybrid search, relevant code retrieval, and Claude context injection

This is where AI coding tools start to look like search systems. You need chunking, embeddings, metadata, lexical matching, ranking, and freshness. Those are the same building blocks behind production RAG retrieval, hybrid retrieval routing, and embedding model selection.
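
The chunk, embed, store, and retrieve loop can be sketched without any external services. The example below is deliberately a stand-in: word counts replace a real embedding model, an in-memory list replaces Milvus, and the `REPO` contents and helper names are hypothetical. It is not how claude-context is implemented; it only shows the shape of the pipeline.

```python
import math
import re
from collections import Counter

# Hypothetical repository contents.
REPO = {
    "src/webhook/handler.ts": "export function handleWebhook(event) {\n  return retryOrDeadLetter(event);\n}",
    "src/queue/dlq.ts": "export function pushToDeadLetterQueue(event) {\n  queue.push(event);\n}",
    "src/ui/button.tsx": "export function Button(props) {\n  return render(props.label);\n}",
}


def chunk(path: str, text: str, size: int = 20) -> list:
    """Split a file into chunks of at most `size` lines."""
    lines = text.splitlines()
    return [(path, "\n".join(lines[i:i + size])) for i in range(0, len(lines), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts, splitting camelCase identifiers."""
    return Counter(word.lower() for word in re.findall(r"[A-Za-z][a-z]*", text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(repo: dict) -> list:
    """Stand-in for the vector store: (path, chunk text, vector) tuples."""
    return [
        (path, body, embed(path + " " + body))
        for file_path, text in repo.items()
        for path, body in chunk(file_path, text)
    ]


def search(query: str, index: list, k: int = 2) -> list:
    """Return the paths of the top-k chunks with a nonzero similarity score."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, vec), path) for path, _, vec in index]
    ranked = sorted((item for item in scored if item[0] > 0), key=lambda item: -item[0])
    return [path for _, path in ranked[:k]]


index = build_index(REPO)
print(search("dead letter queue", index))
```

Even with this crude scoring, the query surfaces the dead letter queue module first and leaves the unrelated UI file out, which is the behavior a retrieval layer is meant to give the model before it opens any files.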

memsearch keeps useful memory across sessions and agents

memsearch tackles the opposite side of the problem: not what to forget, but how to recall what matters.

Imagine you tell Claude on Monday:

Our webhook can’t retry on failure — failed events need to go into a dead letter queue.

On Wednesday, you open a new session and ask:

What else can we optimize in the webhook layer?

Without durable memory, Claude treats Monday’s decision as if it never happened. You explain it again.

memsearch stores memory as local, human-readable Markdown files and uses Milvus as a rebuildable retrieval index. That design keeps memory editable by humans while still making it searchable for agents.

At retrieval time, memsearch uses progressive recall: search first, expand if needed, then drill down to the original transcript only when necessary.

memsearch progressive retrieval flow showing search, expand, transcript, and summarized return to the main conversation
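
The search, expand, transcript sequence can be illustrated with plain Python over a Markdown string. Everything here is an assumption made for the sketch: the memory file layout, the `transcript:` convention, and the three helper functions are hypothetical, and keyword overlap stands in for the Milvus-backed index that memsearch uses.

```python
import re

# Hypothetical memory file and transcript store.
MEMORY_MD = """\
## Monday: webhook retries
- Decision: the webhook must not retry on failure.
- Failed events go into a dead letter queue.
- transcript: sessions/monday.txt

## Tuesday: UI polish
- Decision: keep the settings page single-column.
- transcript: sessions/tuesday.txt
"""

TRANSCRIPTS = {
    "sessions/monday.txt": "user: Our webhook can't retry on failure; failed events need a dead letter queue.",
}


def sections(md: str) -> list:
    """Split the Markdown memory file into '## ' sections."""
    return ["## " + part.strip() for part in re.split(r"(?m)^## ", md) if part.strip()]


def keywords(text: str) -> set:
    """Lowercased words of four or more letters (a crude stopword filter)."""
    return {word for word in re.findall(r"[a-z]+", text.lower()) if len(word) >= 4}


def search(md: str, query: str) -> list:
    """Step 1: return only the headings of matching sections, best first."""
    query_words = keywords(query)
    scored = []
    for section in sections(md):
        overlap = len(query_words & keywords(section))
        if overlap:
            scored.append((overlap, section.splitlines()[0]))
    return [heading for _, heading in sorted(scored, key=lambda item: -item[0])]


def expand(md: str, heading: str):
    """Step 2: return the full section for one heading."""
    for section in sections(md):
        if section.splitlines()[0] == heading:
            return section
    return None


def transcript(section: str, store: dict):
    """Step 3: follow the section's transcript link, only if needed."""
    match = re.search(r"transcript: (\S+)", section)
    return store.get(match.group(1)) if match else None


hits = search(MEMORY_MD, "What else can we optimize in the webhook layer?")
detail = expand(MEMORY_MD, hits[0])
print(hits)
print(detail)
print(transcript(detail, TRANSCRIPTS))
```

Each step returns more text than the one before it, so the conversation pays for the transcript only when the heading and section were not enough.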

This Markdown-first pattern is useful for teams working across sessions, models, and agents. It also pairs naturally with long-term AI agent memory, shared multi-agent memory, and the broader problem of preventing context rot in agent systems.

How do these tools work together?

The seven tools are complementary, not interchangeable. Use them as layers.

A practical rollout order is:

1. Kill obvious noise first. Add RTK or Context Mode if shell output and tool payloads dominate your context.
2. Fix repository navigation. Add code-review-graph for structure or claude-context for semantic code retrieval.
3. Control what remains. Use Token Savior and Caveman to keep file reads and model responses compact.
4. Preserve durable knowledge. Use memsearch when repeated explanations become the bottleneck.

Keep in touch

Join the Milvus Discord community to ask questions and compare context-management patterns with other developers.
Book a free Milvus Office Hours session if you want help designing a retrieval layer for code, memory, or RAG workloads.
If you’d rather skip the infrastructure setup, Zilliz Cloud (managed Milvus) offers a free tier to get started.

Frequently Asked Questions

How do I reduce Claude Code token usage without losing useful context?

Start by compressing the noisiest inputs: terminal output, raw tool payloads, and repeated code reads. Then add retrieval tools such as claude-context or code-review-graph so Claude can pull relevant code instead of exploring the repository from scratch.

Should I use claude-context or code-review-graph for a large repo?

Use claude-context when you need semantic code search, especially when you do not know the exact file or symbol name. Use code-review-graph when you need structural answers such as call relationships, imports, test dependencies, and review blast radius.

Is memory different from code retrieval in Claude Code?

Yes. Code retrieval finds relevant project files or symbols. Memory retrieval recalls durable decisions, user preferences, debugging history, and cross-session lessons. memsearch focuses on memory; claude-context focuses on code retrieval.

Do these tools replace prompt caching or a larger context window?

No. Prompt caching and large context windows help with capacity and cost, but they do not decide what information deserves attention. Context-management tools improve the quality and density of what enters the model in the first place.