Stories by Fendy Feng on Medium

Why Context Engineering Is Becoming the Full Stack of AI Agents

Fendy Feng — Fri, 15 Aug 2025 06:19:25 GMT

AI agents are everywhere in 2025 — almost every company is building chatbots, coding assistants, and knowledge bases at breakneck speed. Building a cool demo is easy, but taking an agent from demo to a reliable, production-ready system is a different challenge. And almost every team that makes that leap runs into the same roadblock: context.

Your chatbot can’t see your customer history. Your coding assistant doesn’t understand your codebase. Your business AI can’t tap into real-time data or take meaningful action…

To work around this problem, teams have tried all kinds of fixes: carefully crafted prompts, sophisticated RAG implementations, 128k+ context windows, and standardized protocols like MCP to connect components.

All of these work. But they’re often siloed — developed and deployed independently. We end up with brittle systems that solve one piece of the puzzle but don’t give the AI a complete picture.

What if context wasn’t an afterthought? What if it was the foundation? That’s where Context Engineering comes in.

How to Understand “Context” in AI Agents

Before we dive into context engineering, we need to get clear on what “context” actually means — especially when building a reliable AI agent.

Context isn’t just the single prompt you send to an LLM. It’s everything the model sees before it generates a response. The more relevant and higher-quality the context is, the better the output and the smarter the agent’s actions. And the reverse is just as true: garbage in, garbage out.

In agentic AI systems, context comes from multiple sources — most of which current systems struggle to coordinate effectively:

Instructions / System Prompt — The initial rules that shape the model’s behavior, often with examples and constraints (e.g., “Act as a technical support agent with access to our knowledge base”)
User Prompt — The immediate task or question from the user (“My API calls are failing with 503 errors”)
State / History (Short-Term Memory) — The conversation so far, including both user and model messages
Long-Term Memory — Persistent knowledge gathered across many interactions, such as user preferences, summaries of past work, or remembered facts
Retrieved Information (RAG) — Relevant, up-to-date knowledge pulled from documents, vector databases like Milvus, or APIs (recent documentation, similar resolved issues)
Available Tools — Functions or built-in capabilities the model can call (e.g., check_system_status, create_support_ticket, escalate_to_engineer)
Structured Output Definitions — Expected formats for the model’s response, such as JSON schemas or specific formatting requirements
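As a rough illustration, the sources listed above could be gathered into a single structure before each model call. This is a minimal Python sketch: `AgentContext`, its field names, and `missing_sources` are hypothetical and not part of any framework. The sample values reuse the support-agent examples from the list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentContext:
    """Hypothetical container for everything the model sees on one call."""
    system_prompt: str
    user_prompt: str
    history: list[str] = field(default_factory=list)           # short-term memory
    long_term_memory: list[str] = field(default_factory=list)  # persisted facts
    retrieved_docs: list[str] = field(default_factory=list)    # RAG results
    tools: list[str] = field(default_factory=list)             # callable tool names
    output_schema: Optional[dict] = None                       # e.g. a JSON schema

    def missing_sources(self) -> list[str]:
        """Name the optional sources that are still empty."""
        optional = {
            "history": self.history,
            "long_term_memory": self.long_term_memory,
            "retrieved_docs": self.retrieved_docs,
            "tools": self.tools,
            "output_schema": self.output_schema,
        }
        return [name for name, value in optional.items() if not value]


ctx = AgentContext(
    system_prompt="Act as a technical support agent with access to our knowledge base",
    user_prompt="My API calls are failing with 503 errors",
    tools=["check_system_status", "create_support_ticket", "escalate_to_engineer"],
)
print(ctx.missing_sources())
```

A check like `missing_sources` is one simple way to notice when an agent is about to answer with only part of the picture.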

Image Credit: Philschmid (with thanks to Philschmid for the insights)

The challenge? Most agents operate with only 20–30% of the context they could have — and it shows in their performance. They give generic responses when they should be personalized, suggest outdated solutions when fresh information exists, or fail to take action when they have the tools to help. Context engineering is about closing that gap.

So, What’s Context Engineering?

Since context is everything an AI needs to perform well, context engineering is the discipline of designing systems that ensure the model gets it — all of it, in the right format, at the right time.

This term might be new, but the idea isn’t. We’ve been moving toward it for years — just without giving it a name. Techniques like RAG, prompt engineering, function calling, MCP, and others are all pieces of the puzzle. Context engineering is about putting those pieces together into a coherent whole.

Think of it as the new full stack for building agentic AI. If traditional full-stack development connects the frontend, backend, and database into a working application, context engineering connects knowledge, tools, and reasoning into a seamless intelligence layer that agents can actually work with.

At its core, context engineering is about three key principles:

Dynamic adaptation — Context shapes itself to the current task and system state, not just static templates
Just-in-time assembly — The right information and tools arrive precisely when needed, not dumped all at once
Optimal formatting — Everything is structured so the LLM can understand and act on it effectively
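To make "just-in-time assembly" a little more concrete, here is a minimal Python sketch that picks the highest-priority context pieces that fit a budget. The function name `assemble_context`, the priority numbers, and the sample snippets are all made up for illustration, and counting whitespace-separated words is only a stand-in for real token counting.

```python
def assemble_context(candidates, budget_tokens):
    """Pick the highest-priority pieces that fit within a token budget.

    candidates: list of (priority, text) tuples; higher priority wins.
    Token cost is approximated by the number of whitespace-separated words,
    which is an assumption for this sketch, not how real tokenizers count.
    """
    chosen, used = [], 0
    for priority, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = len(text.split())
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return "\n\n".join(chosen)


# Made-up sample snippets with hand-assigned priorities.
candidates = [
    (3, "User reports 503 errors on the payments API."),
    (2, "Status page: payments API degraded since 09:00 UTC."),
    (1, "Company history: founded in 2015 as a two-person startup in a garage."),
]
prompt = assemble_context(candidates, budget_tokens=16)
print(prompt)
```

The low-priority company history is dropped because it does not fit the budget, which is the point: relevant pieces go in, the rest stays out.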

When you treat context as infrastructure — not just text you drop into a prompt — you create agents that can reason about complete situations, take meaningful actions, and improve their performance over time.

How Context Engineering Differs from Prompt Engineering and RAG

Even after understanding context and context engineering, it’s easy to confuse them with prompt engineering or RAG — they share some techniques, but the scope and goals are different.

Prompt Engineering — The art of crafting inputs that guide an LLM’s behavior. This includes few-shot examples, role-play, formatting rules, and tone control. It’s powerful for shaping responses, but it can’t add missing knowledge or trigger real-world actions.

RAG (Retrieval-Augmented Generation) — One of the first solutions to reduce hallucination. It fetches relevant documents from a vector database (like Milvus) and injects them into the prompt at runtime. While great for keeping the model up to date, it focuses only on adding external knowledge — not managing task state, user preferences, or tool use.

Context Engineering — The umbrella discipline. It unifies retrieval, prompt design, tool orchestration, and dynamic adaptation into a single, engineered system. The goal is to ensure the agent always has the right information, tools, and formats — at the right time — to act effectively.

Think of it like this: prompt engineering is giving clear instructions, RAG is supplying the right ingredients, and context engineering is running the whole kitchen.

How Vector Databases Power Context Engineering

If context engineering is the new full stack, vector databases are its database layer — or long-term memory. That’s because the most relevant context for an AI agent almost always lives outside the LLM’s training data — in places like customer support transcripts, code repositories, knowledge base articles, sensor readings, and even images or audio files.

A vector database stores this information as embeddings, enabling semantic retrieval — finding what’s relevant based on meaning, not just keyword matches. This is critical for context engineering because it ensures the agent gets precisely the information it needs, exactly when it needs it.

Here’s how vector databases power context engineering:

Dynamic vector retrieval — Surface context relevant to the current task in real time, using semantic similarity rather than simple keyword search.
Multimodal support — Store and retrieve text, images, audio, video, or even embeddings from structured data in one system.
Freshness and updates — Keep the AI’s “working memory” current without retraining, enabling agents to adapt to new information instantly.
Scalability — Handle billions of vectors without degrading performance, supporting enterprise-scale deployments.
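The semantic retrieval idea behind these points can be sketched in a few lines of plain Python. This is a toy illustration with hand-written 3-dimensional vectors, not how a vector database is implemented: real systems use embeddings from a model and approximate indexes instead of a full scan. The names `cosine_similarity`, `top_k`, and `store` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k stored texts whose embeddings are closest to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "embeddings"; real ones come from an embedding model.
store = [
    ("How to fix 503 errors", [0.9, 0.1, 0.0]),
    ("Billing and invoices", [0.0, 0.2, 0.9]),
    ("Gateway timeout troubleshooting", [0.8, 0.3, 0.1]),
]
results = top_k([1.0, 0.2, 0.0], store)
print(results)
```

Note that the two troubleshooting entries rank highest even though they share no keywords with each other; the ranking depends only on vector direction.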

In a context-engineered system, vector databases don’t work alone. They operate alongside:

A prompt construction layer that formats retrieved data for the LLM.
A tool invocation layer that enables actions beyond text generation.
A feedback loop that refines retrieval and tool use based on results.

This turns the vector database from a passive storage engine into an active part of the agent’s reasoning process — not just answering queries, but shaping the decisions the agent makes.

Why Milvus Fits Perfectly into Context Engineering for Production Agents

When it comes to powering context engineering, particularly for building production-level AI Agents, not all vector databases are equal. You need a system that can handle huge volumes of embeddings, adapt to multiple data modalities, and deliver results in milliseconds — without becoming a bottleneck. That’s where Milvus comes in.

Milvus is an open-source vector database designed from the ground up for high-performance semantic search at scale. It’s built to store, index, and retrieve billions of vectors efficiently, making it ideal for scalable, production-level AI agents that depend on timely, high-quality context.

Here’s why Milvus stands out for context engineering:

Scale without compromise — Whether you’re indexing millions or billions of vectors, Milvus maintains low-latency retrieval for real-time applications.
Multimodal ready — Handle text, images, audio, video, and embeddings from structured data in one system.
Flexible deployment — Run it on your own infrastructure or in the cloud with Zilliz Cloud for a fully managed, hassle-free experience.
Rich ecosystem — Integrates seamlessly with RAG frameworks, AI development tools, and your existing data pipelines.
Hybrid search excellence — Combine semantic similarity with metadata filters and keyword search for complex business queries like “Find pricing documents John accessed in the last two weeks mentioning API rate limits with positive customer sentiment”
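The hybrid search idea in the last point (metadata filters plus keyword matching plus vector similarity) can be sketched in memory with plain Python. This is not the Milvus API; in a real deployment the filters would typically be part of the search request to the database rather than application code. `Doc`, `hybrid_search`, and the sample records are hypothetical, and the sentiment condition from the example query is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    embedding: list[float]
    accessed_by: str
    days_ago: int


def dot(a, b):
    """Inner-product score between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def hybrid_search(docs, query_vec, keyword, user, max_days, k=1):
    """Apply metadata and keyword filters first, then rank by vector score."""
    filtered = [
        d for d in docs
        if d.accessed_by == user
        and d.days_ago <= max_days
        and keyword.lower() in d.text.lower()
    ]
    return sorted(filtered, key=lambda d: dot(query_vec, d.embedding), reverse=True)[:k]


# Made-up sample records.
docs = [
    Doc("Pricing guide covering API rate limits", [0.9, 0.1], "john", 5),
    Doc("Pricing guide covering API rate limits (archived)", [0.95, 0.05], "john", 40),
    Doc("Onboarding checklist", [0.1, 0.9], "john", 3),
    Doc("API rate limits FAQ", [0.7, 0.3], "maria", 2),
]
hits = hybrid_search(docs, [1.0, 0.0], keyword="rate limits", user="john", max_days=14)
print([d.text for d in hits])
```

The archived copy has the best vector score but is excluded by the time filter, which shows why similarity alone is not enough for business queries.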

In a context-engineered architecture, Milvus is more than a data store — it’s the engine that ensures your AI agent always has the right knowledge at its fingertips. And when you pair it with Zilliz Cloud, you get enterprise-grade reliability, elasticity, and global availability, without having to worry about cluster management or scaling headaches.

Ready to Build Smarter Agents with Better Context?

The smartest AI agents aren’t powered by the biggest models — they’re powered by the best context. Context engineering makes it possible, and it starts with a robust vector database you can trust.

Milvus gives you open-source freedom and billion-scale performance. Zilliz Cloud takes it further with a fully managed service built on Milvus — enterprise-grade reliability, elastic scaling, and zero infrastructure headaches.

🚀 Get started your way:

Run Milvus locally or in your own environment.
Or try Zilliz Cloud with $100 in free credits: no setup, no ops.

You can also reach out to us to see what purpose-built vector infrastructure can do for your AI agents.

Claude Code vs. Gemini CLI vs. Cursor vs. Qwen Code: Comparing Top AI Coding Assistants

Fendy Feng — Tue, 29 Jul 2025 15:50:39 GMT

The AI coding assistant market got crowded fast. There are now many players: Claude Code, Gemini CLI, Cursor, and the recently released Qwen Code. From a user’s perspective, more players means more options, which is a good thing.

But with different pricing models, feature sets, and philosophies, choosing the right tool has become a headache. After extensive research and community feedback from developers using these tools in production, here’s what you need to know to make the right choice.

By the way, at the end of this post I will also share Code Context, an open-source MCP plugin that adds semantic code search to all of these coding assistants, giving them deep context from your entire codebase.

What Are They?

Claude Code is a terminal-native AI agent that embeds Claude 3.7 Sonnet directly into your command line. It operates as an agentic tool that can understand entire codebases, execute commands, and make coordinated changes across multiple files. The tool uses an incremental permission system, asking for approval at each step rather than making autonomous changes.

Gemini CLI brings Google’s Gemini 2.5 Pro to the command line as a fully open-source tool under Apache 2.0 license. It’s designed as a versatile local utility that integrates with Google Search and supports the Model Context Protocol (MCP) for extensibility. The tool can handle everything from coding to content generation and research.

Cursor is an AI-powered code editor built as a fork of VS Code. It integrates AI capabilities directly into the editing experience, providing real-time suggestions, autocomplete, and multi-file editing capabilities. The tool combines traditional IDE features with advanced AI through both quick completions and an Agent mode for complex tasks.

Qwen Code is a recently released command-line AI workflow tool built on the Qwen3-Coder-480B Mixture-of-Experts model. Adapted from Gemini CLI, it features enhanced parser support specifically optimized for Qwen models and can be integrated with other platforms like Claude Code and Cline.

Cost

Claude Code operates on a subscription model with pay-per-use characteristics. The Pro plan costs $20/month while the Max plan reaches $100/month. Real-world usage can get expensive quickly, with developers reporting costs around $8 for 90 minutes of intensive work. The pay-per-use nature means costs can be unpredictable for heavy users.

Gemini CLI offers the most generous free tier in the market. Users get 60 model requests per minute and 1,000 requests per day at no charge when using a personal Google account with a free Gemini Code Assist license. For professional usage, developers can upgrade to usage-based billing through Google AI Studio or Vertex AI.

Cursor uses a straightforward subscription model at $20/month for Pro features, which includes 500 premium model requests. This provides predictable monthly costs and makes budgeting easier for teams. The fixed pricing structure appeals to developers who want to avoid usage-based billing surprises.

Qwen Code costs depend entirely on your deployment choice. As an open-source model, you can self-host or use it through DashScope API. This flexibility potentially offers the lowest cost for high-volume usage, especially for organizations that can efficiently manage their own infrastructure.

Context Window

Claude Code leverages Claude 3.7 Sonnet’s context capabilities, though specific limits aren’t publicly detailed. In practice, developers report it can handle massive codebases and files, with successful operations on 18,000+ line files that other tools failed to process.

Gemini CLI provides access to Gemini 2.5 Pro’s massive 1 million token context window. This enormous context capacity makes it excellent for understanding large codebases and handling complex, multi-file operations without losing track of dependencies and relationships.

Cursor uses a mix of purpose-built and frontier models with varying context windows depending on the specific model being used. The tool is optimized for typical development workflows and handles most coding projects effectively, though it may not match the extreme context capabilities of some competitors.

Qwen Code supports 256K tokens natively and can be extended up to 1M tokens using extrapolation methods like YaRN. This large context window is specifically optimized for repository-scale operations and dynamic data like pull requests, making it well-suited for agentic coding tasks.

Capabilities & Features

Claude Code excels at deep codebase understanding, complex debugging, and systematic problem-solving. It can read terminal logs, understand linting errors, run CLI commands, and handle entire GitHub workflows from issue analysis to PR submission. The tool integrates with test suites and build systems while maintaining strong version control integration.

Gemini CLI combines coding capabilities with broader utility functions. It can ground prompts with Google Search, supports MCP for extensibility, allows custom prompts and instructions, and can be automated through scripting. The tool handles both development tasks and general research, making it versatile beyond pure coding.

Cursor provides sophisticated autocomplete that often predicts developer intent, multi-file editing capabilities through Agent mode, and seamless integration with existing VS Code workflows. Features include diff previews, versioned checkpoints, integrated terminal, and Bug Bot for AI-powered code review during development.

Qwen Code offers strong agentic coding capabilities with enhanced parser support for Qwen models. It can handle code understanding, editing large codebases, workflow automation, and complex operational tasks like handling pull requests and rebases. The tool achieves 37.5% accuracy on agentic coding benchmarks.

Performance

Claude Code consistently produces higher-quality code requiring fewer iterations. Developers report superior debugging capabilities and better handling of complex architectural decisions. It can successfully work with massive files and codebases that challenge other tools, though it operates more slowly due to its permission-based approach.

Gemini CLI delivers solid performance with the advantage of Google Search integration for real-time information access. The large context window enables handling of complex projects, though it may not match Claude Code’s reasoning depth for sophisticated debugging scenarios.

Cursor provides the fastest user experience for routine coding tasks. Its tab-complete functionality is exceptionally responsive, often feeling predictive rather than reactive. Agent mode can handle complex refactoring effectively, though it sometimes suggests unnecessary changes or tries to accomplish too much in a single iteration.

Qwen Code demonstrates strong performance on agentic coding tasks, with benchmark results showing competitive accuracy. The tool benefits from being specifically optimized for coding workflows, though real-world performance can vary depending on infrastructure and configuration choices.

Ecosystem Integration and Community Support

Claude Code integrates well with existing development tools and workflows, particularly version control systems and CI/CD pipelines. Being developed by Anthropic, it receives regular updates and improvements, though the community around it is smaller compared to open-source alternatives.

Gemini CLI benefits from being fully open-source with active community contribution encouraged. It supports emerging standards like MCP and can be extended through plugins. The Google backing provides stability, while the Apache 2.0 license enables community-driven development and transparency.

Cursor has built a strong community of developers, particularly those migrating from VS Code. The tool maintains compatibility with VS Code extensions, themes, and keybindings, making adoption seamless. Regular feature updates and responsive development have created positive momentum in the developer community.

Qwen Code leverages the broader open-source AI community and integrates with multiple platforms including Claude Code and Cline. As part of the Qwen ecosystem, it benefits from ongoing model improvements and community contributions, though the ecosystem is still developing compared to more established tools.

Ease of Use

Claude Code requires comfort with terminal-based workflows, which can be a barrier for some developers. However, once familiar with the interface, many find the conversational approach intuitive. The permission-based system adds friction but builds trust and understanding of what the tool is doing.

Gemini CLI offers straightforward command-line usage with simple installation through npm. The interface is clean, and the extensive free tier allows for experimentation without cost concerns. Being terminal-based, it demands the same comfort with the command line as Claude Code.

Cursor provides the most accessible experience for developers familiar with VS Code. The migration process is seamless, importing existing configurations in one click. The GUI-based approach with visual feedback makes it immediately familiar to most developers, reducing the learning curve significantly.

Qwen Code requires more technical setup and configuration than commercial alternatives. This provides flexibility, but it also means a steeper learning curve and more up-front effort. Documentation is still developing as the project matures.

Security and Privacy

Claude Code offers enterprise-grade security but processes code through Anthropic’s servers. While Anthropic has strong security practices, sensitive codebases may require careful consideration of what information is shared with external services.

Gemini CLI is open source, so its code can be audited, though requests are typically processed on Google’s servers. That transparency enables security review of the client, but data handling still follows Google’s privacy policies for API usage.

Cursor provides SOC 2 certification and offers Privacy Mode where code is never stored remotely without explicit consent. This addresses major privacy concerns while maintaining the benefits of cloud-based AI processing. The privacy controls are granular and developer-friendly.

Qwen Code offers the most control over security and privacy since it can be completely self-hosted. Organizations can run the model on their own infrastructure, ensuring sensitive code never leaves their environment. This makes it attractive for companies with strict security requirements.

When to Use Each

Choose Claude Code when working on complex, large codebases where debugging and code quality are critical. It’s ideal for professional development teams that can justify the premium cost through improved code quality and reduced debugging time. Best for developers comfortable with terminal workflows who need sophisticated reasoning capabilities.
Choose Gemini CLI when starting with AI coding tools, thanks to its generous free tier, or when work involves frequent research and documentation lookup. It’s excellent for learning new frameworks, working across multiple environments, and for developers who value open-source transparency and community-driven development.
Choose Cursor for developers prioritizing speed, rapid prototyping, and a polished IDE experience. It’s perfect for teams migrating from VS Code who want immediate productivity gains with minimal learning curve. Ideal for day-to-day coding tasks, quick iterations, and developers who prefer GUI-based workflows.
Choose Qwen Code when you need cutting-edge open-source AI with maximum customization control. It’s best for organizations building internal developer tools, teams with specific security requirements, or those who want complete control over their AI coding infrastructure while potentially achieving the lowest long-term costs.

AI Coding Copilots All Have a Code Search Problem. Fix it with Code Context.

While these coding tools are powerful, they all have a code search problem: they can’t search context properly. Claude Code still relies on plain grep. Cursor uses very simple vector search. Their context retrieval? Honestly, pretty mediocre. Gemini CLI and Qwen Code are no better.

So, to understand your codebase and retrieve the right code snippet for your needs, your coding assistant needs to understand and search the context first. Code Context is one of the solutions to this problem.

Code Context is an open-source, MCP-compatible plugin that transforms any AI coding agent into a context-aware powerhouse. It’s like giving your AI assistant the institutional memory of a senior developer who’s worked on your codebase for years. Whether you’re using Qwen Code, Claude Code, Gemini CLI, working in VS Code, or even coding in Chrome, Code Context brings semantic code search to your workflow.

Semantic Code Search via Natural Language
Multi-Language Support: Search seamlessly across 15+ programming languages, including JavaScript, Python, Java, and Go, with consistent semantic understanding across them all.
AST-Based Code Chunking: Code is automatically split into logical units, such as functions and classes, using AST parsing, ensuring search results are complete, meaningful, and never cut off mid-function.
Live, Incremental Indexing: Code changes are indexed in real time. As you edit files, the search index stays up to date — no need for manual refreshes or re-indexing.
Fully Local, Secure Deployment: Run everything on your own infrastructure. Code Context supports local models via Ollama and indexing via Milvus, so your code never leaves your environment.
First-Class IDE Integration: The VS Code extension lets you search and jump to results instantly, right from your editor, with zero context switching.
MCP Protocol Support: Code Context speaks MCP, making it easy to integrate with AI coding assistants and bring semantic search directly into their workflows.
Browser Plugin Support: Search repositories directly from GitHub in your browser — no tabs, no copy-pasting, just instant context wherever you’re working.
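To show what AST-based chunking means in practice, here is a minimal sketch using Python's standard `ast` module. It is an illustration of the general technique for Python source only, not Code Context's actual implementation, and `chunk_by_definition` and `SAMPLE` are hypothetical names.

```python
import ast


def chunk_by_definition(source):
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact text spanned by the node,
            # so a chunk never ends in the middle of a function body.
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


SAMPLE = '''
import os

def load(path):
    with open(path) as f:
        return f.read()

class Indexer:
    def add(self, doc):
        self.docs.append(doc)
'''

chunks = chunk_by_definition(SAMPLE)
for name, text in chunks:
    print(name, len(text.splitlines()))
```

Each chunk is a complete function or class, so it can be embedded and retrieved as a meaningful unit instead of an arbitrary window of lines.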

Check out its GitHub repo and give it a try. Feel free to share your feedback with us.

The Great AI Agents Protocol Race: Function Calling vs. MCP vs. A2A

Fendy Feng — Sat, 28 Jun 2025 16:03:01 GMT

If you’ve been keeping an eye on the AI dev world lately, you’ve probably noticed something: everyone is now talking about AI Agents — not just smart chatbots, but full-blown autonomous programs that can use tools, call APIs, and even collaborate with each other. LangChain and OpenAI even had a debate over the definition of “AI Agents.”

But as soon as you start building serious AI Agent systems, one big headache hits you: there’s no clear, universal way for Agents to work with tools — or with each other.

Right now, three major approaches are competing to define the future of AI agent architecture:

Function Calling: OpenAI’s pioneering approach, which teaches LLMs to make API calls like junior developers.
MCP (Model Context Protocol): Anthropic’s attempt to create a standard toolkit interface across models and services.
A2A (Agent-to-Agent Protocol): Google’s brand-new spec for letting different Agents talk to each other and work as a team.

Every major AI player — OpenAI, Anthropic, Google — is quietly betting that whoever defines these standards will shape the future agent ecosystem.

For developers building beyond basic chatbots, understanding these protocols isn’t just about keeping up — it’s about avoiding painful rewrites down the road.

Here’s what we’ll cover in this post:

What Function Calling is, why it made tool use possible, and why it’s not enough.
How MCP tries to fix the mess by creating a real protocol for tools and models.
What A2A adds by making Agents work together like teams, not loners.
How you should actually think about using them (without wasting time chasing hype).

Function Calling: The Pioneer with Growing Pains

Function Calling, popularized by OpenAI and now adopted by Meta, Google, and others, was the first mainstream approach to connecting LLMs with external tools. Think of it as teaching your LLM to write API calls based on natural language requests.

Figure 1: Function calling workflow (Credit @Google Cloud)

The workflow is straightforward:

User asks a question (“What’s the weather in Seattle?”)
LLM recognizes it needs external data
It selects the appropriate function from your predefined list
It formats parameters following JSON Schema:

{
  "location": "Seattle",
  "unit": "celsius"
}

Your application executes the actual API call
The LLM incorporates the returned data into its response
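The workflow above can be sketched from the application's side. In this minimal Python example the model call is simulated: the tool name and JSON arguments are hard-coded where a real LLM response would be. `WEATHER_TOOL`, `get_weather`, `REGISTRY`, and `execute_tool_call` are hypothetical names, and the weather data is canned.

```python
import json

# Tool description sent to the model. The parameter layout follows JSON Schema;
# the exact wrapper around it differs between providers.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}


def get_weather(location, unit="celsius"):
    """Stand-in for a real weather API call; returns canned data."""
    return {"location": location, "temperature": 18, "unit": unit}


REGISTRY = {"get_weather": get_weather}


def execute_tool_call(name, arguments_json):
    """Run the function the model selected with the arguments it produced."""
    if name not in REGISTRY:
        raise ValueError(f"unknown tool: {name}")
    arguments = json.loads(arguments_json)
    return REGISTRY[name](**arguments)


# Simulated model output for "What's the weather in Seattle?"
result = execute_tool_call("get_weather", '{"location": "Seattle", "unit": "celsius"}')
print(result)
```

The key design point is that the model only proposes a call; your application decides whether and how to execute it, then passes the result back to the model.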

For developers, Function Calling feels like giving your AI a cookbook of API recipes it can follow. For simple applications with a single model, it’s nearly plug-and-play.

But there’s a significant drawback when scaling: no cross-model consistency. Each LLM provider implements function calling differently. Want to support both Claude and GPT? You’ll need to maintain separate function definitions and handle different response formats.

It’s like having to rewrite your restaurant order in a different language for each chef in the kitchen. With M models and N tools, you can end up maintaining up to M×N separate integrations, and this M×N problem becomes unwieldy fast as you add more models and tools.
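To make the inconsistency concrete, here is the same tool wrapped for two providers. The shapes below reflect my understanding of OpenAI's Chat Completions `tools` format and Anthropic's Messages API `tools` format at the time of writing, so check the current docs before relying on them. The helper names `to_openai_tool` and `to_anthropic_tool` are hypothetical.

```python
def to_openai_tool(name, description, schema):
    """Wrap a tool in the shape used by OpenAI's Chat Completions tools list."""
    return {
        "type": "function",
        "function": {"name": name, "description": description, "parameters": schema},
    }


def to_anthropic_tool(name, description, schema):
    """Wrap the same tool in the shape used by Anthropic's Messages API."""
    return {"name": name, "description": description, "input_schema": schema}


schema = {
    "type": "object",
    "properties": {"location": {"type": "string"}},
    "required": ["location"],
}

openai_tool = to_openai_tool("get_weather", "Get the current weather", schema)
anthropic_tool = to_anthropic_tool("get_weather", "Get the current weather", schema)
print(sorted(openai_tool), sorted(anthropic_tool))
```

The JSON Schema is identical in both cases; only the envelope changes. Response parsing differs too, which is where most of the duplicated glue code ends up.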

Function Calling also lacks native support for multi-step function chains. If the output from one function needs to feed into another, you’re handling that orchestration yourself.

MCP (Model Context Protocol): The Universal Translator for AI and Tools

MCP (Model Context Protocol) addresses precisely these scaling issues. Backed by Anthropic and gaining support across models like Claude, GPT, Llama, and others, MCP introduces a standardized way for LLMs to interact with external tools and data sources.

How MCP Works

Think of MCP as the “USB standard for AI tools” — a universal interface that ensures compatibility:

Tools advertise their capabilities using a standardized format, describing available actions, required inputs, and expected outputs
AI models read these descriptions and can automatically understand how to use the tools
Applications integrate once and gain compatibility across the AI ecosystem
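Here is a simplified sketch of what "advertising capabilities" looks like on the wire. MCP messages are JSON-RPC 2.0, and tools are listed and invoked with the `tools/list` and `tools/call` methods. The Python below is a toy in-process handler, not the official MCP SDK: it skips initialization, transports, and most error handling, and `TOOLS` and `handle_request` are hypothetical names.

```python
TOOLS = {
    "check_system_status": {
        "description": "Report whether a named service is healthy",
        "inputSchema": {
            "type": "object",
            "properties": {"service": {"type": "string"}},
            "required": ["service"],
        },
        # Canned handler standing in for a real health check.
        "handler": lambda args: f"{args['service']}: ok",
    }
}


def handle_request(request):
    """Answer a JSON-RPC 2.0 request for the two tool methods sketched here."""
    method, params = request["method"], request.get("params", {})
    if method == "tools/list":
        result = {
            "tools": [
                {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
                for name, t in TOOLS.items()
            ]
        }
    elif method == "tools/call":
        text = TOOLS[params["name"]]["handler"](params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}]}
    else:
        # -32601 is the standard JSON-RPC "Method not found" code.
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


listing = handle_request({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
call = handle_request({
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {"name": "check_system_status", "arguments": {"service": "payments-api"}},
})
print(listing["result"]["tools"][0]["name"], call["result"]["content"][0]["text"])
```

Because the listing carries a name, a description, and an input schema, a client can discover tools at runtime instead of having them hard-coded per model.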

MCP transforms the messy M×N integration problem into a more manageable M+N problem.
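The arithmetic behind that claim is simple enough to check directly. In this small Python sketch (function names are hypothetical), each model-tool pair needs its own adapter without a shared protocol, while with one each side implements the protocol once.

```python
def integrations_without_standard(models, tools):
    """Every model needs its own adapter for every tool."""
    return models * tools


def integrations_with_standard(models, tools):
    """Each model and each tool implements the shared protocol once."""
    return models + tools


for m, n in [(2, 3), (4, 10), (5, 40)]:
    print(m, n, integrations_without_standard(m, n), integrations_with_standard(m, n))
```

With 4 models and 10 tools, that is 40 bespoke integrations versus 14 protocol implementations, and the gap widens as either number grows.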

The MCP Architecture

MCP uses a client-server model with four key components:

Figure 2: The MCP architecture (Credit @Anthropic)

MCP Hosts: The applications where users interact with AI (like Claude Desktop or AI-enhanced code editors)
MCP Clients: The connectors that manage communication between hosts and servers
MCP Servers: Tool implementations that expose functionality through the MCP standard
Data Sources: The underlying files, databases, APIs and services that provide information

If Function Calling is like having to speak multiple languages to different chefs, MCP is like having a universal translator in the kitchen. Define your tools once, and any MCP-compatible model can use them without custom code. This dramatically reduces the marginal cost of adding new models or tools to your application. As someone who’s dealt with integration headaches, that’s music to my ears.

A2A (Agent-to-Agent Protocol): The Team Coordinator for AI Agents

While Function Calling and MCP focus on model-to-tool interaction, A2A (Agent-to-Agent Protocol), introduced by Google, tackles a different challenge: How do we get multiple specialized agents to collaborate effectively?

As AI agent architectures grow more complex, it quickly becomes clear that no single agent should handle everything. You might have one agent specialized in document summarization, another in database queries, and another in user interaction.

A2A defines a lightweight, open protocol that lets different Agents:

Discover each other and advertise their capabilities,
Delegate tasks dynamically to the best-suited Agent,
Coordinate progress and share real-time updates securely.

Figure 3: How A2A works (Credit @Google)

A2A facilitates communication between a “client” agent that manages tasks and a “remote” agent that executes them. If Function Calling gives an agent access to tools, A2A lets agents form effective teams.

Consider hiring a software engineer: A hiring manager could task their agent to find candidates matching specific criteria. This agent then collaborates with specialized agents to source candidates, schedule interviews, and facilitate background checks — all through a unified interface.
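The hiring example can be sketched as a client agent that discovers remote agents by skill and delegates to them. This is an in-memory Python toy meant to show the discover-and-delegate pattern only; it is not the A2A wire protocol, which exchanges messages between agents over the network. `RemoteAgent`, `ClientAgent`, the skill names, and the canned replies are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RemoteAgent:
    name: str
    skills: set[str]               # capabilities this agent advertises
    handle: Callable[[str], str]   # executes a task and returns a result


class ClientAgent:
    """Manages tasks by finding a remote agent with the right skill."""

    def __init__(self):
        self.directory = []

    def register(self, agent):
        self.directory.append(agent)

    def discover(self, skill):
        """Return the first registered agent advertising the skill, if any."""
        return next((a for a in self.directory if skill in a.skills), None)

    def delegate(self, skill, task):
        agent = self.discover(skill)
        if agent is None:
            return f"no agent available for {skill}"
        return f"{agent.name}: {agent.handle(task)}"


client = ClientAgent()
client.register(RemoteAgent("sourcing-agent", {"sourcing"}, lambda t: f"found 3 candidates for {t}"))
client.register(RemoteAgent("scheduling-agent", {"scheduling"}, lambda t: f"booked interviews for {t}"))

sourced = client.delegate("sourcing", "senior backend engineer")
unhandled = client.delegate("background-check", "senior backend engineer")
print(sourced)
print(unhandled)
```

The client never needs to know how a remote agent does its job, only what it advertises, which is what lets independently built agents work as a team.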

Quick Comparison: Function Calling vs MCP vs A2A

It’s tempting to see these protocols as competitors, but they actually solve different pieces of the agent ecosystem puzzle:

Function Calling connects models to individual tools (limited but simple)
MCP standardizes tool access across different models (more scalable)
A2A enables collaboration between independent agents (higher-level orchestration)

In architectural terms, MCP answers “what tools can my agent use?” while A2A handles “how can my agents work together?”

This resembles how we structure complex software: individual components with well-defined interfaces, composed into larger systems. An effective agent ecosystem needs both tool interfaces (Function Calling/MCP) and inter-agent communication (A2A).

What This Means for Developers

So, what should you, as a developer building with AI, do with these competing standards?

For simple applications: Function Calling remains the quickest path to adding tool use to your LLM application, especially if you’re only using one model provider.
For cross-model compatibility: Consider adopting MCP, which gives you broader model support without duplicating integration work.
For complex multi-agent systems: Keep an eye on A2A, which could become crucial as agent ecosystems mature.

The smart play might be to layer these approaches: use Function Calling for quick prototyping, but implement MCP adapters for better scalability, with A2A orchestration for multi-agent workflows.

The Road Ahead

The conversation around what makes an “AI Agent” is still evolving — sometimes even debated between companies like OpenAI, Anthropic, and LangChain.

But regardless of definitions, one thing is clear: Standards like Function Calling, MCP, and A2A are laying the foundation for the next generation of AI applications.

For developers, understanding these patterns early is an investment in future-proofing your work. It’s how we move from toy demos to production-ready systems — the kind that solve real problems at scale. The agent ecosystem is developing rapidly, and building on these protocols now means positioning your applications for what’s coming next.

What do you think? Which protocols are you using in your AI projects? Are you betting on one standard winning out, or preparing for a multi-protocol future?

What Exactly Are AI Agents? Why OpenAI and LangChain Are Debating Over Their Definition?

Fendy Feng — Sun, 22 Jun 2025 08:47:50 GMT

TL;DR

At the simplest level, AI agents are software programs powered by artificial intelligence that can perceive their environment, make decisions, and take actions to achieve a goal — often autonomously.
OpenAI and LangChain recently debated what truly defines an agent — simplicity vs. flexibility is the core divide.
Agents differ from LLMs, chatbots, and workflows by being goal-driven, tool-using, and proactive.
AI Agents are already used in coding, business ops, healthcare, education, personal productivity, and many other areas.

🥊 The OpenAI vs. LangChain “AI Agent” Debate

The AI community witnessed a fascinating debate in early 2025 when OpenAI released its comprehensive guide to AI agents, which prompted a swift response from LangChain. This public exchange highlighted fundamental differences in how major players conceptualize AI agents and revealed important distinctions that every developer should understand.

Let’s talk drama first. 🙂

What Happened? What Sparked the Controversy?

OpenAI, in their new documentation for the Assistants API, explained how to build agents using their platform, complete with tools, memory, threads, and a planning architecture.
However, they described AI agents in a high-level, somewhat simplified manner: as large language models (LLMs) with memory and tools that can achieve goals.
Then, LangChain, whose entire framework revolves around agent workflows, dropped a response blog: “How to Think About Agent Frameworks”. And it didn’t pull punches.

LangChain’s Core Argument:

LangChain argued that OpenAI’s guide:

Oversimplifies what agents are — reducing them to just tool-using LLMs.
Misrepresents existing frameworks — implying LangChain-style agents are unstable or unreliable because of flaws in the architecture, not because of current limitations in LLM reasoning.
Ignores the core “agent loop” — the concept of an agent continuously reasoning and deciding what to do next is critical, and it’s not front and center in OpenAI’s model.

Why Do They See It Differently?

This isn’t just a clash of opinions. It’s a difference in philosophy and design priorities.

Who’s “Right”?

Honestly? Both have good points.

OpenAI wants to productize agents safely and cleanly for the average developer.
LangChain wants to push the boundaries of autonomy and reasoning, even if it’s messier.

So if you’re just getting started and want something that works? OpenAI’s Assistants API is solid. If you’re building ambitious workflows and need total control? LangChain might be the better fit.

The good news: this debate is fueling clarity in the space. It’s pushing the whole AI world to ask: “What does it really mean to build an autonomous, intelligent, goal-driven AI system?”

And that’s the question we’ll dig into for the rest of this post.

🔍 So, What Exactly Are AI Agents?

Imagine waking up to find your coffee already brewing, your calendar optimized for the day, and your inbox sorted with draft responses ready for your approval. Meanwhile, your code repository has been scanned overnight, bugs fixed, and tests automatically generated. Welcome to the future.

At the simplest level, AI agents are software programs powered by artificial intelligence that can perceive their environment, make decisions, and take actions to achieve a goal — often autonomously. Unlike traditional software that follows rigid, pre-programmed instructions, AI agents can operate with varying degrees of autonomy, learning from their interactions and adapting their behavior accordingly.

Think of an AI agent as a digital assistant on steroids — one that doesn’t just respond to your commands but anticipates needs, solves problems, and accomplishes tasks with minimal human supervision. The key distinction is autonomy and goal-orientation: agents are built to pursue objectives rather than simply process inputs.

To put it in everyday terms, if traditional software is like a bicycle that goes exactly where you steer it, an AI agent is more like a self-driving car that gets you to your destination while handling the navigation details itself.

How AI Agents Work

Let’s peek under the hood of these AI agents. At their core, AI agents follow what we call a “perception-think-action loop” — but don’t let the fancy term intimidate you. It’s actually pretty intuitive when you break it down:

The Perception-Think-Action Loop

Think of this as the agent’s basic rhythm:

Perception: First, your agent takes in information. This could be your typed request, data from APIs, sensor readings, or even the content of files. It’s basically gathering all the context it needs.
Reasoning: Now comes the thinking part. The agent (usually powered by a Large Language Model or LLM) processes what it’s perceived. It’s asking itself: “What’s really being asked here? What’s the goal? What information do I have and what do I need?”
Planning: This is where agents really shine compared to simpler AI systems. The agent maps out a sequence of steps to achieve the goal. If the task is complex, it might break it down into sub-tasks and determine dependencies.
Action: Time to get things done! The AI agent executes its plan by utilizing the tools at its disposal — it may call an API, query a vector database, generate code, or even control physical devices if they are connected to it.
Learning & Adaptation: After taking action, the agent evaluates the results. Did it work? If not, why? It uses this feedback to adjust its approach, either immediately for the current task or to improve future performance.

Let me share how this works with a concrete example. Say you tell your coding agent: “Create a weather dashboard for my city.”

1. Perception: It processes your request and understands you want a weather dashboard application.
2. Reasoning: It determines it needs to: find your location, access weather data, create a visualization interface, and package it as a usable application.
3. Planning: It maps out steps like:

First, determine your location (either ask you or use default settings)
Research weather APIs that offer the needed data
Design a UI layout with key weather metrics
Write front-end code for visualization
Set up API connections to fetch real-time data
Package everything into a deployable application

4. Action: The agent starts executing these steps. It might ask you for your location, generate API authentication code for a weather service, create HTML/CSS/JS for the dashboard, and test that the data flows correctly.

5. Learning: If you say the temperature display is too small, it adapts and regenerates that component with a larger font. It remembers this preference for future tasks.
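The loop in the weather dashboard example can be sketched in a few lines. This is a toy under stated assumptions: `plan` is a hard-coded stand-in for the LLM planner, the three tools and the "Berlin" and "18C" values are made up, and "learning" is reduced to storing each observation so later steps can use it.

```python
from typing import Callable, Dict, List


def plan(goal: str) -> List[str]:
    """Stand-in for the LLM planner: returns a fixed list of step names."""
    return ["get_location", "fetch_weather", "render_dashboard"]


# Hypothetical tools; each reads the shared state and returns an observation.
TOOLS: Dict[str, Callable[[dict], str]] = {
    "get_location": lambda state: "Berlin",
    "fetch_weather": lambda state: f"18C in {state['get_location']}",
    "render_dashboard": lambda state: f"<h1>{state['fetch_weather']}</h1>",
}


def run_agent(goal: str, max_steps: int = 10) -> dict:
    state: dict = {"goal": goal}           # perception: take in the request
    steps = plan(goal)                     # reasoning and planning
    for step in steps[:max_steps]:         # cap the loop so it always terminates
        observation = TOOLS[step](state)   # action: call the chosen tool
        state[step] = observation          # feedback: keep the result for later steps
    return state


final_state = run_agent("Create a weather dashboard for my city")
```

In a real agent the plan would be revised after each observation rather than fixed up front; the `max_steps` cap is there because such loops need a hard stop.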

Key Components of an AI Agent

Modern AI agents are complex systems composed of several critical components that work together to create intelligent, goal-oriented behavior. Let’s break down these essential building blocks:

1. Foundation AI Models

At the core of most AI agents is a foundation model, typically a Large Language Model (LLM) like GPT-4, Claude, or Llama that provides the reasoning capabilities. These models act as the “brain” of the agent, enabling it to:

Process and generate natural language
Understand context and nuance
Apply common sense reasoning to novel situations
Generate plans and evaluate alternatives

The choice of foundation model significantly impacts an agent’s capabilities, with more advanced models generally offering better reasoning but at higher computational costs.

2. Memory Systems

Unlike simple chatbots, sophisticated AI agents maintain various types of memory:

Short-term memory: Keeps track of the current conversation or task context
Long-term memory: Stores persistent information like user preferences or learned knowledge
Episodic memory: Records specific interactions or “experiences” for future reference

For instance, a customer service agent remembering your previous issues when you contact support again exemplifies effective memory utilization.

Vector databases like Milvus and Zilliz Cloud usually play a key role in powering the memory system of AI agents.
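The three memory types can be pictured as three small stores with different lifetimes. The sketch below is illustrative only: `AgentMemory` and its method names are hypothetical, and plain Python containers stand in for whatever database a real agent would use.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class AgentMemory:
    # Short-term: only the most recent turns, bounded like a context window.
    short_term: Deque[str] = field(default_factory=lambda: deque(maxlen=3))
    # Long-term: persistent facts such as user preferences.
    long_term: Dict[str, str] = field(default_factory=dict)
    # Episodic: records of past interactions and how they ended.
    episodes: List[Dict[str, str]] = field(default_factory=list)

    def observe(self, message: str) -> None:
        self.short_term.append(message)  # oldest turn is dropped when full

    def remember(self, key: str, value: str) -> None:
        self.long_term[key] = value

    def record_episode(self, summary: str, outcome: str) -> None:
        self.episodes.append({"summary": summary, "outcome": outcome})

    def build_context(self) -> str:
        """Assemble the text an agent could prepend to its next prompt."""
        facts = "; ".join(f"{k}={v}" for k, v in self.long_term.items())
        past = "; ".join(e["summary"] for e in self.episodes)
        recent = " | ".join(self.short_term)
        return f"facts: {facts}\npast: {past}\nrecent: {recent}"


memory = AgentMemory()
memory.remember("plan", "enterprise")
memory.record_episode("503 errors on /v1/search", "resolved by raising rate limit")
for turn in ["hi", "my API calls fail", "status 503", "same as last time"]:
    memory.observe(turn)
context = memory.build_context()
```

Only the last three turns survive in `short_term`, while the stored fact and the past episode remain available, which is the behavior the customer service example relies on.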

3. Tool Use Systems

Today’s most capable agents can leverage external tools to overcome the limitations of language models alone:

API connections to external services
Search engines and knowledge bases
Database access
Code execution environments
Other specialized AI models (like image generators)

This tool use capability transforms agents from passive responders to active problem-solvers that can affect the world outside their language model.
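One way to picture tool use is a registry of plain functions plus a dispatcher that executes whatever call the model emits. The JSON shape below (`name` plus `arguments`) is an illustrative assumption, not any specific provider’s format, and `lookup_order` with its sample data is made up.

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so the dispatcher can find it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def add(a: float, b: float) -> float:
    return a + b


@tool
def lookup_order(order_id: str) -> dict:
    orders = {"A17": {"status": "shipped"}}  # hypothetical sample data
    return orders.get(order_id, {"status": "unknown"})


def dispatch(tool_call_json: str) -> Any:
    """Parse a model-emitted tool call and run the matching function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# Pretend the model produced this string instead of a plain-text answer.
model_output = '{"name": "lookup_order", "arguments": {"order_id": "A17"}}'
observation = dispatch(model_output)
```

The result in `observation` would normally be fed back to the model as additional context for its next turn.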

4. Planning and Reasoning Systems

Advanced agents incorporate explicit planning components that help them break down complex goals:

Task decomposition: Breaking larger goals into manageable subtasks
Reasoning chains: Using techniques like chain-of-thought (CoT) to work through problems step-by-step
Self-reflection: Evaluating the quality of their own plans and outputs
Feedback incorporation: Learning from successes and failures to improve future plans
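Task decomposition with dependencies maps naturally onto a dependency graph. As a rough sketch, the standard library’s `graphlib.TopologicalSorter` (Python 3.9+) can turn such a graph into a valid execution order; the subtask names are hypothetical and borrowed from the weather dashboard example.

```python
from graphlib import TopologicalSorter

# Each subtask maps to the set of subtasks that must finish before it.
subtasks = {
    "determine_location": set(),
    "choose_weather_api": set(),
    "design_ui": set(),
    "write_frontend": {"design_ui"},
    "connect_api": {"choose_weather_api", "determine_location"},
    "package_app": {"write_frontend", "connect_api"},
}

# static_order() yields every subtask after all of its dependencies.
execution_order = list(TopologicalSorter(subtasks).static_order())
```

The order among independent subtasks is not fixed, but `package_app` always comes last because everything else feeds into it. A circular dependency raises `graphlib.CycleError`, which is a useful signal that the plan itself is broken.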

5. Agent Frameworks and Orchestration

Most production AI agents are built on specialized frameworks that handle the complex integration of the above components. For example:

LangChain: Provides modular components for building agents with memory, tool-use capabilities, and prompt management in a flexible architecture
LlamaIndex: Specializes in knowledge-intensive applications, particularly for retrieving and reasoning over document collections
OpenAI Agents SDK: Offers a simplified framework focused on reliable tool use with OpenAI’s models

These frameworks handle the complex plumbing needed for agents to function reliably, providing developers with abstractions for common agent patterns. Check out this blog for the most popular AI frameworks: 10 Open-Source LLM Frameworks Developers Can’t Ignore in 2025

6. Knowledge Retrieval Mechanisms

Truly useful agents need access to specific knowledge:

RAG (Retrieval-Augmented Generation): Allows agents to pull relevant information from documents or databases before generating responses
Knowledge graphs: Provide structured relationships between concepts for more precise reasoning
Vector search: Enables semantic similarity matching rather than just keyword lookups
Hybrid retrieval: Combines multiple approaches for more robust information access

The knowledge component is often what transforms a generic agent into a domain-specific expert that can provide genuinely valuable insights or assistance.
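The list above does not say how hybrid retrieval merges its inputs. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method any particular framework uses; the document IDs are made up and `k=60` is the conventional default for RRF.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each appearance adds 1 / (k + rank) to a doc's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from keyword search
vector_hits = ["doc_b", "doc_d", "doc_a"]   # e.g. from vector search
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

Documents that rank well in both lists (`doc_b`, `doc_a`) rise to the top, which is why rank fusion tends to be more robust than either retriever alone.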

7. Security and Safety Systems

As agents gain more capabilities, safeguards become increasingly important:

Input filtering: Screens requests for harmful content
Output moderation: Ensures responses meet safety guidelines
Authorization boundaries: Limits what actions agents can take
Monitoring systems: Tracks agent behavior and performance
Explainability tools: Makes agent reasoning transparent to users and developers

These systems transform experimental agents into reliable, production-ready systems that can be trusted in real-world environments.
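Authorization boundaries and monitoring can start very simply: an allowlist checked before every action, plus an audit log. The sketch below is a minimal illustration; `GuardedExecutor`, `ActionNotAllowed`, and the action names are hypothetical.

```python
from typing import Callable, List, Set


class ActionNotAllowed(Exception):
    """Raised when an agent requests an action outside its boundary."""


class GuardedExecutor:
    def __init__(self, allowed: Set[str]) -> None:
        self.allowed = set(allowed)
        self.audit_log: List[str] = []  # monitoring: every decision is recorded

    def execute(self, action: str, fn: Callable[[], str]) -> str:
        if action not in self.allowed:
            self.audit_log.append(f"DENIED {action}")
            raise ActionNotAllowed(action)
        self.audit_log.append(f"ALLOWED {action}")
        return fn()


executor = GuardedExecutor(allowed={"read_ticket", "draft_reply"})
reply = executor.execute("draft_reply", lambda: "Thanks, we are looking into it.")

try:
    executor.execute("delete_account", lambda: "deleted")
    blocked = False
except ActionNotAllowed:
    blocked = True  # the destructive action never ran
```

Denying by default, so that only listed actions run, keeps the boundary safe when new tools are added later.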

Vector Databases: The Backbone of Long-Term Agent Memory

As mentioned above, for AI agents to function effectively, they need a robust memory system that extends beyond short-term context. This is where vector databases emerge as a critical infrastructure component powering sophisticated agent architectures.

Vector databases store information as high-dimensional vectors — mathematical representations that capture the semantic meaning of data whether it’s text, images, audio, or other unstructured formats. This approach allows agents to perform similarity searches and retrieve contextually relevant information based on meaning rather than exact keyword matches. For example, when an agent encounters a new query, it can access its memory system to retrieve similar past interactions or relevant knowledge, enabling it to make informed decisions and adapt to new situations. Without such memory, agents would lack the continuity required for advanced reasoning and adaptive learning.
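Retrieval by meaning boils down to comparing vectors. The following sketch does a brute-force cosine similarity search over three stored memories; the 3-dimensional vectors are hand-written stand-ins (a real system would get them from an embedding model), and a vector database would replace the linear scan with an index.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


# (text, embedding) pairs; the embeddings are made up for illustration.
memory: List[Tuple[str, List[float]]] = [
    ("user prefers dark mode", [0.9, 0.1, 0.0]),
    ("user reported 503 errors last week", [0.1, 0.9, 0.2]),
    ("user is on the enterprise plan", [0.0, 0.2, 0.9]),
]


def recall_memories(query_vec: Sequence[float], top_k: int = 1) -> List[Tuple[float, str]]:
    """Return the top_k (score, text) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in memory]
    scored.sort(reverse=True)
    return scored[:top_k]


# A query vector that points in roughly the same direction as the 503 memory.
best = recall_memories([0.2, 0.8, 0.1])
```

Here `best[0][1]` is the 503 memory even though the query shares no keywords with it, because only the direction of the vectors is compared.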

AI Agents vs. Other AI Systems

OK, so now you’re probably wondering, “How are AI agents different from all the other AI stuff I’ve been using?” Great question! Let’s clear up some confusion by comparing agents with their AI cousins:

AI Agents vs. LLMs (Even Advanced Ones)

Think of modern LLMs like GPT-4, Claude, or DeepSeek as incredibly powerful brains waiting for direction. Here’s what separates them from true agents:

LLMs by themselves:

Function as “stateless” systems — forgetting context between sessions unless explicitly reminded
Generate impressive text, but can’t take actions beyond the chat interface
Respond to prompts rather than independently pursuing objectives

Even cutting-edge models with reasoning capabilities (like Claude 3.7 Sonnet with extended thinking or DeepSeek R1) and built-in search:

Can break down complex problems step-by-step
Access real-time information beyond their training data
Produce sophisticated analysis and explanations
But still operate within a reactive, prompt-response paradigm

What transforms an LLM into an agent:

Persistent memory architecture using vector databases and state management
Tool integration frameworks that enable a diverse action space
Planning systems that maintain progress toward defined goals
Feedback loops that allow adaptation based on outcomes

The difference is like having a brilliant consultant (LLM) versus an autonomous colleague (agent). The consultant gives excellent advice when asked but forgets you between meetings. The agent remembers your preferences, anticipates needs, takes initiative on your behalf, and learns from each interaction to serve you better over time.

AI Agents vs. AI Assistants

This is a subtle but important distinction that confuses many developers. AI assistants (like the basic versions of Siri, Alexa, or even Claude) are designed primarily to help users through conversation and simple predefined actions. They’re focused on the human-AI interaction.

AI agents go a step further:

They can operate independently, even when you’re not directly interacting with them
They have more agency to make decisions within their scope
They often work in the background on longer-running tasks
They can be more proactive rather than just reactive

For example, an AI assistant might help you book a flight when you ask it to. An AI agent might notice you’ve been discussing a trip, proactively research flight options based on your calendar availability, and then suggest the best times to book based on price trends it’s been monitoring.

AI Agents vs. AI Workflows

If you’ve built AI applications before, you might have created workflow chains or pipelines. These are predetermined sequences of AI operations linked together. While useful, they differ from agents in critical ways:

AI workflows are like assembly lines — efficient but rigid. They follow the same steps every time, and if something unexpected happens, they often break down. Agents are more like skilled workers who can adapt their approach based on circumstances.

Types of AI Agents

Not all AI agents are created equal. Let me walk you through the main types seen in the wild, with real examples that might help you understand their unique characteristics:

Task-Specific Agents

These are specialized agents designed to excel at particular jobs. They’re like expert contractors you bring in for specific work.

Example: GitHub Copilot for Docs

This coding documentation agent doesn’t just generate documentation — it reads codebases, understands function signatures and dependencies, analyzes existing documentation patterns, and then creates contextually appropriate docs that match team styles. It can work across multiple files, maintaining consistency in terminology and approach.

Autonomous Agents

These agents can work independently over extended periods with limited supervision. They’re more like employees than tools.

Example: AutoGPT

One of the first autonomous agents that caught widespread attention. You give it a high-level goal like “Create a successful blog about renewable energy,” and it breaks this down into subtasks: researching current trends, identifying target audiences, planning content categories, drafting articles, finding relevant images, setting up publishing schedules, and analyzing traffic patterns to optimize future content. It can spend days or weeks pursuing these goals, making adjustments based on results.

Multi-Agent Systems

These involve multiple specialized agents working together, like a team with different roles.

Example: AgentVerse

This framework exemplifies the multi-agent approach. In a content production environment, it might deploy:

A research agent that gathers information on trending topics
A planning agent that outlines content structure
Multiple specialist writers focused on different aspects (technical details, beginner explanations, etc.)
An editor agent that ensures consistency across pieces
A feedback agent that analyzes user engagement
A coordinator agent that manages workflows and resolves conflicts

The magic happens in the interactions — agents can debate approaches, request clarification from each other, and collaboratively solve problems in ways none could individually.

Embodied Agents

These agents control or interact with physical systems in the real world.

Example: Amazon’s Warehouse Robots

These have evolved from simple path-following machines to sophisticated agents that adaptively navigate dynamic environments. They can reroute around obstacles, prioritize packages based on shipping deadlines, coordinate with other robots to prevent bottlenecks, and even predict and preposition themselves for anticipated order volumes.

Use Cases for AI Agents

Let’s explore how AI agents are actually being used right now across different industries. These examples represent what’s truly possible with today’s technology:

Software Development

In modern development workflows, coding agents transform productivity. A modern coding agent doesn’t just write code snippets — it functions as a true development partner. Feed it a product spec, and it will architect a solution, generate the code across multiple files and functions, create appropriate tests, and then help debug any issues.

For example, at recent hackathons, teams have used agents to build entire image processing applications. The agent handles everything from setting up the React frontend to implementing the backend APIs and database schema. When teams run into performance bottlenecks with large image processing, the agent analyzes the code, identifies the issue, and implements a more efficient algorithm, complete with proper error handling and edge case management. What would take days of work is accomplished in hours.

Business Operations

Finance departments have been early adopters of agent technology. Many CFOs deploy accounting agents that completely transform month-end close processes. These agents don’t just process transactions — they reconcile accounts across multiple systems, identify discrepancies, follow up on missing documentation, prepare financial statements with explanatory notes, and even suggest journal entries to correct issues they discover.

The game-changer is how they handle exceptions. Rather than simply flagging problems for humans to resolve, they can reason through complex accounting rules to suggest appropriate treatments for unusual transactions. When encountering truly novel situations, they research accounting standards, propose solutions with citations to relevant guidance, and learn from accountants’ feedback to handle similar situations autonomously in the future.

Healthcare

Healthcare providers are using monitoring agents that go far beyond traditional alert systems. Hospitals implement patient monitoring agents that integrate data from electronic health records, bedside monitors, medication administration systems, and lab results. These agents don’t just notify staff when readings exceed thresholds — they understand clinical context.

For instance, when a patient’s oxygen saturation drops, the agent checks recent medication administration, position changes, and historical patterns for that patient. It can distinguish between temporary fluctuations and concerning trends, only alerting staff when truly necessary. Over time, it learns each patient’s baseline and normal variations, dramatically reducing false alarms while catching subtle early warning signs of deterioration that static monitoring would miss.

Education

Educational agents are evolving from simple tutoring programs to comprehensive learning companions. University professors develop research mentor agents to support graduate students. These agents don’t just answer questions — they help shape the entire research process.

When a student begins a project, the agent helps refine research questions, suggests methodological approaches, identifies potential difficulties, and maps out a realistic timeline. As the student progresses, it reviews drafts, suggests improvements to experimental design, helps interpret results, and provides guidance on presenting findings effectively. Most impressively, it adapts its support based on each student’s strengths, weaknesses, and learning style — providing more structure for those who need it while encouraging independence in others.

Personal Productivity

Personal productivity agents are perhaps the most accessible use case for most people. A robust productivity agent transforms workload management. It’s not just a glorified to-do list — it’s a genuine workload management partner.

It tracks projects across multiple tools (email, task managers, documents, calendar), identifies dependencies and potential conflicts, and proactively suggests schedule adjustments. When receiving new requests, it evaluates them against current commitments and helps determine what to prioritize or delegate. It drafts appropriate responses based on communication style and relationship with each person.

What makes it truly valuable is how it learns preferences and working patterns over time. It recognizes which times of day are most suited for creative work versus meetings, which tasks tend to be procrastinated on, and how long similar tasks have typically taken in the past. It uses this knowledge to suggest realistic schedules that work with actual habits rather than some idealized productivity system.

Challenges and Considerations

While AI agents present incredible opportunities, they also come with significant challenges that we need to address as developers and users:

Alignment Problems: When Agents Go Off-Track

Consider an email management agent designed to prioritize inbox messages. Despite clear instructions about what “important” means, the agent might flag all messages from a manager as urgent (including lunch invitations) while categorizing client emergency requests as “can wait until tomorrow.” Why? Because it observed the user responding quickly to their boss several times and learned the wrong pattern from this behavior.

This is what’s called an alignment problem — when agents optimize for goals that don’t match the user’s actual intentions. As agents gain more capabilities and autonomy, ensuring they accurately understand true objectives becomes critically important. The issue isn’t about malicious AI but rather misunderstandings that can have significant consequences when agents have meaningful power to act independently.

The Black Box Problem: Why Did It Do That?

Have you ever had an agent make a decision that left you scratching your head? I remember reviewing code changes made by an agent that completely restructured our authentication system. The changes worked, but I had no idea why the agent thought this approach was better.

Without transparency into agent reasoning, it’s difficult to trust their decisions or learn from their approaches. The most effective agent systems I’ve worked with provide clear explanations of their decision-making process — not just what they did, but why they chose that approach over alternatives.

Security Headaches: New Attack Surfaces

Giving agents access to systems creates new security considerations. A colleague of mine built an agent to help manage their AWS infrastructure. It was incredibly useful until it accidentally exposed sensitive configuration details in logs because it didn’t understand the security implications.

Agents often need broad access privileges to be useful, but this creates potential security vulnerabilities. Careful permission design, monitoring systems, and appropriate guardrails are essential — especially when agents interact with critical systems.

The Responsibility Question: Who’s Accountable?

Suppose your automated trading agent makes a series of questionable trades that lose money. The question immediately arises: who’s responsible? The developer who built it? You, who deployed it? The company that created the underlying AI model?

As agents take more autonomous actions in the world, we need clearer frameworks for accountability. This isn’t just a legal question — it’s also about designing appropriate human oversight and intervention mechanisms that preserve the efficiency benefits of automation while maintaining appropriate control.

Conclusion

If you’re just starting to explore this world of AI agents, don’t be intimidated. Start small — maybe with a personal productivity agent or a code assistant. Watch how it works, learn its strengths and limitations, and gradually expand the tasks you entrust to it. Before you know it, you’ll be designing multi-agent systems to tackle complex workflows that previously required entire teams.

For those already building agents, consider the human-agent relationship carefully. The most successful implementations I’ve seen don’t aim to replace human workers, but rather to enhance their capabilities — handling routine tasks so that people can focus on creative problem-solving, strategic thinking, and interpersonal connections.

Whether you’re looking to build AI agents or just understand how they’ll impact your work, there’s no better time to dive in. The tools are becoming increasingly accessible, their capabilities more impressive, and their applications more diverse with each passing month.

Top 5 Open Source Vector Databases for 2025 (Milvus vs. Qdrant. vs Weaviate vs Faiss. etc.)

Fendy Feng — Fri, 20 Jun 2025 07:14:23 GMT

Introduction

Vector search, also known as vector similarity search, has quickly evolved from an experimental technology to a must-have component in many AI applications. As developers and technical leaders, we’re increasingly looking for ways to serve similarity-based queries that traditional databases simply weren’t designed to handle efficiently.

Whether you’re building a product recommendation system or implementing semantic search, the underlying challenge is the same: how do you efficiently find the “nearest neighbors” to a query vector in a potentially massive dataset? That’s where vector search engines come in.

The good news is that the open source community has stepped up with multiple high-quality options. The challenging part? Figuring out which one is right for your specific use case, technical requirements, and team expertise.

In this guide, we’ll walk through the most popular open-source vector search engines available today, compare their strengths and limitations, and provide practical insights to help you make an informed decision. We’ll cover everything from the technical foundations to specific implementation considerations, with a focus on real-world applications.

Understanding Vector Search: Core Concepts

Before diving into specific engines, let’s establish some shared understanding of what vector search actually involves.

What Are Vector Embeddings?

At its core, vector search relies on embedding data into vectors — essentially converting information (text, images, audio, or any other data type) into lists of floating-point numbers that capture semantic meaning. These vectors typically range from dozens to thousands of dimensions.

For example, a text embedding model might encode the sentence “The weather is nice today” into a 384-dimensional vector where semantically similar sentences like “It’s a beautiful day” would be positioned nearby in this high-dimensional space.

Vector Search vs. Traditional Search

Traditional search engines typically use inverted indices and exact keyword matching. Vector search, in contrast, measures the distance between vectors to find similar items, regardless of exact keyword overlap.

Consider these approaches:

Traditional keyword search matches “red leather jacket” with documents containing exactly those words. Vector search, however, can match “red leather jacket” with items that are conceptually similar, even if described as “scarlet biker coat” because it understands the semantic similarity rather than requiring exact term matches.

Key Performance Metrics

When evaluating vector search engines, several metrics matter:

Query speed is measured in milliseconds or queries per second (QPS), indicating how quickly results are returned. Recall represents the percentage of relevant results actually retrieved compared to what should have been retrieved. Index build time tells you how long it takes to create the search index, while memory usage reflects RAM requirements for both indexing and querying. Scalability refers to a system’s ability to handle increasing data volumes and query loads without experiencing performance degradation.
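Two of these metrics are simple enough to compute by hand. As a small sketch, recall can be measured against exact (brute-force) neighbors, and QPS is the number of queries divided by elapsed time; the vector IDs and timing figures below are made up.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the true neighbors that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def queries_per_second(num_queries: int, elapsed_seconds: float) -> float:
    return num_queries / elapsed_seconds


ground_truth = {"v1", "v2", "v3", "v4"}          # exact nearest neighbors
ann_results = ["v1", "v9", "v3", "v4", "v7"]     # what an approximate index returned

recall = recall_at_k(ann_results, ground_truth, k=5)            # 3 of 4 found -> 0.75
qps = queries_per_second(num_queries=500, elapsed_seconds=2.0)  # 250.0
```

Reporting recall and QPS together matters because most index parameters trade one for the other.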

Understanding these fundamentals will help frame our exploration of the specific engines.

An Overview: Key Features of Top Vector Search Engines

Milvus

Milvus is the most popular open-source vector database with more than 35,000 stars on GitHub. It first appeared in 2019 and has since gained significant traction in the developer community. Created specifically to handle large-scale similarity searches, Milvus was designed from the ground up to address the unique challenges of vector data management.

Architecture and Technical Capabilities

Milvus uses a cloud-native architecture with separated storage and compute layers. Stateless query nodes handle search requests, storage nodes manage data persistence, and coordinator nodes handle cluster management. This separation allows Milvus to scale horizontally as data volumes and query loads increase — a critical consideration for production deployments.

The platform supports multiple index types, including HNSW (Hierarchical Navigable Small World), IVF (Inverted File), DiskANN, and others, providing developers with flexibility to optimize for different workloads. Milvus also offers hybrid search capabilities, combining vector similarity with scalar filtering and full-text search, which proves valuable when search needs to consider both semantic similarity and keyword matching, as well as metadata constraints.

Milvus supports multiple distance metrics, including Euclidean, Cosine, and Inner Product, making it adaptable to various embedding types and similarity definitions. Its storage architecture includes time travel capabilities, allowing point-in-time queries and backups.
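The choice of metric matters because the three can rank the same vectors differently. The following is a plain-Python sketch of the standard definitions, not Milvus code, and the sample vectors are invented to make the disagreement visible.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Dot product: larger means more similar, and magnitude counts."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product of the normalized vectors: direction only, magnitude ignored."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

query = [1.0, 0.0]
short_aligned = [0.5, 0.0]   # same direction as the query, small magnitude
long_off_axis = [3.0, 3.0]   # 45 degrees away from the query, large magnitude

# Cosine and Euclidean both prefer short_aligned; inner product prefers long_off_axis.
cosine_winner = max([short_aligned, long_off_axis], key=lambda v: cosine(query, v))
ip_winner = max([short_aligned, long_off_axis], key=lambda v: inner_product(query, v))
l2_winner = min([short_aligned, long_off_axis], key=lambda v: euclidean(query, v))
```

If your embedding model produces unit-length vectors, cosine and inner product give the same ranking, so the metric should follow whatever the model was trained with.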

Milvus can be used to build various types of AI applications, from demos running locally in Jupyter Notebooks to massive-scale Kubernetes clusters handling tens of billions of vectors. Currently, there are three Milvus deployment options: Milvus Lite, Milvus Standalone, and Milvus Distributed.

Performance Characteristics

In benchmarks, Milvus demonstrates query latency typically in single-digit milliseconds for million-scale datasets, making it suitable for real-time applications. The platform supports ANNS (Approximate Nearest Neighbor Search) algorithms that trade perfect recall for substantial speed improvements — an essential trade-off for practical applications.

Memory usage in Milvus is managed through disk-based storage with memory caching, allowing it to handle datasets larger than available RAM. This approach makes Milvus more cost-effective for large vector collections compared to purely in-memory solutions.

For most production workloads, Milvus strikes a balance between recall accuracy and query speed, with tunable parameters that allow for adjustments based on specific requirements. However, this flexibility comes with added complexity in configuration and optimization.

Migration Simplicity

A notable advantage of Milvus is the straightforward migration path from other vector databases. Through open-source migration tools like the Vector Transport Service (VTS) tool, moving data from other vector search engines to Milvus is simplified. This tool supports automated schema mapping, incremental data migration, and data validation during the transfer process. This makes Milvus particularly attractive for teams that have outgrown their current solution or want to standardize on a single platform.

That said, migration always involves some effort and risk, so thorough testing remains necessary, despite the use of these tools.

Zilliz Cloud: Fully Managed Milvus

While the open-source Milvus is powerful on its own, it requires local machines and engineering resources to deploy, operate, and maintain when building production-level applications. Zilliz, the engineering team behind Milvus, has created a fully managed Milvus on Zilliz Cloud, eliminating all the operational overhead for its customers so that they can invest more in creation and their business, rather than devoting all resources to infrastructure management.

This Zilliz Cloud service provides additional feature sets, simplified deployment and operations, automatic scaling and resource management, advanced security features, and SLA-backed reliability. The managed service also includes continuous updates and optimizations, eliminating the need for in-house expertise.

For teams focused on building applications rather than managing infrastructure, Zilliz Cloud provides a way to leverage Milvus without operational overhead.

Community and Ecosystem

The Milvus ecosystem has grown substantially, with an active GitHub repository that features regular releases. The project offers client SDKs for Python, Java, Go, and other languages, as well as integration with popular AI models and ML frameworks like LangChain and LlamaIndex. Additionally, it features a growing community forum and comprehensive documentation.

This ecosystem maturity reduces implementation risks and provides multiple resources for troubleshooting. However, like any open-source project, community support can sometimes be unpredictable compared to paid support options.

Faiss

Faiss, short for Facebook AI Similarity Search, is a popular vector search library developed and open-sourced by Facebook AI Research (now Meta) in 2017. Unlike some other options in this comparison, Faiss was created by researchers for researchers, initially focusing on academic and experimental workloads before being adopted for production systems.

Technical Overview

Faiss takes a different approach from some other vector search solutions. It’s implemented in C++ with Python bindings for performance and designed as a library rather than a standalone service. One distinguishing feature is its optimization for both CPU and GPU execution, with certain workloads seeing dramatic speedups on GPU hardware.

The library offers multiple index types tailored for various scenarios. IndexFlatL2 offers exact search with L2 distance for perfect accuracy. IndexIVFFlat implements an inverted file with flat storage for improved query speed. IndexHNSW leverages Hierarchical Navigable Small World graphs for efficient approximate search. IndexPQ utilizes product quantization for memory efficiency, allowing even modest hardware to search billions of vectors.

Strengths and Limitations

One of Faiss’s major strengths is raw performance. It’s often the fastest option for in-memory vector search when properly configured. The library achieves memory efficiency through clever compression techniques, such as product quantization, which can reduce vector storage requirements by an order of magnitude.
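To see where an order-of-magnitude saving can come from, here is a back-of-the-envelope calculation. The dimension, number of sub-vectors, and code size are illustrative assumptions rather than Faiss defaults, and the small shared codebook is ignored.

```python
def raw_vector_bytes(dim, bytes_per_float=4):
    """Size of one uncompressed float32 vector."""
    return dim * bytes_per_float

def pq_code_bytes(num_subvectors, bits_per_code=8):
    """Size of one product-quantized vector: one short code per sub-vector."""
    return num_subvectors * bits_per_code // 8

# Assumed example parameters (not taken from the article).
DIM = 128
NUM_SUBVECTORS = 16
NUM_VECTORS = 1_000_000_000

raw = raw_vector_bytes(DIM)                  # 128 floats * 4 bytes = 512 bytes
compressed = pq_code_bytes(NUM_SUBVECTORS)   # 16 codes * 1 byte = 16 bytes
ratio = raw / compressed                     # 32x smaller per vector

raw_total_gb = raw * NUM_VECTORS / 1e9        # 512 GB uncompressed
pq_total_gb = compressed * NUM_VECTORS / 1e9  # 16 GB of PQ codes
```

Under these assumptions a billion vectors shrink from roughly 512 GB to 16 GB, at the cost of approximate distances.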

Faiss also stands out with native GPU support for even faster processing, making it ideal for research environments with access to GPU resources. The library offers fine-grained control with detailed parameter tuning options for those who want to optimize their workloads.

However, Faiss comes with notable limitations. It has no built-in persistence layer, meaning developers must handle saving and loading indexes themselves. It requires more integration work than turnkey solutions since it’s a library rather than a service. Faiss is also less suited for distributed deployments without additional engineering work. As a result, many developers use Faiss mainly for experimentation and prototyping.

Perhaps most significantly, Faiss has a steeper learning curve than some alternatives. The documentation, while comprehensive, assumes a strong understanding of the underlying algorithms and techniques.

Annoy

Annoy, which stands for “Approximate Nearest Neighbors Oh Yeah,” was developed by Spotify and open-sourced in 2013, making it one of the older solutions in this comparison. Created specifically to power Spotify’s music recommendation system, Annoy takes a distinct approach optimized for read-heavy workloads with relatively static data.

Approximate Nearest Neighbors Approach

Annoy uses random projection binary search trees as its core algorithm. Each tree splits the vector space differently, creating a forest of trees that collectively provide good approximations of the true nearest neighbors. As more trees are added to the forest, the probability of finding the true nearest neighbors increases, allowing a trade-off between accuracy and resource usage.
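The idea can be sketched in a few dozen lines of plain Python. This is a simplified illustration and not Annoy's implementation: each tree splits on a hyperplane midway between two randomly chosen points, and a query descends to a single leaf per tree, whereas the real library searches more cleverly. All names and parameters are made up for this example.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(ids, vectors, leaf_size, rng):
    """Recursively split ids with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    # The hyperplane is equidistant from two randomly sampled points.
    a, b = (vectors[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    midpoint = [(x + y) / 2 for x, y in zip(a, b)]
    offset = dot(normal, midpoint)
    left = [i for i in ids if dot(normal, vectors[i]) <= offset]
    right = [i for i in ids if dot(normal, vectors[i]) > offset]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": ids}
    return {
        "normal": normal,
        "offset": offset,
        "left": build_tree(left, vectors, leaf_size, rng),
        "right": build_tree(right, vectors, leaf_size, rng),
    }

def query_tree(node, q):
    """Follow the query down one tree and return the ids in its leaf."""
    while "leaf" not in node:
        side = "left" if dot(node["normal"], q) <= node["offset"] else "right"
        node = node[side]
    return node["leaf"]

def search(forest, vectors, q, k):
    """Union the candidate leaves from every tree, then rank them exactly."""
    candidates = set()
    for tree in forest:
        candidates.update(query_tree(tree, q))
    def sq_dist(i):
        return sum((x - y) ** 2 for x, y in zip(vectors[i], q))
    return sorted(candidates, key=sq_dist)[:k]

rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(200)]
forest = [build_tree(list(range(200)), vectors, 10, rng) for _ in range(5)]
nearest = search(forest, vectors, vectors[17], k=3)
```

Each extra tree adds another leaf of candidates, which is why more trees raise the chance of catching the true neighbors while costing more memory and query time.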

This approach differs significantly from the graph-based methods used by many newer vector search engines.

Performance Trade-offs

Annoy makes specific trade-offs that distinguish it from more general-purpose solutions. It’s read-optimized, delivering very fast performance at query time, but this comes at the cost of write flexibility. Once built, Annoy indexes don’t change — new data requires rebuilding the index.

The system is disk-based, with indexes that can be memory-mapped for efficiency. This allows Annoy to handle datasets larger than available RAM while maintaining good query performance. However, Annoy offers limited functionality beyond core approximate nearest neighbor search, lacking many features found in more comprehensive solutions.

These design choices make Annoy different from databases designed for frequent updates and complex queries.

Integration Options

Annoy offers Python bindings with scikit-learn compatibility, making it accessible to data scientists and ML engineers. Its C++ core provides good performance despite the simplified API. The library supports easy serialization and deserialization of indexes, facilitating offline build processes.

The API is simple and focused exclusively on nearest neighbor search, making it easy to learn but limited in functionality. Unlike more comprehensive vector databases, Annoy requires additional infrastructure for features like persistence, scaling, and query filtering.

Weaviate

Weaviate emerged in 2019 as a different approach to vector search. Unlike pure vector databases, Weaviate combines vector search capabilities with a knowledge graph, creating a hybrid system designed to add contextual understanding to similarity queries.

What sets Weaviate apart is its graph-based data model. In Weaviate, data objects can be connected through semantic relationships, and these connections add context to vector-based queries. This allows queries to blend vector similarity with graph traversal, supporting more sophisticated searches than simple nearest-neighbor matching. For instance, a deployment might store product embeddings and also model relationships between products, categories, and brands. A user query could then return not only similar items but also those connected through shared attributes or behaviors.

This hybrid model enables expressive querying, but it also introduces additional complexity in data modeling and indexing. Developers must manage both vector embeddings and graph relationships, which can increase the learning curve and operational overhead.

Weaviate uses HNSW-based indexing for efficient vector search and supports flexible filtering applied either pre- or post-search. It scales through sharding, allowing it to handle growing datasets and query loads. However, distributed setups can become more complex to configure and operate, particularly at larger scales.

While Weaviate performs well across a variety of use cases, it’s not always the top performer in pure vector search benchmarks. Its additional graph features, while powerful, can lead to slower response times when executing complex queries that combine vector search with multiple relationship traversals. This makes it better suited to applications that benefit from contextual enrichment, rather than those requiring ultra-low latency on high-throughput vector-only workloads.

Qdrant

Qdrant (pronounced “quadrant”) is a newer entrant to the vector database space, first appearing in 2021. Qdrant provides both REST and gRPC APIs for interacting with the database, making it accessible from virtually any programming language. Its storage is isolated in collections, similar to tables in traditional databases, providing logical separation of different data types. The architecture offers point-in-time consistency guarantees and ACID-compliant operations for data reliability. This approach makes Qdrant more familiar to developers coming from traditional database backgrounds, reducing the learning curve.

A key strength of Qdrant is its ability to combine vector search with traditional filtering. The platform offers rich filter expressions that execute efficiently as part of the search process. Its payload-based filtering integrates directly into the search rather than being applied as a post-processing step. It also supports complex boolean conditions, including AND, OR, and NOT operations across multiple fields, and allows boosting results based on specific filter conditions — useful for nuanced ranking in hybrid search.
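Why it matters whether the filter runs inside the search or afterwards can be shown with a tiny brute-force sketch. This is not Qdrant code or its API; the items, field names, and functions are hypothetical and only illustrate the difference in outcome.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical items with a vector and a metadata payload.
ITEMS = [
    {"id": 1, "vec": [0.0, 0.1], "color": "blue"},
    {"id": 2, "vec": [0.1, 0.0], "color": "blue"},
    {"id": 3, "vec": [0.2, 0.2], "color": "red"},
    {"id": 4, "vec": [0.9, 0.9], "color": "red"},
]

def post_filter_search(query, k, predicate):
    """Take the k nearest items first, then drop those failing the filter."""
    nearest = sorted(ITEMS, key=lambda it: sq_dist(it["vec"], query))[:k]
    return [it["id"] for it in nearest if predicate(it)]

def filtered_search(query, k, predicate):
    """Apply the filter as part of the search, then take the k nearest."""
    allowed = [it for it in ITEMS if predicate(it)]
    nearest = sorted(allowed, key=lambda it: sq_dist(it["vec"], query))[:k]
    return [it["id"] for it in nearest]

def is_red(item):
    return item["color"] == "red"

post = post_filter_search([0.0, 0.0], k=2, predicate=is_red)
integrated = filtered_search([0.0, 0.0], k=2, predicate=is_red)
```

Post-filtering returns nothing here because the two nearest items are both blue, while the integrated version still returns two red items.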

However, this filtering flexibility comes with trade-offs. As filter expressions become more complex or datasets grow, query performance may degrade, particularly when many filters are applied in high-cardinality fields. Additionally, while Qdrant supports distributed deployments, its horizontal scaling features are still evolving compared to more mature systems, and operational tooling around large-scale clustering remains relatively limited. These factors should be considered when evaluating Qdrant for high-scale or highly dynamic workloads.

Other Notable Vector Search Options

Beyond the main purpose-built options highlighted above, many traditional databases have started to offer vector search capability as an add-on.

Elasticsearch with Vector Search

Elasticsearch, already widely adopted for text search, has added vector search capabilities in recent versions. This functionality introduces kNN (k-Nearest Neighbors) search to the Elasticsearch ecosystem, enabling organizations to utilize their existing infrastructure for vector search requirements.

The integration with existing Elasticsearch features enables teams to combine traditional text search, faceting, and aggregations with vector similarity on a single platform. The familiar API reduces the learning curve for teams already using Elasticsearch.

This approach works well for organizations already invested in the Elastic ecosystem who need to add vector capabilities without adopting an entirely new database. However, performance may not match purpose-built vector databases for large-scale, vector-only workloads.

Vespa

Vespa is Yahoo’s open source search engine that combines traditional search, vector search, and sophisticated ranking in a single platform. It offers real-time indexing and searching, with updates immediately available for query, unlike some solutions that require batch processing or index rebuilding.

The platform provides sophisticated ranking frameworks that can combine multiple signals, including vector similarity, text relevance, and business rules. It scales to large deployments with a distributed architecture and has been battle-tested in production at major internet companies.

Vespa’s comprehensive feature set makes it suitable for complex search applications, though this comes with increased complexity compared to more focused solutions. It requires more resources to deploy and maintain than simpler vector search options.

pgvector

pgvector is an extension that adds vector data types and operations to PostgreSQL, allowing vector search within a traditional relational database. It supports multiple index types including IVF and HNSW for efficient similarity search on vector columns.

The key advantage is the ability to use SQL queries combining vector and relational data, making it easy to add vector search to existing applications without adopting a separate database. This option leverages existing PostgreSQL infrastructure and expertise, potentially reducing operational overhead.
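A query of that kind might look like the sketch below, which keeps the SQL in Python strings so it could be passed to a PostgreSQL driver. The table, columns, and sample values are hypothetical; the `vector` type, the `<=>` cosine-distance operator, and the `vector_cosine_ops` index class come from pgvector. Nothing here connects to a database.

```python
def to_vector_literal(values):
    """Format a list of numbers as a pgvector text literal, e.g. '[0.9,0.8,0.1]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"

# Hypothetical schema: a relational column next to a 3-dimensional embedding.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE products (
    id bigserial PRIMARY KEY,
    category text NOT NULL,
    embedding vector(3)
);
CREATE INDEX ON products USING hnsw (embedding vector_cosine_ops);
"""

# One statement combines a relational filter with vector ordering.
# %s placeholders assume a driver with psycopg-style parameter binding.
SEARCH_SQL = """
SELECT id, category
FROM products
WHERE category = %s
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""

params = ("jackets", to_vector_literal([0.9, 0.8, 0.1]), 5)
```

With a driver such as psycopg, the query would be run as `cursor.execute(SEARCH_SQL, params)`.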

The main limitation is that performance may not match dedicated vector databases for very large vector collections or high query volumes. It represents a pragmatic compromise rather than an optimized solution for vector-only workloads. More importantly, is SQL really necessary for AI workloads in the future?

Emerging Options

The vector database space continues to evolve with newer projects entering the field. Chroma focuses specifically on embeddings for LLM applications, with simplified APIs for RAG implementations. Marqo emphasizes simplicity and cloud-native operations, aiming to reduce the operational burden of vector search. LanceDB offers embedded vector search capabilities, targeting edge devices and applications that need to operate offline.

These emerging options show the continued innovation in the space, though they generally lack the production history and ecosystem maturity of more established solutions.

Choosing the Right Vector Search Engine

With so many options available, selecting the right vector search engine requires careful consideration of your specific needs and constraints.

Decision Framework

When evaluating vector search engines, start by considering your scale requirements — how many vectors will you store and query, both now and in the future? Different engines have different scaling characteristics and sweet spots.

Next, assess your query patterns. Will you perform pure vector search, or do you need to combine vector similarity with filtering, relationship traversal, or other operations? Some engines excel at pure vector search but struggle with complex hybrid queries.

Update frequency is another important consideration. If your data changes frequently or requires real-time updates, solutions like Annoy that require rebuilding indexes will be problematic. Conversely, if your data is relatively static, simpler architectures may offer performance advantages.

Integration needs matter as well. Do you need a standalone service, a library to embed in your application, or an extension to an existing database? Your current infrastructure and team expertise may make certain options more practical than others.

Finally, consider your team’s expertise with specific technologies. The best technical solution on paper may not be the best choice if your team lacks the skills to implement and maintain it effectively.

Scaling Considerations

Different engines approach scaling in different ways, and understanding these differences is crucial for achieving long-term success. Milvus offers horizontal scaling with separated storage and compute, allowing independent scaling of different components as needs change. Faiss excels at vertical scaling, particularly with GPU acceleration, but requires more custom work for distributed deployments.

Your anticipated growth trajectory should influence your choice, with some solutions better suited to gradual scaling while others may require significant re-architecture as you grow.

Total Cost of Ownership

When selecting a vector search engine, consider all aspects of total cost of ownership. Infrastructure costs include RAM and CPU requirements, which vary significantly between solutions. Some engines require substantial memory for optimal performance, while others can operate effectively with more modest resources.

Operational complexity affects ongoing maintenance costs. Deployment, monitoring, and maintenance effort varies widely, with some solutions requiring specialized expertise while others integrate more easily with standard DevOps practices.

Development time is another important factor. The learning curve and integration complexity of different engines can significantly impact project timelines and success rates. Solutions with better documentation, more examples, and more intuitive APIs typically result in faster implementation.

Support options range from community forums to commercial support agreements. Consider your organization’s requirements for response times and support guarantees when evaluating options.

Finally, consider potential migration costs. If your needs change, how difficult would it be to switch to a different solution? Engines with standard APIs and export capabilities provide more future flexibility.

Future-Proofing

Vector search technology is evolving rapidly; therefore, selecting a solution that can adapt to your changing needs is crucial. Examine community activity and release cadence to assess ongoing development. Projects with regular updates and active discussion forums are more likely to remain relevant and up-to-date.

Corporate backing and sustainability matter for long-term viability. Projects supported by established companies or foundations generally have more stable development trajectories.

Aligning the feature roadmap with your anticipated needs helps ensure the solution grows in directions that benefit your use cases. Finally, flexibility to adapt as requirements change provides insurance against unexpected shifts in project requirements.

Benchmarking with Real-world Workloads

Benchmark results are often the first thing teams look at when comparing vector search engines, but many published benchmarks fail to reflect real-world usage. Synthetic tests tend to focus on idealized conditions — fixed datasets, uniform queries, and read-heavy workloads — while ignoring the complexities of real applications. In production, your system may need to support frequent updates, concurrent queries, multi-modal filtering, and hybrid search across structured and unstructured data. These challenges can drastically affect actual performance, scalability, and reliability.

To make an informed choice, prioritize benchmarks that replicate your expected workload patterns as closely as possible. Testing with real datasets, realistic query volumes, and operational constraints will provide a more accurate picture of how a vector search engine performs in your environment.
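A small harness along these lines is often enough for a first pass with your own queries. It is a single-threaded sketch; `fake_search` is a placeholder you would replace with a real client call, and a serious test would add concurrency, warm-up, and ingestion running in parallel.

```python
import math
import time

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def run_benchmark(search_fn, queries):
    """Run each query once and report throughput plus median and tail latency."""
    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)
        latencies.append(time.perf_counter() - t0)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    latencies.sort()
    return {
        "qps": len(queries) / elapsed,
        "p50_ms": percentile(latencies, 50) * 1000,
        "p99_ms": percentile(latencies, 99) * 1000,
    }

def fake_search(query):
    """Placeholder workload (assumption); swap in your engine's search call."""
    return sum(x * x for x in query)

report = run_benchmark(fake_search, [[0.1, 0.2, 0.3]] * 500)
```

Reporting p99 alongside the median matters because tail latency is usually what users notice under production load.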

VDBBench is an open-source benchmark designed from the ground up to simulate production reality. Unlike synthetic tests that cherry-pick scenarios, VDBBench pushes databases through continuous ingestion, rigorous filtering conditions, and diverse scenarios, just like your actual production workloads.

VDBBench GitHub: https://github.com/zilliztech/VectorDBBench.

Conclusion and Next Steps

Vector search has moved beyond niche applications to become a fundamental building block for many modern applications. The open source ecosystem offers multiple strong options, each with distinct advantages and trade-offs.

For most teams just starting with vector search, Milvus provides a good balance of features, performance, and operational simplicity. Its comprehensive functionality and growing ecosystem make it suitable for a wide range of use cases, while fully managed options like Zilliz Cloud reduce operational overhead.

For specific needs, alternatives like Faiss (performance-focused), Weaviate (knowledge graph integration), Qdrant (filtering capabilities), or Annoy (read-optimized workloads) may be better fits.

Whatever you choose, start small, benchmark thoroughly against your specific workload, and validate assumptions before committing to a production deployment. Vector search technology continues to evolve rapidly, so staying engaged with the community around your chosen solution is essential for long-term success.

Ready to get started? Most of these projects offer excellent quickstart guides, Docker containers for easy experimentation, and active communities eager to help newcomers. The best way to evaluate is to build a small proof of concept with your actual data and query patterns.

Happy searching!

Why Not All VectorDBs Are Agent-Ready

Fendy Feng — Fri, 20 Jun 2025 06:59:13 GMT

Your AI agent just crushed another demo. Investors are impressed, users love the experience, and your team is riding high. But lurking beneath that success is a ticking time bomb: the infrastructure choice you made three months ago when you just needed something that worked.

Sound familiar? We’ve seen this story dozens of times — brilliant agents built on infrastructure that crumbles under success. The root cause is almost always the same: vector database choice. As the backbone of AI agent memory, it’s where most teams unknowingly sabotage their own scaling potential.

And choosing the right one just got a lot harder. Since AI exploded, every database vendor suddenly decided they’re a “vector database.” It’s like watching pizza shops declare themselves five-star restaurants because they added truffle oil to the menu.

Sure, these solutions work great for your 10,000-vector prototype. But when you hit 100 million vectors with thousands of concurrent users in production? That’s when reality hits hard.

Four Types of “VectorDBs”: Only One Works for Production AI Agents

The landscape can be broken down into four approaches. Three will have you rebuilding everything when success arrives. One is built for the scale you’re trying to reach.

Vector Search Libraries: FAISS and HNSWLIB deliver great benchmarks but have almost no production features. No persistence means a server restart wipes your agent’s memory. No concurrency support creates race conditions with multiple users. No real-time updates mean index rebuilds can take hours, freezing your agent’s learning. Great for research, terrible for production.

Traditional Databases with Vector Add-ons: PostgreSQL + pgvector seems sensible until you realize you’re forcing vector operations through systems designed for completely different workloads. They work fine at 1 million vectors if there are few changes (i.e., the index stays the same), but their performance degrades unpredictably under more dynamic workloads or with many concurrent users. Elasticsearch has similar issues: vector operations get wrapped in a query DSL designed for text search, creating performance overhead that compounds with complex agent queries. These solutions treat vectors as secondary features, not core capabilities.

Lightweight Vector Solutions: Light solutions like Chroma optimize for convenience over scale. Setup takes minutes, and APIs are clean, but they hit scaling walls around hundreds of thousands of vectors. When your agent gains traction, architectural limitations force expensive migrations just when success arrives.

Purpose-Built Vector Databases: Then there are databases like Milvus, designed from the ground up for real-world vector operations at scale. Every component — storage engines, query optimizers, network protocols — is architected specifically for similarity search and production AI agent workloads.

What Production Agents Actually Demand

You might be thinking: “Come on, how bad can it really be? PostgreSQL handles millions of rows just fine, and my prototype works great.” I get the skepticism — every database vendor promises their solution scales, and frankly, most work adequately for basic similarity search.

But here’s what changes everything: production AI agents don’t just do basic similarity search. They need complex operations under real-world constraints that expose the fundamental limitations of retrofitted solutions.

Exponential scaling math: When your ProductHunt feature drives 10x overnight growth, your vector index built for 100,000 embeddings now faces 1 million. Traditional databases like PostgreSQL+pgvector can start doing full table scans because their indexing wasn’t designed for high-dimensional vector density. Query times jump from 50ms to 5+ seconds as similarity search complexity scales exponentially with both data volume and concurrent access.

The 100ms hybrid search reality: Your customer service agent needs to execute queries like “Find billing discussions for this customer, excluding resolved issues, similar to the current complaint, prioritizing the last 30 days.” That’s semantic similarity combined with metadata filtering, temporal constraints, and business logic — all in under 100ms, or the conversation feels broken. Most vector databases force you to choose between speed and complexity.
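The logic of that example query can be written out in plain Python to show how many conditions are packed into it. This is a brute-force sketch with invented ticket data and field names; treating "prioritizing the last 30 days" as a fixed score boost is an assumption, and a production database would run all of this inside the index rather than in a loop.

```python
import math
from datetime import datetime, timedelta

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_search(tickets, customer_id, query_vec, now, k=3,
                  recent_days=30, recency_boost=0.1):
    """Metadata filters first, then similarity ranking with a recency boost."""
    scored = []
    for t in tickets:
        # Filters: this customer, billing topic, unresolved issues only.
        if t["customer_id"] != customer_id or t["topic"] != "billing" or t["resolved"]:
            continue
        score = cosine(t["vec"], query_vec)
        if now - t["created_at"] <= timedelta(days=recent_days):
            score += recency_boost  # assumed way to favor recent tickets
        scored.append((score, t["id"]))
    scored.sort(reverse=True)
    return [ticket_id for _, ticket_id in scored[:k]]

def ticket(tid, customer, topic, resolved, vec, created):
    return {"id": tid, "customer_id": customer, "topic": topic,
            "resolved": resolved, "vec": vec, "created_at": created}

NOW = datetime(2025, 6, 20)
TICKETS = [
    ticket("t1", "c1", "billing", False, [1.0, 0.0], datetime(2025, 6, 10)),
    ticket("t2", "c1", "billing", True, [1.0, 0.0], datetime(2025, 6, 12)),
    ticket("t3", "c1", "billing", False, [0.9, 0.1], datetime(2025, 1, 1)),
    ticket("t4", "c2", "billing", False, [1.0, 0.0], datetime(2025, 6, 11)),
    ticket("t5", "c1", "shipping", False, [1.0, 0.0], datetime(2025, 6, 13)),
    ticket("t6", "c1", "billing", False, [0.8, 0.3], datetime(2025, 6, 15)),
]

results = hybrid_search(TICKETS, "c1", [1.0, 0.0], NOW)
```

With this data the result is t1, t6, t3: the resolved, other-customer, and shipping tickets are filtered out, and the recent t6 outranks the older but slightly more similar t3.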

Multi-tenant data isolation: In a multi-tenant situation, Customer A’s 10,000 documents and Customer B’s 10 million both need consistent sub-second performance with zero data leakage, not just for privacy, but for regulatory compliance. Simple partitioning creates “noisy neighbor” problems where large customers degrade everyone’s performance. You need database-level isolation that maintains predictable performance characteristics.

Global compliance without compromise: GDPR requires EU data to stay in European data centers, while Chinese regulations mandate local residency. Yet your agents need unified access to global knowledge bases. Your infrastructure must support federated search across regions while maintaining strict data locality, comprehensive audit trails, and real-time updates — all without performance degradation.

Why Open-Source Milvus Solves What Others Can’t

Given these demanding production requirements, let’s talk about what actually works.

Milvus is an open-source vector database purpose-built from the ground up for scalable vector and AI search workloads. While other approaches struggle with the exponential scaling math, 100ms hybrid search reality, multi-tenant isolation, and global compliance demands we just outlined, Milvus treats these as core design requirements rather than afterthoughts.

Here’s what Milvus delivers for production agents:

True Horizontal Scaling at Billion Scale: Add capacity by adding nodes, not rewriting architecture. Proven on billions of vectors with consistent performance.
Native and Flexible Multi-Tenancy: Database-level, collection-level, and partition-level isolation with predictable performance, eliminating the workarounds that plague other solutions.
Hybrid Search Excellence: Semantic similarity, metadata filtering, and keyword search in unified queries, with no separate systems to maintain.
Real-Time Agent Memory: Continuous updates without index rebuilding delays or performance dead zones.
Open Source Foundation: Complete transparency, no vendor lock-in, and a community of thousands contributing to your success.

With over 35,000 GitHub stars and adoption by thousands of production AI systems, it’s proven where others promise.

Milvus 2.6 is available now, delivering dozens of breakthrough innovations across cost reduction, advanced search capabilities, and architectural enhancements built for massive scale. Explore all the details in this launch blog, or join our webinar with James Luan, VP of Engineering at Zilliz, for an exclusive deep dive into what’s new in this release.

For Startups Who Want to Build, Not Babysit — Try Zilliz Cloud

Well, I know that even the best open-source database requires engineering resources you probably don’t have. Your team should be building agent features that users love, not wrestling with Kubernetes clusters and database optimization.

That’s where Zilliz Cloud wants to help. Built by the original Milvus creators and optimized for production AI workloads, it delivers all the best of Milvus with zero operational burden, plus advanced enterprise features that would take your team months to implement.

Deploy in Minutes, Scale Automatically: One-click deployments with intelligent elastic scaling that automatically adapts to your agent’s usage patterns and traffic spikes.
Serverless Cost Optimization: Pay only for what you use with serverless scaling that automatically adjusts to your agent workload patterns. Many customers save 50% or more compared to alternatives, while also enjoying better performance and reliability.
Natural Language Query Interface: New MCP server support enables your agents to interact with their memory using natural language, such as “Find documents similar to our last conversation about pricing,” rather than complex query languages and API calls.
99.95% Uptime SLA: Your agents stay online, your customers stay happy, and you focus on building breakthrough features instead of debugging infrastructure failures.
Enterprise-Grade Security: SOC2 Type II and ISO27001 certified with comprehensive Role-Based Access Control and BYOC. Your enterprise customers’ compliance requirements are handled from day one, not bolted on later.
Global Scale, Local Performance: Available on AWS, Azure, and GCP across various regions worldwide, ensuring sub-100ms latency wherever your users are located.

Most importantly, you get direct support from the engineers who understand vector databases at the architectural level. When complex challenges arise, you’re working with the team that solved these problems at scale, not posting on forums hoping for community help.

Your Choice Determines Everything

The vector database you choose today determines whether your AI agents scale gracefully or crash when success arrives. As agent capabilities become table stakes, winners will be those who build on production-ready infrastructure while competitors debug scaling issues.

With Milvus, you get the performance, scalability, and flexibility of the leading open-source vector database — ideal for teams that want full control and customization for high-performance AI and vector search workloads. With Zilliz Cloud, you get a fully managed experience that includes hassle-free deployment, autoscaling, advanced enterprise features, built-in security, and compliance, allowing you to go to production faster with confidence.

We’ve guided hundreds of AI companies through this critical decision. For example, we helped Rexera scale its real estate AI agents to handle millions of property listings with sub-50ms hybrid search, seamlessly combining semantic similarity with complex filtering that traditional solutions couldn’t manage. We enabled Verbaflo.ai to serve millions of users with ultra-low latency and strict multi-tenancy that other vector databases simply couldn’t deliver at scale. And we partnered with Fivevine to modernize their AI infrastructure, setting the foundation for the next wave of innovation. The right choice today will set the stage for your success tomorrow.

Ready to Handle Real Growth?

Ready to build agents that scale beyond demos? Try Zilliz Cloud free or reach out to us to see what purpose-built vector infrastructure can do for your AI agents.

And yes, we can help you migrate from Pinecone, Weaviate, pgvector, or any other platform you’re struggling with right now. Whatever you’re paying now, we can likely do it for half the cost, with better performance.

Our vision extends beyond providing infrastructure — we want to help AI startups become the next AI giants. Let’s build for the future together.

Why AI Agent Startups Should Build Scalable Infrastructure From Day One

Fendy Feng — Fri, 20 Jun 2025 03:29:17 GMT

We’re living in the golden age of AI, where small teams are making massive impacts. Cursor hit $100M in ARR with just 20 people. Sakana AI reached a $67M valuation per employee, with only 3 founders. Midjourney scaled to $200M ARR without raising a dime in equity.

In this new era, the same dream of massive impact with a small team is within every developer’s reach. Whether you’re building an AI assistant, a customer support agent, or a personalized tutor, every AI application today has the potential to go viral overnight.

One perfectly timed product launch, a tweet from the right influencer, or a 30-second demo video can propel your app to the top of Hacker News or Product Hunt. Suddenly, you have tens of thousands of users flooding in.

And that’s when the real test begins: Can your infrastructure handle the exponential growth?

Most AI agents are built to validate ideas quickly, not to scale robustly. When viral growth hits — and in the AI agent space, it hits fast and ruthlessly — inadequate infrastructure becomes the quicksand that swallows your breakthrough moment whole.

The Real Bottleneck of Your Agent Isn’t Your LLM — It’s Your Memory Architecture

Here’s something that will change how you think about building AI agents.

As a developer, you know that every production AI agent is built on three core components:

LLM — Your reasoning engine that makes decisions and gives instructions
Tool Use — API integrations and external system access to complete real-world tasks
Memory/Retriever — Context retrieval and knowledge management powered by vector databases
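To make the three components concrete, here is a minimal sketch of how they might be wired together. Everything in it is an illustrative assumption rather than any framework's real API: the Agent class, the run method, and the "CALL <tool> <argument>" reply convention are hypothetical, and the LLM, tools, and retriever are plain Python callables so the example runs without external services.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Agent:
    # Reasoning engine: any callable that maps a prompt to a reply.
    llm: Callable[[str], str]
    # Tool use: named callables that act on external systems.
    tools: Dict[str, Callable[[str], str]]
    # Memory/retriever: returns context snippets relevant to a query.
    # In production this is where a vector database would sit.
    retrieve: Callable[[str], List[str]]

    def run(self, question: str) -> str:
        # 1. Pull relevant context from memory before calling the model.
        context = self.retrieve(question)
        prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question

        # 2. Let the model reason over the question plus retrieved context.
        reply = self.llm(prompt)

        # 3. Hypothetical convention: a reply of the form
        #    "CALL <tool> <argument>" asks the agent to invoke a tool.
        if reply.startswith("CALL "):
            name, _, arg = reply[len("CALL "):].partition(" ")
            return self.tools[name](arg)
        return reply
```

Because memory is just one callable here, swapping a toy retriever for a production vector store changes a single component and leaves the rest of the loop alone.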

When building agents, developers naturally focus on getting the LLM integration right and setting up proper tool use. Both are essential: you need solid reasoning capabilities and the ability to take meaningful actions in the real world.

But here’s what’s happening in the market: LLM capabilities across providers have become remarkably commoditized. Whether you choose Claude, OpenAI, or open-source alternatives, the reasoning quality for most agent use cases is now virtually indistinguishable. Tool use has also standardized — MCP, function calling, and agent frameworks work consistently across platforms.

When evaluating your agent, end customers don’t care about what model or framework runs under the hood. They care about the experience: Is your agent lightning-fast and responsive? Does it truly understand their needs and context? Can it remember previous conversations and instantly find exactly the right information when they need it?

This is why the infrastructure powering your agent memory is critical. The vector database behind the scenes determines whether your agent can handle real-world demands: retrieving accurate documents in milliseconds across millions of records, supporting millions of active users with multi-tenancy, and scaling seamlessly when growth accelerates from zero to viral overnight.

The Hidden Costs When Agent Developers Choose Wrong

This is the story every AI agent startup founder fears — and some have already experienced it.

We recently worked with a team whose conversational AI agent was thriving, handling thousands of conversations daily and growing steadily month over month. Their system ran on a lightweight vector database that supported fairly complex retrieval logic. Everything worked beautifully until it needed to scale.

As their user base surged and requests climbed into the millions, the system hit a wall. Query times slowed from milliseconds to seconds, then to tens of seconds, causing customers to leave the platform. Because the database lacked advanced features like metadata filtering and hybrid search, more experienced customers grew unhappy with the answer quality. To make matters worse, the database offered limited partitioning, making data isolation unreliable.

This is the hidden cost of infrastructure shortcuts: when success comes, wrong choices become expensive disasters.

When AI agent teams choose the wrong vector database, they don’t just hit technical limitations — they accumulate infrastructure debt that kills their agent’s potential at the worst possible moment:

Migration Complexity: Moving between databases isn’t easy. Different systems use incompatible indexing methods, data formats, and query languages. Teams often need to spend months rewriting core agent functionality.
Multi-Tenancy Challenges: Enterprise customers require strict data separation between tenants, but it’s difficult to retrofit this isolation onto databases that weren’t originally built for multiple tenants. Developers are left choosing between operational complexity and a degraded customer experience, or even compliance issues.
Search Quality Pain: Some vector databases lack full-text search support or performant metadata filtering. Without those backing up your retrieval pipeline, your agent gets stuck being “smart enough,” while competitors ship better search experiences.
The Cost of Missing Your Moment: The most devastating cost is watching your breakthrough moment slip away while you’re stuck debugging infrastructure. Your perfect product-market fit might arrive tomorrow — will your infrastructure be ready to handle success, or will you watch helplessly as the opportunity disappears forever?

Milvus: an Open-Source Vector Database Built to Power the Future

We understand that many developers feel overwhelmed when researching vector databases. The market is filled with dazzling benchmarks, biased recommendations, and demo-friendly solutions that perform well in testing but fail in production.

Milvus, an open-source vector database with 35K+ stars on GitHub and backing from the world’s largest AI companies, takes a different approach. It offers several deployment options behind a single API: developers can start with Milvus Lite for rapid experimentation and prototyping, deploy Standalone for production workloads, and scale to Cluster for distributed applications handling billions of vectors, all without rewriting application code.
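As a rough sketch of what moving between deployment modes can look like in practice, the snippet below selects the deployment purely through the connection URI. The MILVUS_URI environment variable, the default file name, and both helper functions are assumptions made for illustration. The one real API it relies on is pymilvus’s MilvusClient, which accepts a local .db file path for Milvus Lite or an http(s) endpoint for a Standalone or Cluster deployment.

```python
import os
from typing import Mapping, Optional

# Assumed default: a local file, which MilvusClient treats as Milvus Lite.
DEFAULT_LITE_PATH = "./agent_memory.db"


def resolve_milvus_uri(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Milvus URI from the environment, or a local Lite file.

    MILVUS_URI is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    return env.get("MILVUS_URI", DEFAULT_LITE_PATH)


def connect(uri: Optional[str] = None):
    """Open a client against whichever deployment the URI points at."""
    # Imported lazily so the URI logic above can be used and tested
    # without pymilvus installed.
    from pymilvus import MilvusClient

    return MilvusClient(uri=uri or resolve_milvus_uri())
```

With this shape, a laptop prototype runs against the local file, and pointing MILVUS_URI at something like http://localhost:19530 switches the same application code to a server deployment.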

But scalability is just the foundation. Milvus also provides advanced capabilities that help your agent hold up in real-world deployments:

Production-Grade Multi-Tenancy: Robust tenant isolation that works at billion-vector scale. Whether you’re serving 10 pilot customers or 10,000 enterprise accounts, each gets complete data separation with unified, predictable performance.
Billions-Scale Distributed Architecture: True linear scaling from thousands to billions of vectors across multiple nodes and data centers. When viral growth hits and your user base explodes overnight, add capacity by adding nodes — no expensive hardware upgrades, no architectural rewrites, no downtime.
Hybrid Search Excellence: Production AI agents need queries that combine semantic similarity with business logic, temporal constraints, and metadata filtering. Execute complex operations like “Find pricing documents John accessed in the last two weeks, mentioning API rate limits with sentiment analysis scores above 0.8” in a single, lightning-fast operation.
Real-Time Agent Memory: Streaming ingestion with immediate consistency means your agent incorporates new information instantly without rebuilding indexes or batch processing delays. When a user provides feedback or uploads a document, your agent knows about it immediately.
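To illustrate the idea behind tenant isolation combined with hybrid search, here is a small in-memory sketch. It is not Milvus code and does not reflect how Milvus executes queries internally: the Record type, hybrid_search function, and metadata keys are hypothetical, and a real engine would use an approximate-nearest-neighbor index with filtering instead of this brute-force scan. It only shows the logical order of operations: restrict to one tenant, apply the metadata predicate, then rank by vector similarity.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class Record:
    tenant_id: str
    vector: Sequence[float]
    metadata: Dict[str, object]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(
    records: List[Record],
    tenant_id: str,
    query_vector: Sequence[float],
    predicate: Callable[[Dict[str, object]], bool],
    top_k: int = 3,
) -> List[Tuple[float, Record]]:
    # Tenant isolation: only records owned by this tenant are considered.
    # Metadata filter: business rules are applied before ranking.
    candidates = [
        r for r in records
        if r.tenant_id == tenant_id and predicate(r.metadata)
    ]
    # Semantic ranking: score the survivors by similarity to the query.
    scored = [(cosine(query_vector, r.vector), r) for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

A query such as "pricing documents from the last two weeks" would then be expressed as a predicate like lambda m: m["doc_type"] == "pricing" and m["days_old"] <= 14, passed alongside the tenant ID and the query embedding.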

We just rolled out Milvus 2.6, delivering dozens of breakthrough innovations across cost reduction, advanced search capabilities, and architectural enhancements built for massive scale. Explore all the details in our launch blog, or join our webinar with James Luan, VP of Engineering at Zilliz, for an exclusive deep dive into what’s new in this release.

If You Want Zero Hassle — Try Zilliz Cloud

Milvus is completely open source and free to use forever. But if you’re a startup that would rather spend time on innovation than on managing Kubernetes clusters and tuning databases, we strongly recommend Zilliz Cloud, the fully managed Milvus service built by the original Milvus team.

With Zilliz Cloud, you get all the best of Milvus as well as advanced enterprise-grade features without the operational overhead:

Deploy in Minutes, Scale Automatically: One-click deployments with intelligent elastic scaling that automatically adapts to your agent’s usage patterns and traffic spikes.
Cost Optimization: Pay only for what you use with serverless scaling that automatically adjusts to your agent workload patterns. Many customers save 50% or more compared to alternatives, while also enjoying better performance and reliability.
Natural Language Query Interface: New MCP server support lets your agents interact with their memory using natural language: “Find documents similar to our last conversation about pricing” instead of complex query languages and API calls.
99.95% Uptime SLA: Your agents stay online, your customers stay happy, and you focus on building breakthrough features instead of debugging infrastructure failures. We handle the operational complexity so you can focus on what makes your agent special.
Enterprise-Grade Security by Default: SOC2 Type II and ISO27001 certified with comprehensive Role-Based Access Control and BYOC. Your enterprise customers’ compliance requirements are handled from day one, not bolted on later.
Global Scale, Local Performance: Available on AWS, Azure, and GCP across various regions worldwide, ensuring sub-100ms latency wherever your users are located. Your agent feels fast whether accessed from Silicon Valley or Singapore.

For any company focused on AI innovation, technical teams should spend their time on application breakthroughs and customer value creation, not on the complex and tedious operational work of database management. Leave the infrastructure complexity to us and truly liberate your team’s productivity and creativity to build the future.

Ready to Scale with Confidence?

If you’re building an AI agent, now is the time to think about infrastructure. Don’t let success catch you unprepared. Build on a stack that grows with you.

And yes, we can help you migrate from Pinecone, Weaviate, pgvector, or any other platform.

Whatever you’re paying now, we can likely do it for half the cost, with better performance.

Try Zilliz Cloud for free today or reach out to sales for more information.

Let’s build for the boom.