🧭 Overview
In this chapter, you’ll build a minimal Agentic RAG stack that can intelligently choose between two tools—vector database search and live web search—based on query type. You’ll learn how to:
Differentiate between classic RAG and Agentic RAG
Expose two MCP tools:
vector_rag and web_search
Implement a rule-based client that routes queries dynamically
Estimated time: 10–15 minutes
Full source code: GitHub
🔍 Concept: From RAG to Agentic RAG
Imagine an LLM standing at a fork in the road: one path leads to a curated ML knowledge base, the other to the vast web. How does it choose?
That’s the essence of Agentic RAG—a retrieval-augmented generation system where the model doesn’t just retrieve context, it chooses how to retrieve it.
Classic RAG is a two-step process:
Retrieve relevant documents from a vector database using semantic similarity.
Generate an answer using those documents as context.
Agentic RAG adds decision-making:
The model chooses which tool to use (e.g., vector DB vs. web search).
It may chain multiple tools or retry based on feedback.
This unlocks more flexible and intelligent retrieval strategies.
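To make the difference concrete, here is a deliberately tiny toy sketch (stub functions only, not the stack we build below): the classic pipeline always retrieves from one place, while the agentic version inserts a decision step before retrieval.
```python
# Toy illustration only (not the real stack): classic RAG always hits the same
# store, while agentic RAG first decides which retrieval tool to use.
def vector_store_search(query: str) -> str:
    return f"[docs about '{query}' from the local vector DB]"

def web_search(query: str) -> str:
    return f"[fresh web results for '{query}']"

def classic_rag(query: str) -> str:
    context = vector_store_search(query)  # retrieval strategy is fixed
    return f"LLM answer grounded in: {context}"

def agentic_rag(query: str) -> str:
    # The "agent" step: pick a tool before retrieving (here, a trivial heuristic)
    tool = vector_store_search if "gradient" in query.lower() else web_search
    return f"LLM answer grounded in: {tool(query)}"

print(agentic_rag("What is gradient descent?"))   # routed to the vector store
print(agentic_rag("Latest AI conference news"))   # routed to web search
```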
🧰 Prerequisites
Install the minimal stack:
```bash
pip install fastmcp chromadb duckduckgo-search sentence-transformers
```
💡 Tip: If you're offline, stub the web search tool to return a fixed string.
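For example, an offline stub for the web search tool can be as small as the following (a hypothetical placeholder — swap it in wherever the real web_search function is defined in Step 2):
```python
# Offline stub: returns a canned string instead of calling a live search API
def web_search(query: str, k: int = 3) -> str:
    return f"[offline stub] No live results fetched for: {query}"
```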
🧱 Step 1: Build a Tiny Vector Store
🧠 What’s a Vector Store?
A vector store is a database optimized for storing and searching high-dimensional vectors—numerical representations of text, images, or other data. In RAG systems, it’s used to:
Encode documents into embeddings (vectors)
Compare a query’s embedding to stored ones
Return the most semantically similar documents
This allows your model to “understand” meaning beyond keywords.
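You can see this with sentence-transformers directly: a query about minimizing loss sits much closer to a gradient-descent sentence than to an unrelated one. A small illustrative check, assuming the same all-MiniLM-L6-v2 model we use below:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query, doc, unrelated = (
    "How do I minimize a loss function?",
    "Gradient descent is an optimization algorithm used to minimize loss functions.",
    "Qdrant is a popular open-source vector database.",
)
q, d, u = model.encode([query, doc, unrelated])
print("query vs doc:      ", float(util.cos_sim(q, d)))  # expected: higher similarity
print("query vs unrelated:", float(util.cos_sim(q, u)))  # expected: lower similarity
```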
🛠️ What We’re Building
We’ll create a local vector store using ChromaDB and populate it with a few machine learning FAQ-style documents. These will be used to answer ML-related queries.
Create prepare_store.py:
```python
import chromadb
from sentence_transformers import SentenceTransformer
import os
# List of ML FAQ documents to be embedded and stored in the vector DB
docs = [
"Gradient descent is an optimization algorithm used to minimize loss functions.",
"The bias-variance tradeoff explains model generalization.",
"Transformers use attention mechanisms to model long-range dependencies.",
"Qdrant is a popular open-source vector database.",
"Retrieval-Augmented Generation (RAG) augments LLMs with external context."
]
# Generate unique IDs for each document
ids = [f"doc-{i}" for i in range(len(docs))]
# Load the sentence transformer model for embedding the documents
model = SentenceTransformer("all-MiniLM-L6-v2")
# Set up the path for persistent vector storage inside the .venv directory
venv_dir = os.path.join(os.path.dirname(__file__), "..", ".venv")
rag_store_path = os.path.join(venv_dir, ".rag_store")
# Initialize a persistent ChromaDB client at the specified path
client = chromadb.PersistentClient(path=rag_store_path)
# Get or create a collection named "ml_faq" for storing the documents and embeddings
collection = client.get_or_create_collection("ml_faq")
# Convert documents to vector embeddings using the model
embeddings = model.encode(docs).tolist()
# Insert (or update) the documents, their IDs, and embeddings into the collection
collection.upsert(documents=docs, ids=ids, embeddings=embeddings)
# Print the number of documents in the collection to verify successful insertion
print(f"Number of documents in collection: {collection.count()}")
# Confirmation message showing where the vector store is persisted
print(f"✅ Vector store ready and persisted at {rag_store_path}")Run it:
🛠️ Step 2: Expose Two MCP Tools (Vector RAG + Web Search)
🧠 Why Two Tools?
We want our system to intelligently choose between:
vector_rag: for ML-specific queries answered from our local store
web_search: for general queries that need fresh or broad context
This dual-tool setup is the foundation of Agentic RAG.
Create server.py. The full implementation lives in the linked GitHub repo; the sketch below is a minimal version that assumes FastMCP's @mcp.tool() decorator and HTTP transport, the ChromaDB store from Step 1, and the duckduckgo-search package — treat it as a starting point and defer to the repository code where details differ:
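```python
# server.py — minimal sketch of the MCP server exposing both tools.
# Assumptions: FastMCP 2.x, the ChromaDB store from Step 1, duckduckgo-search.
import json
import os

import chromadb
from duckduckgo_search import DDGS
from fastmcp import FastMCP
from sentence_transformers import SentenceTransformer

mcp = FastMCP("agentic-rag")

# Reuse the persistent store created by prepare_store.py
venv_dir = os.path.join(os.path.dirname(__file__), "..", ".venv")
collection = chromadb.PersistentClient(path=os.path.join(venv_dir, ".rag_store")).get_or_create_collection("ml_faq")
model = SentenceTransformer("all-MiniLM-L6-v2")


@mcp.tool()
def vector_rag(query: str, k: int = 3) -> str:
    """Query the local ML FAQ vector DB and return top-k matches."""
    hits = collection.query(query_embeddings=model.encode([query]).tolist(), n_results=k)
    results = [
        {"doc": doc, "distance": dist}
        for doc, dist in zip(hits["documents"][0], hits["distances"][0])
    ]
    return json.dumps({"source": "vector_db", "query": query, "results": results})


@mcp.tool()
def web_search(query: str, k: int = 3) -> str:
    """Run a web search for general queries."""
    with DDGS() as ddgs:
        results = [
            {"title": r["title"], "url": r["href"], "snippet": r["body"]}
            for r in ddgs.text(query, max_results=k)
        ]
    return json.dumps({"source": "web", "query": query, "results": results})


if __name__ == "__main__":
    # Older FastMCP releases name this transport "streamable-http"
    mcp.run(transport="http", host="127.0.0.1", port=8000)
```
Serve it:
```bash
python server.py
```
🧠 Why this matters: These tools are now callable by any MCP client or LLM host. You’ve just built a modular retrieval layer.
💻 Step 3: Write a Simple Agentic Client (Rule-Based Router)
🧠 What’s a Router?
A router is a lightweight decision engine. It inspects the query and chooses the best tool. In this example, we’ll use simple keyword matching to route queries.
Create client.py: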
```python
import asyncio
import pprint

from fastmcp import Client
from fastmcp.client.transports import StreamableHttpTransport

SERVER_URL = "http://localhost:8000/mcp"
pp = pprint.PrettyPrinter(indent=2, width=100)


def is_ml_query(query: str) -> bool:
    """
    Checks if the query is related to machine learning by searching for ML-specific keywords.
    Returns True if any keyword is found in the query.
    """
    ml_keywords = ["machine learning", "gradient", "transformer", "bias-variance", "qdrant", "RAG", "vector db"]
    return any(k.lower() in query.lower() for k in ml_keywords)


async def main():
    # Set up the transport and client for communicating with the MCP server
    transport = StreamableHttpTransport(url=SERVER_URL)
    client = Client(transport)
    print("\n🚀 Connecting to FastMCP server at:", SERVER_URL)

    async with client:
        # 1. Ping to test connectivity
        print("\n🔗 Testing server connectivity...")
        await client.ping()
        print("✅ Server is reachable!\n")

        # 2. List server tools
        print("🛠️ Available tools:")
        tools = await client.list_tools()  # Fetches the list of available tools from the server
        pp.pprint(tools)

        # 3. Get the query
        query = input("Your query: ")  # User inputs their query

        # 4. Call the appropriate tool
        ml_query = is_ml_query(query)  # Determine if the query is ML-related
        print(f"Machine learning related query? {'Yes' if ml_query else 'No'}")

        # Extract tool names for routing logic
        tool_names = [tool.name for tool in tools]

        # Route the query to the appropriate tool based on its type and availability
        if ml_query and "vector_rag" in tool_names:
            print("🧠 Routing to vector_rag …")
            result = await client.call_tool("vector_rag", {"query": query})
        elif "web_search" in tool_names:
            print("🌐 Routing to web_search …")
            result = await client.call_tool("web_search", {"query": query})
        else:
            result = {"error": "No suitable tool found."}

        # Pretty print the result from the tool
        pp.pprint(result)


if __name__ == "__main__":
    # Run the async main function
    asyncio.run(main())
```
💡 Aside: the routing logic itself doesn’t depend on MCP. If the same two tools were instead exposed behind a plain REST wrapper (hypothetical GET /tools and POST /tools/call endpoints — not something this stack provides out of the box), the same rule-based router fits in a few synchronous lines:
```python
import requests

# Hypothetical REST wrapper around the same two tools (not part of the MCP server above)
SERVER = "http://localhost:3333"


def list_tools():
    # Assumes a GET /tools endpoint returning a JSON list of {"name": ...} objects
    return requests.get(f"{SERVER}/tools").json()


def call_tool(name, args):
    # Assumes a POST /tools/call endpoint accepting {"name": ..., "arguments": ...}
    return requests.post(f"{SERVER}/tools/call", json={"name": name, "arguments": args}).json()


def is_ml_query(query: str) -> bool:
    ml_keywords = ["machine learning", "gradient", "transformer", "bias-variance", "qdrant", "RAG", "vector db"]
    return any(k.lower() in query.lower() for k in ml_keywords)


def main():
    query = input("Your query: ")
    tools = [t["name"] for t in list_tools()]
    print("📦 Available tools:", tools)

    if is_ml_query(query) and "vector_rag" in tools:
        print("🧠 Routing to vector_rag …")
        result = call_tool("vector_rag", {"query": query})
    elif "web_search" in tools:
        print("🌐 Routing to web_search …")
        result = call_tool("web_search", {"query": query})
    else:
        result = {"error": "No suitable tool found."}

    print(result)


if __name__ == "__main__":
    main()
```
Back to the MCP client — run it:
```bash
python client.py
```
Example output:
```
🚀 Connecting to FastMCP server at: http://localhost:8000/mcp
🔗 Testing server connectivity...
✅ Server is reachable!
🛠️ Available tools:
[ Tool(name='vector_rag', title=None, description='Query the local ML FAQ vector DB and return top-k matches.', inputSchema={'properties': {'query': {'title': 'Query', 'type': 'string'}, 'k': {'default': 3, 'title': 'K', 'type': 'integer'}}, 'required': ['query'], 'type': 'object'}, outputSchema=None, annotations=None, meta={'_fastmcp': {'tags': []}}),
Tool(name='web_search', title=None, description='Run a web search for general queries.', inputSchema={'properties': {'query': {'title': 'Query', 'type': 'string'}, 'k': {'default': 3, 'title': 'K', 'type': 'integer'}}, 'required': ['query'], 'type': 'object'}, outputSchema=None, annotations=None, meta={'_fastmcp': {'tags': []}})]
Your query: What is gradient descent?
Machine learning related query? Yes
🧠 Routing to vector_rag …
CallToolResult(content=[ TextContent(type='text', text='{"source":"vector_db","query":"What is gradient descent?","results":[{"doc":"Gradient descent is an optimization algorithm used to minimize loss functions.","distance":0.20000019669532776},{"doc":"The bias-variance tradeoff explains model generalization.","distance":1.3403548002243042},{"doc":"Qdrant is a popular open-source vector database.","distance":1.5645623207092285}]}', annotations=None, meta=None)],
...
```
🧠 Why this matters: This client mimics an LLM host making tool choices. You’ve just built a basic agentic system.
✅ Summary Checklist
You now have:
A working vector store for ML FAQs
Two MCP tools exposed for retrieval
A rule-based client that routes queries intelligently
Next steps could include:
Replacing the rule-based router with a small LLM (see the sketch after this list)
Adding fallback logic or retry strategies
Logging tool usage for analytics
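For the first of these, one possible shape of an LLM-backed router is sketched below. Everything here is hypothetical scaffolding: build_router_prompt and call_llm are placeholders (wire call_llm to whichever model or provider you use); only the tool names and descriptions come from this chapter.
```python
# llm_router.py — hypothetical sketch of LLM-based tool routing
TOOLS = {
    "vector_rag": "Query the local ML FAQ vector DB and return top-k matches.",
    "web_search": "Run a web search for general queries.",
}

def build_router_prompt(query: str) -> str:
    # Ask the model to answer with exactly one tool name
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    return (
        "You are a router. Pick exactly one tool for the user query.\n"
        f"Tools:\n{tool_lines}\n"
        f"Query: {query}\n"
        "Answer with only the tool name."
    )

def call_llm(prompt: str) -> str:
    # Placeholder: connect this to your LLM of choice (local or hosted)
    raise NotImplementedError("Wire this up to a real model.")

def choose_tool(query: str) -> str:
    answer = call_llm(build_router_prompt(query)).strip().lower()
    # Fall back to web_search if the model returns something unexpected
    return answer if answer in TOOLS else "web_search"
```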

