Your Agent Should Decide When to Use RAG
Most RAG tutorials assume every query needs retrieval.
That's a bad default.
In a real agentic system, retrieval is just one tool — and often the wrong one.
So instead of building another RAG pipeline, I added RAG to Narad — my LangGraph chatbot — as a tool the agent can choose to ignore.
That decision shaped every design choice that followed.
Where Narad Started
Narad is a multi-session chatbot built with LangGraph, Streamlit, and PostgreSQL.
If you want the full backstory, I've written about the initial build and adding tools in previous posts.
Before RAG, the architecture looked like this:
- A StateGraph with a chat_node (GPT-4o with tool calling)
- A ToolNode for execution
- tools_condition for routing
- Tools: calculator, Tavily search, yFinance
Tools were defined once at startup and statically bound to the LLM.
The checkpointer persisted conversations to PostgreSQL, so threads survived restarts. The frontend streamed responses via Streamlit.
The system worked.
Narad is a simple agent — single LLM, multiple tools, autonomous routing. The model decides which tools to call, how many, and whether to call any at all. No hardcoded if-else chains.
The question was:
How do you add per-user document retrieval without turning the entire system into a RAG pipeline?
Choosing the Right Vector Store
I've used Pinecone before. My YouTube RAG project uses it with namespaces, hybrid search, and reranking.
So my first instinct was: use Pinecone again.
But that instinct was wrong.
The actual use case here is simple:
- User uploads a PDF
- Asks questions during the session
- Leaves
- Data is no longer needed
The embeddings are ephemeral by design.
Using Pinecone here would mean:
- Managing namespaces
- Cleaning up data
- Paying for storage you don't need
- Making API calls for every retrieval
All for data that is intentionally temporary.
FAISS, on the other hand:
- Lives in memory
- Dies on restart
- Requires zero cleanup
- Has zero network overhead
That's not a limitation here.
That's the feature.
Tradeoff: If the user refreshes, the index is gone and the PDF must be re-uploaded. For a session-scoped system, that's acceptable. For a production system with persistent documents, it's not.
Takeaway: Your vector store should match your data lifecycle. Persistent data → persistent store. Ephemeral data → in-memory is enough.
The Ingestion Pipeline (Not the Interesting Part)
If you've built RAG before, you already know this flow:
- Upload PDF
- Chunk it
- Embed it
- Store vectors
- Create retriever
Here's the implementation:
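import os
import tempfile
from typing import Optional

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Per-thread, in-memory registries. `embeddings` is the embedding model
# instance, assumed to be created elsewhere in the module.
_THREAD_RETRIEVERS: dict = {}
_THREAD_METADATA: dict = {}


def ingest_pdf(file_bytes: bytes, thread_id: str, filename: Optional[str] = None) -> dict:
    # PyPDFLoader reads from a path, so write the upload to a temp file first.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(file_bytes)
        tmp_path = tmp.name
    try:
        docs = PyPDFLoader(tmp_path).load()
        chunks = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=200
        ).split_documents(docs)
        vector_store = FAISS.from_documents(chunks, embeddings)
        retriever = vector_store.as_retriever(
            search_type="similarity", search_kwargs={"k": 4}
        )
        # Keys are always strings so lookups behave the same for UUIDs and str.
        _THREAD_RETRIEVERS[str(thread_id)] = retriever
        _THREAD_METADATA[str(thread_id)] = {
            "filename": filename,
            "pages": len(docs),
            "chunks": len(chunks),
        }
        return _THREAD_METADATA[str(thread_id)]
    finally:
        # Always delete the temp file, even if loading or embedding fails.
        os.remove(tmp_path)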
The important part isn't ingestion.
It's how this retriever integrates with the agent.
The Real Problem: Context Leakage
The RAG tool needs access to a thread-specific retriever.
Each user has a different FAISS index.
So how do you pass that into the tool?
Naive approach
@tool
def rag_tool(query: str, thread_id: str) -> dict:
    retriever = _THREAD_RETRIEVERS.get(thread_id)
    # ...
This pushes responsibility onto the LLM.
Which is a mistake.
You're asking the model to:
- Remember a UUID
- Pass it correctly
- Not hallucinate it
LLMs are bad at plumbing.
The Fix: Closures
The solution is to remove the problem from the LLM entirely.
from langchain_core.tools import tool


def make_rag_tool(thread_id: str):
    # Resolved once per tool instance; the inner function closes over it.
    retriever = _THREAD_RETRIEVERS.get(str(thread_id))

    @tool
    def rag_tool(query: str) -> dict:
        """
        Retrieve relevant information from the uploaded PDF document.
        Use this when the user asks questions about their uploaded document.
        """
        if retriever is None:
            return {"error": "No PDF uploaded."}
        results = retriever.invoke(query)
        return {
            "query": query,
            "context": [doc.page_content for doc in results],
            "metadata": [doc.metadata for doc in results],
        }

    return rag_tool
The key idea:
Don't let the LLM manage context it shouldn't even see.
Capture it in a closure instead.
The model only sees rag_tool(query: str). It has no idea there's a thread-specific vector store wired in behind the scenes.
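To make the closure idea concrete without pulling in LangChain, here is a minimal plain-Python sketch. `make_lookup_tool` and the `_FAKE_RETRIEVERS` dictionary are hypothetical stand-ins for `make_rag_tool` and `_THREAD_RETRIEVERS`, and each "retriever" is just a function. What it illustrates is that the returned function's signature exposes only `query`; the thread ID never appears as a parameter.

```python
import inspect
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for _THREAD_RETRIEVERS: thread ID -> retriever-like callable.
_FAKE_RETRIEVERS: Dict[str, Callable[[str], List[str]]] = {
    "thread-a": lambda q: [f"A-doc match for {q!r}"],
    "thread-b": lambda q: [f"B-doc match for {q!r}"],
}


def make_lookup_tool(thread_id: str) -> Callable[[str], dict]:
    # The retriever is resolved once, here, and captured by the inner function.
    retriever: Optional[Callable[[str], List[str]]] = _FAKE_RETRIEVERS.get(str(thread_id))

    def lookup(query: str) -> dict:
        if retriever is None:
            return {"error": "No PDF uploaded."}
        return {"query": query, "context": retriever(query)}

    return lookup


tool_a = make_lookup_tool("thread-a")
tool_b = make_lookup_tool("thread-b")

# Only `query` is visible to a caller (or to a schema generator).
visible_params = list(inspect.signature(tool_a).parameters)
```

Each call to `make_lookup_tool` yields an independent function bound to one thread's data, so two sessions cannot read each other's index even though both tools look identical from the outside.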
Dynamic Tool Binding
Before RAG, the tools list was static:
tools = [calculator, search, get_stock_info]
# bind_tools returns a new runnable, so keep the result.
llm_with_tools = llm.bind_tools(tools)
Now it's dynamic:
def get_tools_for_thread(thread_id: str):
    tools = base_tools.copy()
    # Retrievers are stored under str(thread_id), so normalise before the lookup.
    if str(thread_id) in _THREAD_RETRIEVERS:
        tools.append(make_rag_tool(thread_id))
    return tools
And inside the node:
tools = get_tools_for_thread(thread_id)
llm_with_tools = llm.bind_tools(tools)  # rebound on every call; invoke this, not llm
This is the moment Narad stopped being a chatbot.
And started becoming an agent.
Tools are no longer static. They are part of runtime state.
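One detail worth spelling out: in LangChain, `bind_tools` returns a new runnable rather than modifying the model in place, so the node has to invoke the returned object. The sketch below uses a hypothetical `FakeLLM` stand-in (not a LangChain class) to show the shape of a node that resolves tools per call. It assumes the thread ID arrives under `config["configurable"]["thread_id"]`, the convention LangGraph uses for checkpointed threads; the tool names are plain strings standing in for real tool objects.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical stand-ins; in Narad these would be real tools and retrievers.
BASE_TOOLS: List[str] = ["calculator", "search", "get_stock_info"]
THREAD_RETRIEVERS: Dict[str, Any] = {"t-1": object()}


class FakeLLM:
    """Mimics the one property that matters here: bind_tools returns a NEW object."""

    def __init__(self, tools: Tuple[str, ...] = ()) -> None:
        self.tools = tools

    def bind_tools(self, tools: List[str]) -> "FakeLLM":
        return FakeLLM(tuple(tools))

    def invoke(self, messages: List[str]) -> str:
        return f"reply using tools={list(self.tools)}"


def get_tools_for_thread(thread_id: str) -> List[str]:
    tools = BASE_TOOLS.copy()
    if str(thread_id) in THREAD_RETRIEVERS:
        tools.append("rag_tool")  # placeholder for make_rag_tool(thread_id)
    return tools


llm = FakeLLM()


def chat_node(state: Dict[str, Any], config: Dict[str, Any]) -> Dict[str, Any]:
    thread_id = config["configurable"]["thread_id"]
    # Bind per call and keep the returned object; `llm` itself stays unbound.
    llm_with_tools = llm.bind_tools(get_tools_for_thread(thread_id))
    return {"messages": [llm_with_tools.invoke(state["messages"])]}


with_pdf = chat_node({"messages": ["hi"]}, {"configurable": {"thread_id": "t-1"}})
without_pdf = chat_node({"messages": ["hi"]}, {"configurable": {"thread_id": "t-2"}})
```

Here the thread with an uploaded document gets `rag_tool` in its tool list and the other thread does not, while the shared `llm` object is never mutated.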
Multi-Tool Orchestration
This was the first moment the system felt different.
Query:
"Explain KNN from the document and tell me the next Liverpool match."
What happened:
- rag_tool → handled KNN from the PDF
- search → fetched the match schedule
Single query. Multiple tools. One response.
No routing logic. No special casing.
Just the model deciding what to do.
That's the difference between:
- a RAG pipeline
- an agent with tools
Edge Cases and Fixes
Building the happy path took a day. Handling the edge cases took longer.
No fallback after empty RAG results
RAG returned nothing. The agent stopped.
Fix (temporary): a prompt instruction to fall back to search.
Reality: This is a hack.
Better solution: graph-level conditional routing. That's on the roadmap.
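A graph-level fallback could look roughly like the routing function below. This is a sketch of the idea, not Narad's planned implementation: the node names `chat_node` and `search_fallback`, the helper names, and the assumption that the RAG result is available as a dictionary with a `context` list are all hypothetical. In LangGraph, a function of this shape would typically be passed to `add_conditional_edges`.

```python
from typing import Any, Dict, List


def has_usable_context(rag_result: Dict[str, Any], min_chars: int = 1) -> bool:
    """True if the RAG result contains at least one non-blank chunk."""
    if "error" in rag_result:
        return False
    chunks: List[str] = rag_result.get("context", [])
    return any(len(chunk.strip()) >= min_chars for chunk in chunks)


def route_after_rag(rag_result: Dict[str, Any]) -> str:
    """Return the name of the next node (hypothetical node names)."""
    return "chat_node" if has_usable_context(rag_result) else "search_fallback"
```

Moving the decision into code makes the fallback deterministic: an empty retrieval always reaches search, regardless of how the model interprets a prompt instruction.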
PDF persists across threads
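The Streamlit file uploader didn't reset on a new chat, because the widget kept the same identity across threads.
Fix: give the uploader a thread-specific key so Streamlit treats it as a new widget: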
key=f"pdf_uploader_{thread_id}"
FAISS disappears on refresh
Expected behavior.
In-memory store → process restart = data gone.
Tradeoff accepted.
Ghost responses during streaming
Tool status and streamed output overlapped in the UI.
Fix: separate rendering containers for tool status and streamed text.
What I'd Do Differently
1. Move fallback logic out of prompts
Prompt-based routing works — until it doesn't.
If this were production, I'd move this into the graph layer.
2. Decouple backend from Streamlit
Right now everything is coupled through session state.
A FastAPI backend would:
- Separate concerns
- Enable multiple clients
- Make the system production-ready
3. Use a persistent vector store (if needed)
FAISS works because the use case is session-scoped.
For real users:
- Pinecone
- Qdrant
- User-scoped namespaces
What Changed
The biggest shift wasn't adding RAG.
It was making it optional.
In most tutorials, every query goes through retrieve → augment → generate.
In Narad, the agent decides when retrieval is worth doing — and when it's not.
That's the difference between demos and systems.
What's Next
Narad now supports:
- Calculator
- Web search
- Stock lookup
- Document Q&A
All orchestrated dynamically by the agent.
Next:
- Human-in-the-loop workflows (interrupt())
- Confidence-based routing
- FastAPI backend
If you want to explore the code: GitHub
Try it live: narad-chat.onrender.com
This is part 3 of the Narad series.