Your Agent Should Decide When to Use RAG

Most RAG tutorials assume every query needs retrieval.

That's a bad default.

In a real agentic system, retrieval is just one tool — and often the wrong one.

So instead of building another RAG pipeline, I added RAG to Narad — my LangGraph chatbot — as a tool the agent can choose to ignore.

That decision changed everything.

Where Narad Started

Narad is a multi-session chatbot built with LangGraph, Streamlit, and PostgreSQL.

If you want the full backstory, I've written about the initial build and adding tools in previous posts.

Before RAG, the architecture looked like this:

A StateGraph with a chat_node (GPT-4o with tool calling)
A ToolNode for execution
tools_condition for routing
Tools: calculator, Tavily search, yFinance

Tools were defined once at startup and statically bound to the LLM.

The checkpointer persisted conversations to PostgreSQL, so threads survived restarts. The frontend streamed responses via Streamlit.

The system worked.

Narad is a simple agent — single LLM, multiple tools, autonomous routing. The model decides which tools to call, how many, and whether to call any at all. No hardcoded if-else chains.

The question was:

How do you add per-user document retrieval without turning the entire system into a RAG pipeline?

Choosing the Right Vector Store

I've used Pinecone before. My YouTube RAG project uses it with namespaces, hybrid search, and reranking.

So my first instinct was: use Pinecone again.

But that instinct was wrong.

The actual use case here is simple:

User uploads a PDF
Asks questions during the session
Leaves
Data is no longer needed

The embeddings are ephemeral by design.

Using Pinecone here would mean:

Managing namespaces
Cleaning up data
Paying for storage you don't need
Making API calls for every retrieval

All for data that is intentionally temporary.

FAISS, on the other hand:

Lives in memory
Dies on restart
Requires zero cleanup
Has zero network overhead

That's not a limitation here.

That's the feature.

Tradeoff: If the user refreshes, the index is gone and the PDF must be re-uploaded. For a session-scoped system, that's acceptable. For a production system with persistent documents, it's not.

Takeaway: Your vector store should match your data lifecycle. Persistent data → persistent store. Ephemeral data → in-memory is enough.

The Ingestion Pipeline (Not the Interesting Part)

The ingestion pipeline isn't the interesting part.

If you've built RAG before, you already know this flow:

Upload PDF
Chunk it
Embed it
Store vectors
Create retriever

Here's the implementation:

def ingest_pdf(file_bytes: bytes, thread_id: str, filename: str = None) -> dict:
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(file_bytes)
        tmp_path = tmp.name

    try:
        docs = PyPDFLoader(tmp_path).load()
        chunks = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=200
        ).split_documents(docs)

        vector_store = FAISS.from_documents(chunks, embeddings)
        retriever = vector_store.as_retriever(
            search_type="similarity", search_kwargs={"k": 4}
        )

        _THREAD_RETRIEVERS[str(thread_id)] = retriever
        _THREAD_METADATA[str(thread_id)] = {
            "filename": filename,
            "pages": len(docs),
            "chunks": len(chunks),
        }

        return _THREAD_METADATA[str(thread_id)]
    finally:
        os.remove(tmp_path)

The important part isn't ingestion.

It's how this retriever integrates with the agent.

The Real Problem: Context Leakage

The RAG tool needs access to a thread-specific retriever.

Each user has a different FAISS index.

So how do you pass that into the tool?

Naive approach

@tool
def rag_tool(query: str, thread_id: str) -> dict:
    retriever = _THREAD_RETRIEVERS.get(thread_id)
    # ...

This pushes responsibility onto the LLM.

Which is a mistake.

You're asking the model to:

Remember a UUID
Pass it correctly
Not hallucinate it

LLMs are bad at plumbing.

The Fix: Closures

The solution is to remove the problem from the LLM entirely.

def make_rag_tool(thread_id: str):
    retriever = _THREAD_RETRIEVERS.get(str(thread_id))

    @tool
    def rag_tool(query: str) -> dict:
        """
        Retrieve relevant information from the uploaded PDF document.
        Use this when the user asks questions about their uploaded document.
        """
        if retriever is None:
            return {"error": "No PDF uploaded."}

        results = retriever.invoke(query)
        return {
            "query": query,
            "context": [doc.page_content for doc in results],
            "metadata": [doc.metadata for doc in results],
        }

    return rag_tool

The key idea:

Don't let the LLM manage context it shouldn't even see.

Capture it in a closure instead.

The model only sees rag_tool(query: str). It has no idea there's a thread-specific vector store wired in behind the scenes.

Dynamic Tool Binding

Before RAG, the tools list was static:

tools = [calculator, search, get_stock_info]
llm.bind_tools(tools)

Now it's dynamic:

def get_tools_for_thread(thread_id: str):
    tools = base_tools.copy()
    if thread_id in _THREAD_RETRIEVERS:
        tools.append(make_rag_tool(thread_id))
    return tools

And inside the node:

tools = get_tools_for_thread(thread_id)
llm.bind_tools(tools)

This is the moment Narad stopped being a chatbot.

And started becoming an agent.

Tools are no longer static. They are part of runtime state.

Multi-Tool Orchestration

This was the first moment the system felt different.

Query:

"Explain KNN from the document and tell me the next Liverpool match."

What happened:

rag_tool → handled KNN from PDF
search → fetched match schedule

Single query. Multiple tools. One response.

No routing logic. No special casing.

Just the model deciding what to do.

That's the difference between:

a RAG pipeline
an agent with tools

Edge Cases and Fixes

Building the happy path took a day. Handling the edge cases took longer.

No fallback after empty RAG results

RAG returned nothing. The agent stopped.

Fix (temporary): prompt instruction to fallback to search.

Reality: This is a hack.

Better solution: graph-level conditional routing. That's on the roadmap.

🔁 PDF persists across threads

Streamlit uploader didn't reset on new chat.

Fix:

key=f"pdf_uploader_{thread_id}"

FAISS disappears on refresh

Expected behavior.

In-memory store → process restart = data gone.

Tradeoff accepted.

Ghost responses during streaming

Tool status and stream output overlapped in UI.

Fix: separate rendering containers for tool status and streamed text.

What I'd Do Differently

1. Move fallback logic out of prompts

Prompt-based routing works — until it doesn't.

If this were production, I'd move this into the graph layer.

2. Decouple backend from Streamlit

Right now everything is coupled through session state.

A FastAPI backend would:

Separate concerns
Enable multiple clients
Make the system production-ready

3. Use persistent vector store (if needed)

FAISS works because the use case is session-scoped.

For real users:

Pinecone
Qdrant
User-scoped namespaces

What Changed

The biggest shift wasn't adding RAG.

It was making it optional.

In most tutorials, every query goes through retrieve → augment → generate.

In Narad, the agent decides when retrieval is worth doing — and when it's not.

That's the difference between demos and systems.

What's Next

Narad now supports:

Calculator
Web search
Stock lookup
Document Q&A

All orchestrated dynamically by the agent.

Human-in-the-loop workflows (interrupt())
Confidence-based routing
FastAPI backend

If you want to explore the code: GitHub

Try it live: narad-chat.onrender.com

*This is part 3 of the Narad series.

Part 1: Initial Build

Part 2: tool calling

Your Agent Should Decide When to Use RAG

Where Narad Started

Choosing the Right Vector Store

The Ingestion Pipeline (Not the Interesting Part)

The Real Problem: Context Leakage

The Fix: Closures

Dynamic Tool Binding

Multi-Tool Orchestration

Edge Cases and Fixes

No fallback after empty RAG results

🔁 PDF persists across threads

FAISS disappears on refresh

Ghost responses during streaming

What I'd Do Differently

What Changed

What's Next

Comments

Building Narad — A Production LangGraph Chatbot

What Actually Happens When You Add Tools to a LangGraph Agent

More from this blog

I Evaluated My AI Agent. Three Decisions Were Wrong.

Building AI Agents That Know When Not to Answer

Building AI Agents That Know When NOt

What Actually Happens When You Add Tools to a LangGraph Agent

Command Palette

Where Narad Started

Choosing the Right Vector Store

The Ingestion Pipeline (Not the Interesting Part)

The Real Problem: Context Leakage

The Fix: Closures

Dynamic Tool Binding

Multi-Tool Orchestration

Edge Cases and Fixes

No fallback after empty RAG results

🔁 PDF persists across threads

FAISS disappears on refresh

Ghost responses during streaming

What I'd Do Differently

What Changed

What's Next

Comments

Building Narad — A Production LangGraph Chatbot

What Actually Happens When You Add Tools to a LangGraph Agent

More from this blog