Skip to main content

Command Palette

Search for a command to run...

Your Agent Should Decide When to Use RAG

Updated
6 min read
A
Exploring developments in AI and building things that I learn. Currently diving deep into LLMs, RAG systems, and Agentic AI. Here to document what i learn.

Most RAG tutorials assume every query needs retrieval.

That's a bad default.

In a real agentic system, retrieval is just one tool — and often the wrong one.

So instead of building another RAG pipeline, I added RAG to Narad — my LangGraph chatbot — as a tool the agent can choose to ignore.

That decision changed everything.


Where Narad Started

Narad is a multi-session chatbot built with LangGraph, Streamlit, and PostgreSQL.

If you want the full backstory, I've written about the initial build and adding tools in previous posts.

Before RAG, the architecture looked like this:

  • A StateGraph with a chat_node (GPT-4o with tool calling)
  • A ToolNode for execution
  • tools_condition for routing
  • Tools: calculator, Tavily search, yFinance

Tools were defined once at startup and statically bound to the LLM.

The checkpointer persisted conversations to PostgreSQL, so threads survived restarts. The frontend streamed responses via Streamlit.

The system worked.

Narad is a simple agent — single LLM, multiple tools, autonomous routing. The model decides which tools to call, how many, and whether to call any at all. No hardcoded if-else chains.

The question was:

How do you add per-user document retrieval without turning the entire system into a RAG pipeline?


Choosing the Right Vector Store

I've used Pinecone before. My YouTube RAG project uses it with namespaces, hybrid search, and reranking.

So my first instinct was: use Pinecone again.

But that instinct was wrong.

The actual use case here is simple:

  • User uploads a PDF
  • Asks questions during the session
  • Leaves
  • Data is no longer needed

The embeddings are ephemeral by design.

Using Pinecone here would mean:

  • Managing namespaces
  • Cleaning up data
  • Paying for storage you don't need
  • Making API calls for every retrieval

All for data that is intentionally temporary.

FAISS, on the other hand:

  • Lives in memory
  • Dies on restart
  • Requires zero cleanup
  • Has zero network overhead

That's not a limitation here.

That's the feature.

Tradeoff: If the user refreshes, the index is gone and the PDF must be re-uploaded. For a session-scoped system, that's acceptable. For a production system with persistent documents, it's not.

Takeaway: Your vector store should match your data lifecycle. Persistent data → persistent store. Ephemeral data → in-memory is enough.


The Ingestion Pipeline (Not the Interesting Part)

The ingestion pipeline isn't the interesting part.

If you've built RAG before, you already know this flow:

  • Upload PDF
  • Chunk it
  • Embed it
  • Store vectors
  • Create retriever

Here's the implementation:

def ingest_pdf(file_bytes: bytes, thread_id: str, filename: str = None) -> dict:
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(file_bytes)
        tmp_path = tmp.name

    try:
        docs = PyPDFLoader(tmp_path).load()
        chunks = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=200
        ).split_documents(docs)

        vector_store = FAISS.from_documents(chunks, embeddings)
        retriever = vector_store.as_retriever(
            search_type="similarity", search_kwargs={"k": 4}
        )

        _THREAD_RETRIEVERS[str(thread_id)] = retriever
        _THREAD_METADATA[str(thread_id)] = {
            "filename": filename,
            "pages": len(docs),
            "chunks": len(chunks),
        }

        return _THREAD_METADATA[str(thread_id)]
    finally:
        os.remove(tmp_path)

The important part isn't ingestion.

It's how this retriever integrates with the agent.


The Real Problem: Context Leakage

The RAG tool needs access to a thread-specific retriever.

Each user has a different FAISS index.

So how do you pass that into the tool?

Naive approach

@tool
def rag_tool(query: str, thread_id: str) -> dict:
    retriever = _THREAD_RETRIEVERS.get(thread_id)
    # ...

This pushes responsibility onto the LLM.

Which is a mistake.

You're asking the model to:

  • Remember a UUID
  • Pass it correctly
  • Not hallucinate it

LLMs are bad at plumbing.


The Fix: Closures

The solution is to remove the problem from the LLM entirely.

def make_rag_tool(thread_id: str):
    retriever = _THREAD_RETRIEVERS.get(str(thread_id))

    @tool
    def rag_tool(query: str) -> dict:
        """
        Retrieve relevant information from the uploaded PDF document.
        Use this when the user asks questions about their uploaded document.
        """
        if retriever is None:
            return {"error": "No PDF uploaded."}

        results = retriever.invoke(query)
        return {
            "query": query,
            "context": [doc.page_content for doc in results],
            "metadata": [doc.metadata for doc in results],
        }

    return rag_tool

The key idea:

Don't let the LLM manage context it shouldn't even see.

Capture it in a closure instead.

The model only sees rag_tool(query: str). It has no idea there's a thread-specific vector store wired in behind the scenes.


Dynamic Tool Binding

Before RAG, the tools list was static:

tools = [calculator, search, get_stock_info]
llm.bind_tools(tools)

Now it's dynamic:

def get_tools_for_thread(thread_id: str):
    tools = base_tools.copy()
    if thread_id in _THREAD_RETRIEVERS:
        tools.append(make_rag_tool(thread_id))
    return tools

And inside the node:

tools = get_tools_for_thread(thread_id)
llm.bind_tools(tools)

This is the moment Narad stopped being a chatbot.

And started becoming an agent.

Tools are no longer static. They are part of runtime state.


Multi-Tool Orchestration

This was the first moment the system felt different.

Query:

"Explain KNN from the document and tell me the next Liverpool match."

What happened:

  • rag_tool → handled KNN from PDF
  • search → fetched match schedule

Single query. Multiple tools. One response.

No routing logic. No special casing.

Just the model deciding what to do.

That's the difference between:

  • a RAG pipeline
  • an agent with tools

Edge Cases and Fixes

Building the happy path took a day. Handling the edge cases took longer.

No fallback after empty RAG results

RAG returned nothing. The agent stopped.

Fix (temporary): prompt instruction to fallback to search.

Reality: This is a hack.

Better solution: graph-level conditional routing. That's on the roadmap.

🔁 PDF persists across threads

Streamlit uploader didn't reset on new chat.

Fix:

key=f"pdf_uploader_{thread_id}"

FAISS disappears on refresh

Expected behavior.

In-memory store → process restart = data gone.

Tradeoff accepted.

Ghost responses during streaming

Tool status and stream output overlapped in UI.

Fix: separate rendering containers for tool status and streamed text.


What I'd Do Differently

1. Move fallback logic out of prompts

Prompt-based routing works — until it doesn't.

If this were production, I'd move this into the graph layer.

2. Decouple backend from Streamlit

Right now everything is coupled through session state.

A FastAPI backend would:

  • Separate concerns
  • Enable multiple clients
  • Make the system production-ready

3. Use persistent vector store (if needed)

FAISS works because the use case is session-scoped.

For real users:

  • Pinecone
  • Qdrant
  • User-scoped namespaces

What Changed

The biggest shift wasn't adding RAG.

It was making it optional.

In most tutorials, every query goes through retrieve → augment → generate.

In Narad, the agent decides when retrieval is worth doing — and when it's not.

That's the difference between demos and systems.


What's Next

Narad now supports:

  • Calculator
  • Web search
  • Stock lookup
  • Document Q&A

All orchestrated dynamically by the agent.

Next:

  • Human-in-the-loop workflows (interrupt())
  • Confidence-based routing
  • FastAPI backend

If you want to explore the code: GitHub

Try it live: narad-chat.onrender.com


*This is part 3 of the Narad series.

Part 1: Initial Build

Part 2: tool calling

8 views

Building Narad — A Production LangGraph Chatbot

Part 2 of 2

A behind-the-scenes series on building Narad — a production-grade LangGraph chatbot — from a basic stateful graph to a tool-augmented assistant with cloud persistence, streaming, and observability. Every post covers real problems, real fixes, and the reasoning behind each decision.

Start from the beginning

What Actually Happens When You Add Tools to a LangGraph Agent

Most LangGraph tutorials stop at "it works." This post is about what breaks right after that. There's a big difference between understanding tool calling in theory and wiring it into a production-grad