Building AI Agents That Know When Not to Answer

Most AI support agents fail in the same way.

They answer questions they should not.

A wrong feature explanation is annoying.
A wrong refund is expensive.
An insensitive reply to a frustrated customer can turn into a PR problem.

The issue is not that the models are not good enough.

The issue is that they are given too much responsibility.

What matters in production is not just generating good responses. It is deciding when the agent should answer, when it should ask for help, and when it should step aside entirely.

That is what I built.

A customer support agent that uses conditional routing, human approval for high-risk actions, and retrieval-grounded responses. This is the same pattern used by companies like Intercom, Zendesk, and Sierra in production systems.

Live demo: customer-support-agent-ui.onrender.com

GitHub: github.com/octavian115/customer-support-agent

Why This Architecture

The core pattern is simple.

classify → route → act → gate

It shows up in most production agent systems because it balances automation with control.

Every customer message first goes through a classifier. The agent decides what kind of request it is: a feature question, a technical issue, a billing request, or a user who clearly wants a human.

That decision determines what happens next.

FAQ and technical questions go through a retrieval pipeline. The agent pulls relevant documentation, generates a grounded answer, and responds immediately. No human involvement needed.
Billing actions such as refunds, cancellations, or plan changes also use retrieval to find the correct policy. But instead of acting directly, the agent drafts a response and pauses. A human reviewer can approve, edit, or reject it before anything is sent.
Escalations such as angry users, edge cases, or low-confidence responses are handed off completely. The system includes a summary so the human does not have to start from scratch.

This creates a system with graduated autonomy.

Simple queries are handled automatically. High-risk decisions stay under human control.

In practice, this means the agent can handle most of the volume while humans stay involved where mistakes are costly.

Another important detail is that the system knows when it does not have enough information.

If retrieval confidence is low, the agent does not attempt an answer. It escalates instead.

A demo system will always produce a response.
A production system should know when not to.

The Architecture

Here is the full decision flow. Each message takes a path determined by its intent and the system's confidence.

flowchart TD
    A[Customer Message] --> B[Classifier]
    B -->|faq / technical| C[RAG Node]
    B -->|billing| D[RAG Node]
    B -->|escalation| E[Escalation Node]
    B -->|greeting / off-topic / closing| F[Simple Response Nodes]

    C -->|confidence ≥ 0.60| G[Response Node]
    C -->|confidence < 0.60| E

    D --> H[Billing Node — HITL interrupt]

    G --> I[END]
    H --> I
    E --> I
    F --> I
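
The edges leaving the classifier can be sketched as a plain lookup table. This is a minimal illustration, not the code from the repository: the node names and the `route_after_classifier` helper are hypothetical, and falling back to escalation for an unrecognized intent is an assumption in keeping with the rest of the design.

```python
# Hypothetical node names. Billing and faq/technical both go through
# retrieval first, as in the flowchart above.
CLASSIFIER_ROUTES: dict[str, str] = {
    "faq": "rag_node",
    "technical": "rag_node",
    "billing": "rag_node",
    "escalation": "escalation_node",
    "greeting": "simple_response_node",
    "off_topic": "simple_response_node",
    "closing": "simple_response_node",
}


def route_after_classifier(state: dict) -> str:
    """Return the name of the next node for the classified intent."""
    # Assumption: a missing or unknown intent goes to a human, not to a guess.
    return CLASSIFIER_ROUTES.get(state.get("intent"), "escalation_node")
```

Keeping the mapping in one table means that adding a new intent is a one-line change, and the fallback gives the graph a safe default.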

The system has two interfaces:

A customer chat where users interact with the agent
An agent dashboard where human reviewers handle approvals and escalations

The stack is intentionally modular:

LangGraph for orchestration and stateful workflows
FastAPI for the backend API
Streamlit for the frontend
Pinecone for vector search
GPT-4o for reasoning and generation
OpenAI embeddings for retrieval

Key Design Decisions

This is where most of the production reliability comes from.

1. Structured Output for Classification

The classifier determines the path every message takes. If this step is unreliable, everything downstream breaks.

Instead of parsing free-form text, the model is constrained to return a fixed schema using Pydantic.

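from typing import Literal
from pydantic import BaseModel

class IntentClassification(BaseModel):
    # The model must return exactly one of these labels.
    intent: Literal[
        "greeting", "faq", "technical", "billing",
        "escalation", "off_topic", "closing"
    ]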

This removes ambiguity from the output format. The model can still pick the wrong category, but it cannot invent a new one.

The model cannot return variations like "billing question" or "technical issue". It must return one of the predefined categories.

In production, this eliminates an entire class of parsing bugs and edge cases.

2. Confidence-Based Escalation

Retrieval systems always return something. That does not mean the result is good.

The agent uses the similarity score from Pinecone as a confidence signal. If the score is below a threshold, the system does not generate a response.

It escalates instead.

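def route_after_rag(state: SupportState) -> str:
    # Billing always goes to the human-approval node, regardless of score.
    if state["intent"] == "billing":
        return "billing_node"
    # CONFIDENCE_THRESHOLD is set in one place (0.60 in the flowchart above).
    if state["confidence"] >= CONFIDENCE_THRESHOLD:
        return "response_node"
    return "escalation_node"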

This is a simple mechanism, but it changes behavior completely.

A demo system always produces an answer.
A production system avoids answering when it is uncertain.

The threshold is configurable in one place, which makes it easy to tune based on real usage.
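
One way to turn retrieval results into that confidence value is to take the best similarity score among the returned matches. The sketch below assumes each match is a dict with a numeric `score` key, which is an illustrative shape rather than the exact Pinecone response type. The `confidence_from_matches` and `should_answer` helpers are hypothetical, and whether the real system uses the top score or some other aggregate is not stated here.

```python
CONFIDENCE_THRESHOLD = 0.60  # value shown in the flowchart


def confidence_from_matches(matches: list[dict]) -> float:
    """Use the best similarity score as the confidence signal.

    Assumption: each match looks like {"score": 0.81, ...}.
    """
    if not matches:
        return 0.0  # nothing retrieved, so the router will escalate
    return max(m["score"] for m in matches)


def should_answer(matches: list[dict]) -> bool:
    """True when retrieval is strong enough to generate a response."""
    return confidence_from_matches(matches) >= CONFIDENCE_THRESHOLD
```

An empty result set maps to a confidence of zero, so "no documents found" and "documents found but weakly related" both end in escalation.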

3. Human-in-the-Loop at the Right Point

Human approval is not applied globally. It is inserted exactly where risk is highest.

In this system, that point is billing actions.

When the agent prepares something like a refund or cancellation, the graph pauses. The current state is saved, and a reviewer is shown the proposed action.

The reviewer can:

approve it as is
edit and approve
reject it

Once a decision is made, the graph resumes from the same point.

This approach keeps low-risk paths fast while adding control only where it is needed.
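
The three reviewer outcomes can be sketched as a small function that runs once the graph resumes. This covers only what happens to the draft after the decision. The pause and resume would be handled by the orchestration layer (the flowchart marks it as an interrupt), and the `ReviewDecision` type and `resolve_draft` helper are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class ReviewDecision:
    action: Literal["approve", "edit", "reject"]
    edited_text: Optional[str] = None  # only used when action == "edit"


def resolve_draft(draft: str, decision: ReviewDecision) -> Optional[str]:
    """Return the message to send, or None if nothing should be sent."""
    if decision.action == "approve":
        return draft
    if decision.action == "edit":
        if not decision.edited_text:
            raise ValueError("an edit decision needs edited_text")
        return decision.edited_text
    if decision.action == "reject":
        return None
    raise ValueError(f"unknown action: {decision.action!r}")
```

Returning `None` on rejection makes the "nothing is sent" case explicit, so the caller cannot accidentally send the original draft.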

4. State as the Communication Layer

All nodes communicate through a shared state object.

The classifier writes intent
The RAG node writes retrieved_docs and confidence
The billing node reads both before deciding what to do

Nodes do not call each other directly. They read from and write to state.

For example, if the classifier sets intent = "billing", the routing logic sends the request to the billing node without any hardcoded coupling.

This makes the system modular.

You can swap out the retrieval layer, change the classifier, or update the LLM without rewriting the rest of the graph.
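
A stripped-down version of this idea, without any framework, might look like the following. The field names `intent`, `retrieved_docs`, and `confidence` come from the description above. The field types, the stand-in node bodies, and the `run` helper are assumptions made for illustration.

```python
from typing import TypedDict


class SupportState(TypedDict, total=False):
    # Field names are from the article; the types are assumptions.
    intent: str
    retrieved_docs: list[str]
    confidence: float


def classifier_node(state: SupportState) -> dict:
    # Stand-in for the LLM call that classifies the message.
    return {"intent": "billing"}


def rag_node(state: SupportState) -> dict:
    # Stand-in for retrieval: canned documents and a canned score.
    return {"retrieved_docs": ["<policy text>"], "confidence": 0.82}


def run(nodes, state: SupportState) -> SupportState:
    """Merge each node's partial update into the shared state, in order."""
    for node in nodes:
        state = {**state, **node(state)}
    return state
```

Each node only returns the keys it owns, so replacing `rag_node` with a different retrieval backend does not touch the classifier or the router.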

Evolution from Previous Projects

This project builds on my earlier work.

Narad (GitHub) was my first LangGraph chatbot. It had tools like web search, calculator, and stock lookup. It was stateful and observable, but every message followed the same path.

My YouTube Q&A RAG app focused on retrieval. It used Pinecone, hybrid search, and reranking to improve answer quality.

This project combines both and adds decision-making.

Narad focused on tool usage
The RAG app focused on knowledge retrieval
This system focuses on routing, control, and reliability

The shift is from generating responses to deciding how to handle requests.

What I'd Add Next

There are a few clear next steps.

PostgreSQL checkpointing: Replace in-memory state with persistent storage so conversations survive restarts.
Evaluations: Build a dataset of queries across all intents and measure classification accuracy, retrieval quality, and response correctness. This is a missing layer in most agent systems.
Multi-agent workflows: Split billing into a dedicated sub-agent that can handle multi-step processes like eligibility checks and confirmations.
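
For the evaluation step, a first pass at classification accuracy could be as simple as the sketch below. It assumes a labeled dataset of `(message, expected_intent)` pairs and a `classify` callable that wraps the classifier. Both the data shape and the `per_intent_accuracy` helper are hypothetical.

```python
from collections import defaultdict


def per_intent_accuracy(examples, classify):
    """Compute classification accuracy separately for each expected intent.

    examples: iterable of (message, expected_intent) pairs.
    classify: callable mapping a message to a predicted intent.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for message, expected in examples:
        total[expected] += 1
        if classify(message) == expected:
            correct[expected] += 1
    return {intent: correct[intent] / total[intent] for intent in total}
```

Reporting accuracy per intent rather than as a single number makes it visible when a high-risk category such as billing is classified worse than the rest.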

Try It

The demo is live and set up to show different behaviors.

Try sending:

a basic product question
a refund request
something not covered in the docs
an angry message

Each one takes a different path through the system.

Live demo: customer-support-agent-ui.onrender.com

GitHub: github.com/octavian115/customer-support-agent

LinkedIn: www.linkedin.com/in/ayushkumar115

Closing Thoughts

The biggest shift in building AI systems is moving from generation to decision-making.

The question is no longer just what the model should say.

It is whether the model should act at all.

That is where most of the engineering effort goes in production systems.