Building AI Agents That Know When Not to Answer
Most AI support agents fail in the same way.
They answer questions they should not.
A wrong feature explanation is annoying.
A wrong refund is expensive.
An insensitive reply to a frustrated customer can turn into a PR problem.
The issue is not that the models are not good enough.
The issue is that they are given too much responsibility.
What matters in production is not just generating good responses. It is deciding when the agent should answer, when it should ask for help, and when it should step aside entirely.
That is what I built.
A customer support agent that uses conditional routing, human approval for high-risk actions, and retrieval-grounded responses. This is the same pattern used by companies like Intercom, Zendesk, and Sierra in production systems.
Live demo: customer-support-agent-ui.onrender.com
GitHub: github.com/octavian115/customer-support-agent
Why This Architecture
The core pattern is simple.
classify → route → act → gate
It shows up in most production agent systems because it balances automation with control.
Every customer message first goes through a classifier. The agent decides what kind of request it is: a feature question, a technical issue, a billing request, or a user who clearly wants a human.
That decision determines what happens next.
FAQ and technical questions go through a retrieval pipeline. The agent pulls relevant documentation, generates a grounded answer, and responds immediately. No human involvement needed.
Billing actions such as refunds, cancellations, or plan changes also use retrieval to find the correct policy. But instead of acting directly, the agent drafts a response and pauses. A human reviewer can approve, edit, or reject it before anything is sent.
Escalations such as angry users, edge cases, or low-confidence responses are handed off completely. The system includes a summary so the human does not have to start from scratch.
This creates a system with graduated autonomy.
Simple queries are handled automatically. High-risk decisions stay under human control.
In practice, this means the agent can handle most of the volume while humans stay involved where mistakes are costly.
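The first routing step described above can be sketched as a plain lookup. This is a minimal illustration, not the project's actual code: the node names in `ROUTES` are hypothetical, and sending unknown labels to escalation is an assumption on my part.

```python
# Hypothetical node names, chosen to mirror the flow described above.
ROUTES: dict[str, str] = {
    "faq": "rag_node",
    "technical": "rag_node",
    "billing": "rag_node",  # billing also retrieves the policy first
    "escalation": "escalation_node",
    "greeting": "simple_response_node",
    "off_topic": "simple_response_node",
    "closing": "simple_response_node",
}


def route_after_classifier(intent: str) -> str:
    """Map a classified intent to the next node.

    Assumption: any label missing from ROUTES is escalated to a human.
    """
    return ROUTES.get(intent, "escalation_node")
```

Keeping the mapping in one table means a new intent is a one-line change, and the fallback keeps the failure mode conservative.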
Another important detail is that the system knows when it does not have enough information.
If retrieval confidence is low, the agent does not attempt an answer. It escalates instead.
A demo system will always produce a response.
A production system should know when not to.
The Architecture
Here is the full decision flow. Each message takes a path that depends on its intent and the system's confidence.
flowchart TD
A[Customer Message] --> B[Classifier]
B -->|faq / technical| C[RAG Node]
B -->|billing| D[RAG Node]
B -->|escalation| E[Escalation Node]
B -->|greeting / off-topic / closing| F[Simple Response Nodes]
C -->|confidence ≥ 0.60| G[Response Node]
C -->|confidence < 0.60| E
D --> H[Billing Node — HITL interrupt]
G --> I[END]
H --> I
E --> I
F --> I
The system has two interfaces:
- A customer chat where users interact with the agent
- An agent dashboard where human reviewers handle approvals and escalations
The stack is intentionally modular:
- LangGraph for orchestration and stateful workflows
- FastAPI for the backend API
- Streamlit for the frontend
- Pinecone for vector search
- GPT-4o for reasoning and generation
- OpenAI embeddings for retrieval
Key Design Decisions
This is where most of the production reliability comes from.
1. Structured Output for Classification
The classifier determines the path every message takes. If this step is unreliable, everything downstream breaks.
Instead of parsing free-form text, the model is constrained to return a fixed schema using Pydantic.
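from typing import Literal

from pydantic import BaseModel


class IntentClassification(BaseModel):
    # The model must pick exactly one of these labels.
    intent: Literal[
        "greeting", "faq", "technical", "billing",
        "escalation", "off_topic", "closing"
    ]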
This removes ambiguity from the output format.
The model cannot return variations like "billing question" or "technical issue". It must return one of the predefined categories.
In production, this eliminates an entire class of parsing bugs and edge cases.
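To make the effect concrete, here is a standard-library sketch of the check that a `Literal` field implies. `parse_intent` and `ALLOWED_INTENTS` are hypothetical names used only for illustration. In the real system Pydantic performs this validation, so this is a stand-in rather than the project's code.

```python
from typing import Literal, get_args

Intent = Literal[
    "greeting", "faq", "technical", "billing",
    "escalation", "off_topic", "closing",
]

# Derive the allowed labels from the type so the two cannot drift apart.
ALLOWED_INTENTS = frozenset(get_args(Intent))


def parse_intent(raw: str) -> str:
    """Return raw if it is an allowed intent, otherwise raise ValueError."""
    if raw not in ALLOWED_INTENTS:
        raise ValueError(f"unexpected intent label: {raw!r}")
    return raw
```

A label such as "billing question" fails loudly at the boundary instead of silently matching no route later in the graph.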
2. Confidence-Based Escalation
Retrieval systems always return something. That does not mean the result is good.
The agent uses the similarity score from Pinecone as a confidence signal. If the score is below a threshold (0.60 in the decision flow), the system does not generate a response.
It escalates instead.
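def route_after_rag(state: SupportState) -> str:
    # Billing always goes to the human-approval node.
    if state["intent"] == "billing":
        return "billing_node"
    # Otherwise answer only when retrieval confidence clears the threshold.
    if state["confidence"] >= CONFIDENCE_THRESHOLD:
        return "response_node"
    else:
        return "escalation_node"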
This is a simple mechanism, but it changes behavior completely.
A demo system always produces an answer.
A production system avoids answering when it is uncertain.
The threshold is configurable in one place, which makes it easy to tune based on real usage.
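One way to keep the threshold in a single place is sketched below. Two things here are assumptions, not details from the project: reading the value from an environment variable, and using the best match score as the confidence. The `matches` argument is a plain list of dicts with a `score` key, standing in for vector search results.

```python
import os

# Assumption: the threshold is read once from the environment,
# falling back to the 0.60 used in the decision flow.
CONFIDENCE_THRESHOLD = float(os.environ.get("CONFIDENCE_THRESHOLD", "0.60"))


def retrieval_confidence(matches: list[dict]) -> float:
    """Use the best similarity score as confidence; no matches means 0.0."""
    return max((m["score"] for m in matches), default=0.0)


def is_confident(
    matches: list[dict], threshold: float = CONFIDENCE_THRESHOLD
) -> bool:
    """True when the best match clears the threshold."""
    return retrieval_confidence(matches) >= threshold
```

Treating an empty result set as zero confidence means a retrieval outage escalates instead of producing an ungrounded answer.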
3. Human-in-the-Loop at the Right Point
Human approval is not applied globally. It is inserted exactly where risk is highest.
In this system, that point is billing actions.
When the agent prepares something like a refund or cancellation, the graph pauses. The current state is saved, and a reviewer is shown the proposed action.
The reviewer can:
- approve it as is
- edit and approve
- reject it
Once a decision is made, the graph resumes from the same point.
This approach keeps low-risk paths fast while adding control only where it is needed.
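The three reviewer outcomes can be modeled as a small decision object applied to the draft once the graph resumes. This is a sketch under my own naming: `ReviewDecision` and `apply_review` are hypothetical, and the pause and resume mechanics themselves are left out.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class ReviewDecision:
    action: Literal["approve", "edit", "reject"]
    edited_text: Optional[str] = None  # only used when action == "edit"


def apply_review(draft: str, decision: ReviewDecision) -> Optional[str]:
    """Return the text to send to the customer, or None if rejected."""
    if decision.action == "approve":
        return draft
    if decision.action == "edit":
        if not decision.edited_text:
            raise ValueError("an edit decision needs edited_text")
        return decision.edited_text
    if decision.action == "reject":
        return None
    raise ValueError(f"unknown action: {decision.action!r}")
```

Returning `None` on rejection forces the caller to handle the "send nothing" case explicitly.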
4. State as the Communication Layer
All nodes communicate through a shared state object.
- The classifier writes intent
- The RAG node writes retrieved_docs and confidence
- The billing node reads both before deciding what to do
Nodes do not call each other directly. They read from and write to state.
For example, if the classifier sets intent = "billing", the routing logic sends the request to the billing node without any hardcoded coupling.
This makes the system modular.
You can swap out the retrieval layer, change the classifier, or update the LLM without rewriting the rest of the graph.
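The shared-state idea can be shown without any framework. In this sketch each node returns only the keys it owns, and a small runner merges them into the state. The `message` field, the keyword-based classifier, and the fixed retrieval result are placeholders of mine. The real `SupportState` may have different fields.

```python
from typing import TypedDict


class SupportState(TypedDict, total=False):
    message: str  # hypothetical field holding the customer's text
    intent: str
    retrieved_docs: list[str]
    confidence: float


def classifier_node(state: SupportState) -> dict:
    # Stand-in for the LLM classifier: a keyword check, for illustration only.
    intent = "billing" if "refund" in state["message"].lower() else "faq"
    return {"intent": intent}


def rag_node(state: SupportState) -> dict:
    # Stand-in for vector search: a fixed document and score.
    return {"retrieved_docs": ["(placeholder policy text)"], "confidence": 0.9}


def run(state: SupportState, nodes) -> SupportState:
    """Apply each node in order, merging its partial update into state."""
    for node in nodes:
        state = {**state, **node(state)}
    return state
```

Because `rag_node` never imports or calls `classifier_node`, either one can be replaced as long as it writes the same keys.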
Evolution from Previous Projects
This project builds on my earlier work.
Narad (GitHub) was my first LangGraph chatbot. It had tools like web search, calculator, and stock lookup. It was stateful and observable, but every message followed the same path.
My YouTube Q&A RAG app focused on retrieval. It used Pinecone, hybrid search, and reranking to improve answer quality.
This project combines both and adds decision-making.
- Narad focused on tool usage
- RAG app focused on knowledge retrieval
- This system focuses on routing, control, and reliability
The shift is from generating responses to deciding how to handle requests.
What I'd Add Next
There are a few clear next steps.
- PostgreSQL checkpointing: Replace in-memory state with persistent storage so conversations survive restarts.
- Evaluations: Build a dataset of queries across all intents and measure classification accuracy, retrieval quality, and response correctness. This is a missing layer in most agent systems.
- Multi-agent workflows: Split billing into a dedicated sub-agent that can handle multi-step processes like eligibility checks and confirmations.
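For the evaluation step, a first metric could be classification accuracy broken down by intent. The sketch below assumes a labeled list of (message, expected intent) pairs and any callable that maps a message to a label. The function name and the input shape are hypothetical.

```python
from collections import defaultdict
from typing import Callable


def per_intent_accuracy(
    examples: list[tuple[str, str]],
    classify: Callable[[str], str],
) -> dict[str, float]:
    """Return accuracy per expected intent over (message, expected) pairs."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for message, expected in examples:
        total[expected] += 1
        if classify(message) == expected:
            correct[expected] += 1
    return {intent: correct[intent] / total[intent] for intent in total}
```

Breaking the score down by intent matters because a classifier can look accurate overall while still misrouting the rare, high-risk billing and escalation cases.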
Try It
The demo is live and set up to show different behaviors.
Try sending:
- a basic product question
- a refund request
- something not covered in the docs
- an angry message
Each one takes a different path through the system.
Live demo: customer-support-agent-ui.onrender.com
GitHub: github.com/octavian115/customer-support-agent
LinkedIn: www.linkedin.com/in/ayushkumar115
Closing Thoughts
The biggest shift in building AI systems is moving from generation to decision-making.
The question is no longer just what the model should say.
It is whether the model should act at all.
That is where most of the engineering effort goes in production systems.