Optimizing Agentic RAG with Graph-Based State Machines and Tool-Use Loops

Standard Retrieval-Augmented Generation (RAG) pipelines often fail when faced with multi-hop queries or documents requiring synthesis across disparate sources. The industry is shifting from linear Retrieve -> Augment -> Generate chains toward Agentic RAG: a pattern where an LLM orchestrates its own retrieval strategy, evaluates the relevance of results, and iterates until a high-confidence answer is formed.

In this post, we walk through the architectural move from static chains to graph-based state machines, with implementation details in TypeScript: a shared state schema, grading and query-rewrite nodes, conditional routing, and tool calls.

The Limitations of Linear RAG

Linear RAG assumes a single retrieval step is sufficient. In production, this breaks down in three specific scenarios:

Ambiguous Queries: The user's prompt lacks specificity, leading to irrelevant vector search results.
Multi-Step Reasoning: Answering the prompt requires data from two different indices (e.g., financial reports and market news).
Hallucination Recovery: The retrieved context is irrelevant, but the LLM attempts to answer anyway because the pipeline provides no feedback loop.

To solve this, we treat RAG as a directed graph where nodes represent actions (searching, grading, generating) and edges represent conditional logic based on the LLM's output.

Designing the State Machine

A robust Agentic RAG system requires a clear definition of state. Unlike a stateless API call, an agent needs to track its history, the current set of retrieved documents, and a 'retry' counter to prevent infinite loops.

Defining the State Schema

Using a library like LangGraph.js, we define our state as a shared object accessible by all nodes in the graph. LangGraph.js allows developers to build stateful, multi-actor applications with LLMs by modeling logic as nodes and edges in a graph.

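import type { BaseMessage } from "@langchain/core/messages";
import type { Document } from "@langchain/core/documents";

interface AgentState {
  messages: BaseMessage[];   // conversation history
  documents: Document[];     // current set of retrieved documents
  isRelevant: boolean;       // the grader's latest verdict
  iterationCount: number;    // retry counter that bounds the loop
}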
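Nodes in a graph like this usually return only the fields they changed, and the runtime merges that partial update into the shared state. A minimal, framework-free sketch of that merge follows. `Message`, `Doc`, `State`, and `applyUpdate` are hypothetical stand-ins rather than LangGraph.js APIs, and the merge rules (append messages, replace everything else) are an assumption chosen for illustration.

```typescript
// Simplified stand-ins for BaseMessage and Document (hypothetical shapes).
interface Message { role: "user" | "assistant" | "tool"; content: string; }
interface Doc { id: string; text: string; }

interface State {
  messages: Message[];
  documents: Doc[];
  isRelevant: boolean;
  iterationCount: number;
}

// Merge a node's partial update into the current state without mutating it.
// Assumed rules: messages are appended, every other field is replaced.
function applyUpdate(state: State, update: Partial<State>): State {
  return {
    messages: [...state.messages, ...(update.messages ?? [])],
    documents: update.documents ?? state.documents,
    isRelevant: update.isRelevant ?? state.isRelevant,
    iterationCount: update.iterationCount ?? state.iterationCount,
  };
}

const initial: State = {
  messages: [{ role: "user", content: "What drove revenue last quarter?" }],
  documents: [],
  isRelevant: false,
  iterationCount: 0,
};

// A retrieval node would return only the documents it found.
const afterRetrieve = applyUpdate(initial, {
  documents: [{ id: "d1", text: "Revenue rose on cloud sales." }],
});
```

Returning a fresh object on every step keeps each intermediate state inspectable, which helps when debugging why a loop took an unexpected path.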

The Core Workflow: Grade, Rewrite, Search

A widely used pattern combines Corrective RAG (CRAG) with Self-RAG: grade what was retrieved, rewrite the query when the results are poor, and generate only from verified context. This involves three primary nodes:

1. The Retrieval Node

This node performs the initial vector search. However, instead of passing results directly to the generator, it passes them to a 'Grader'.

2. The Grader Node

We use a small, fast model (like GPT-4o-mini or Claude 3.5 Haiku) to perform a binary evaluation: Is this document actually useful for the query?

If the grader returns 'irrelevant', the graph triggers a conditional edge to a 'Rewrite' node. This node uses the LLM to transform the user's query into a better search term for a web search tool or a different vector index.
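The grading step can be sketched as a filter over the retrieved documents. This is a minimal sketch under stated assumptions: `Doc`, `Judge`, `GradeResult`, `gradeDocuments`, and `keywordJudge` are hypothetical names, and the keyword-overlap judge is only a deterministic stand-in for the small-model call described above (a real judge would be an asynchronous LLM request).

```typescript
interface Doc { id: string; text: string; }

// A judge answers one binary question: is this document useful for the query?
type Judge = (query: string, doc: Doc) => boolean;

interface GradeResult { documents: Doc[]; isRelevant: boolean; }

// Keep only the documents the judge accepts, and flag the result as
// relevant when at least one document survives.
function gradeDocuments(query: string, docs: Doc[], judge: Judge): GradeResult {
  const kept = docs.filter((doc) => judge(query, doc));
  return { documents: kept, isRelevant: kept.length > 0 };
}

// Stand-in judge: accept a document if it contains any query word
// longer than three characters.
const keywordJudge: Judge = (query, doc) => {
  const words = query.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
  const text = doc.text.toLowerCase();
  return words.some((w) => text.includes(w));
};

const graded = gradeDocuments(
  "quarterly revenue growth",
  [
    { id: "a", text: "Revenue grew again this quarter." },
    { id: "b", text: "The office moved to a new building." },
  ],
  keywordJudge,
);
```

When `isRelevant` comes back false, the conditional edge would route to the rewrite node instead of the generator.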

3. The Generation Node

Only when documents are verified as relevant does the system proceed to generation. If the generator detects it still cannot answer, it can signal the graph to loop back to the search phase with a refined query.

Implementing Conditional Logic

The power of this architecture lies in the conditional edges. In TypeScript, this looks like a router function that determines the next step based on the current state.

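// Router for the conditional edge after grading: returns the name of the next node.
const decideToGenerate = (
  state: AgentState,
): "generate" | "fail_gracefully" | "transform_query" => {
  // Verified context: proceed to answer generation.
  if (state.isRelevant) {
    return "generate";
  }
  // Retry budget exhausted: stop looping and return a fallback response.
  if (state.iterationCount > 3) {
    return "fail_gracefully";
  }
  // Otherwise rewrite the query and search again.
  return "transform_query";
};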
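To show how a router like this drives the loop, here is a framework-free sketch of an executor that runs retrieve, grade, and the conditional edge until it reaches a terminal step. Everything in it (`LoopState`, `Route`, `Deps`, `runGraph`, `MAX_REWRITES`) is hypothetical scaffolding, the callbacks are synchronous stubs standing in for real asynchronous LLM and search calls, and the router is restated so the block is self-contained.

```typescript
interface LoopState {
  query: string;
  documents: string[];
  isRelevant: boolean;
  iterationCount: number;
  answer?: string;
}

type Route = "generate" | "fail_gracefully" | "transform_query";

const MAX_REWRITES = 3;

// Same decision order as the router: relevance first, then the retry cap.
function route(state: LoopState): Route {
  if (state.isRelevant) return "generate";
  if (state.iterationCount > MAX_REWRITES) return "fail_gracefully";
  return "transform_query";
}

// Injected behaviour, so the loop can be exercised without a model or an index.
interface Deps {
  retrieve: (query: string) => string[];
  grade: (query: string, docs: string[]) => boolean;
  rewrite: (query: string) => string;
  generate: (query: string, docs: string[]) => string;
}

function runGraph(query: string, deps: Deps): LoopState {
  let state: LoopState = { query, documents: [], isRelevant: false, iterationCount: 0 };
  for (;;) {
    const documents = deps.retrieve(state.query);
    state = { ...state, documents, isRelevant: deps.grade(state.query, documents) };
    const next = route(state);
    if (next === "generate") {
      return { ...state, answer: deps.generate(state.query, state.documents) };
    }
    if (next === "fail_gracefully") {
      return { ...state, answer: "I could not find reliable sources for this question." };
    }
    // transform_query: rewrite the query and count the attempt before looping.
    state = { ...state, query: deps.rewrite(state.query), iterationCount: state.iterationCount + 1 };
  }
}

// Stub dependencies: retrieval only succeeds once the query has been rewritten.
const deps: Deps = {
  retrieve: (q) => (q.includes("annual report") ? ["annual report excerpt"] : []),
  grade: (_q, docs) => docs.length > 0,
  rewrite: (q) => `${q} annual report`,
  generate: (_q, docs) => `Answer based on ${docs.length} document(s)`,
};

const ok = runGraph("revenue", deps);
```

If the counter is incremented once per rewrite, as assumed here, the `> 3` check permits four rewrites before the graph gives up; use `>=` if the cap is meant to be exactly three.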

Tool-Use and Function Calling

To make the agent truly autonomous, we provide it with tools. In a modern stack, this involves Zod for schema validation and the LLM's native tool-calling capabilities. Zod is a TypeScript-first schema declaration and validation library that ensures the LLM provides structured arguments for tool execution.

When the agent decides it needs more info, it emits a tool_call. The graph executor intercepts this, runs the local function (e.g., a database lookup), and injects the result back into the messages state.
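The intercept, run, and inject cycle can be sketched without any framework. In this minimal sketch the `ToolCall` and `ToolMessage` shapes, the `lookupOrder` tool, and `executeToolCall` are all hypothetical, and a hand-written `parse` function stands in for the Zod schema so that the block has no dependencies.

```typescript
interface ToolCall { id: string; name: string; args: unknown; }
interface ToolMessage { role: "tool"; toolCallId: string; content: string; }

// A tool pairs an argument validator with the function to run.
interface Tool<A> {
  parse: (args: unknown) => A; // throws on invalid input, like a schema's parse step
  run: (args: A) => string;
}

// Hypothetical database lookup tool.
const lookupOrder: Tool<{ orderId: string }> = {
  parse: (args) => {
    if (typeof args === "object" && args !== null && "orderId" in args) {
      const orderId = (args as { orderId: unknown }).orderId;
      if (typeof orderId === "string" && orderId.length > 0) return { orderId };
    }
    throw new Error("orderId must be a non-empty string");
  },
  run: ({ orderId }) => `Order ${orderId}: shipped`,
};

const tools: Record<string, Tool<any>> = { lookup_order: lookupOrder };

// Validate the model's arguments, run the tool, and wrap the outcome as a
// message the graph can append to the messages state. Errors are returned as
// content so the model can see them and correct its next call.
function executeToolCall(call: ToolCall): ToolMessage {
  const tool = tools[call.name];
  let content: string;
  if (!tool) {
    content = `Error: unknown tool "${call.name}"`;
  } else {
    try {
      content = tool.run(tool.parse(call.args));
    } catch (err) {
      content = `Error: ${err instanceof Error ? err.message : String(err)}`;
    }
  }
  return { role: "tool", toolCallId: call.id, content };
}

const result = executeToolCall({
  id: "call_1",
  name: "lookup_order",
  args: { orderId: "A-17" },
});
```

Feeding validation errors back as tool messages, rather than throwing, gives the agent a chance to repair malformed arguments on the next turn.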

Performance and Latency Tradeoffs

Agentic RAG is inherently slower than linear RAG due to multiple LLM passes. To mitigate this in production:

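Speculative Execution: Start the generation node and the grader node in parallel. If the grader rejects the documents, cancel the in-flight generation. This trades some wasted tokens for lower latency on the common path where the documents are relevant.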
Small Model Grading: Use 7B or 8B parameter models for grading and query rewriting. They are significantly cheaper and faster for binary classification tasks.
Streaming Intermediate State: Use Server-Sent Events (SSE) to stream the agent's 'thoughts' or current status (e.g., "Searching for more sources...") to the UI to improve perceived latency.
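Two of these mitigations are easy to sketch. The block below shows speculative execution with an `AbortController` and a helper that formats a status update as a Server-Sent Events frame. `speculativeGenerate`, `fakeGenerate`, and `sseEvent` are hypothetical names, the timer-based generator is a stub for a streaming LLM call, and the sketch assumes a runtime with `AbortController` available (modern Node.js or a browser).

```typescript
// Run grading and generation concurrently; abort generation if grading rejects.
// Resolves to the draft answer, or null when the documents were rejected.
async function speculativeGenerate(
  grade: () => Promise<boolean>,
  generate: (signal: AbortSignal) => Promise<string>,
): Promise<string | null> {
  const controller = new AbortController();
  // Capture the outcome as a value so an aborted draft never becomes an
  // unhandled rejection while we are still waiting on the grader.
  const draft = generate(controller.signal).then(
    (text) => ({ ok: true as const, text }),
    (error: unknown) => ({ ok: false as const, error }),
  );
  let relevant: boolean;
  try {
    relevant = await grade();
  } catch (err) {
    controller.abort();
    throw err;
  }
  if (!relevant) {
    controller.abort();
    return null; // caller falls back to the rewrite path
  }
  const outcome = await draft;
  if (!outcome.ok) throw outcome.error;
  return outcome.text;
}

// Stub generator that honours the abort signal.
function fakeGenerate(signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve("draft answer"), 20);
    signal.addEventListener("abort", () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    });
  });
}

// Format one SSE frame: an event name, a single-line JSON payload, and the
// blank line that terminates the frame.
function sseEvent(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

const statusFrame = sseEvent("status", { step: "Searching for more sources..." });
```

Whether speculation pays off depends on how often the grader rejects: every rejection discards the tokens the generator had already produced.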

Evaluation with RAGAS

Testing an agentic system is harder than testing a static one. We use RAGAS to measure metrics like Faithfulness (is the answer derived solely from the retrieved context?) and Answer Relevancy. RAGAS provides a framework for reference-free evaluation of RAG pipelines, helping quantify the performance of the retrieval and generation components separately.

In an agentic loop, we also track Path Efficiency: how many hops did the agent take to reach the correct answer? A high average hop count suggests the initial retrieval or the query rewriting logic needs tuning.
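Path Efficiency can be computed from logged runs. The sketch below assumes each run records its hop count and whether the final answer was judged correct; `RunRecord`, `PathEfficiency`, and `pathEfficiency` are hypothetical names, and averaging hops over correct runs only is one reasonable reading of the metric, not a RAGAS feature.

```typescript
interface RunRecord {
  hops: number;     // retrieval/rewrite cycles the agent took
  correct: boolean; // whether the final answer was judged correct
}

interface PathEfficiency {
  successRate: number;         // fraction of runs that ended correct
  meanHopsWhenCorrect: number; // average hops across correct runs only
}

function pathEfficiency(runs: RunRecord[]): PathEfficiency {
  const correct = runs.filter((r) => r.correct);
  const totalHops = correct.reduce((sum, r) => sum + r.hops, 0);
  return {
    successRate: runs.length === 0 ? 0 : correct.length / runs.length,
    meanHopsWhenCorrect: correct.length === 0 ? 0 : totalHops / correct.length,
  };
}

// Illustrative sample data, not measurements.
const report = pathEfficiency([
  { hops: 1, correct: true },
  { hops: 3, correct: true },
  { hops: 4, correct: false },
  { hops: 2, correct: true },
]);
// report.successRate is 0.75 and report.meanHopsWhenCorrect is 2
```

Tracking the success rate alongside the hop count guards against reading a low hop count as good news when the agent is simply giving up early.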

Conclusion

Moving to a graph-based agentic architecture transforms RAG from a fragile search-and-summarize tool into a resilient reasoning engine. By implementing explicit grading steps and self-correction loops, we can handle the edge cases that typically break LLM applications in production. The future of LLM engineering isn't just better prompts; it's better state management.