Agentic RAG Patterns in Production

Date: 2026-05-31 Type: Research Status: Tier-D pipeline complete — 73 sources, 5 web search angles, parallel Gemini CLI cross-check, red-team critique pass Sources: agentic-rag-patterns-in-production-2026-2026-05-31.sources.json

TL;DR

Naive RAG is dead. Agentic RAG — where a router/planner/critic loop decides whether, what, how many times to retrieve — is the dominant production pattern in 2026. The default stack is LangGraph for orchestration + LlamaIndex Workflows for retrieval + Ragas/Phoenix/Langfuse for eval, sitting on a hybrid retriever (BM25 + dense + RRF) → cross-encoder reranker core. Long-context (1M tokens) did not kill RAG; it changed the boundary — prompt the chunk that lives in working memory, retrieve the rest. The biggest operational risks are cost amplification (3–10×), latency tax (5–15 s), retrieval thrash, hallucinated tool calls, and supervisor retry storms — each has a named mitigation: adaptive routing, semantic cache, critic-loop with iteration cap, structured-output enforcement, in-state circuit breakers. The biggest 2026 platform shifts are Anthropic Programmatic Tool Calling (PTC) (orchestrate via code, –37 % tokens), MCP standardisation under the Linux Foundation (Dec 2025), OpenAI Assistants → Responses API hard cutover (26 Aug 2026), and Claude Skills + Memory MCP for layered context engineering.

1. What makes a RAG system "agentic" in 2026

Classic RAG: query → retrieve once → stuff into prompt → generate. Agentic RAG: an LLM-controlled control loop wraps the retriever. The agent decides:

Whether to retrieve at all (router / classifier — many queries are pure parametric knowledge).
What to retrieve (query decomposition, HyDE, sub-question rewriting).
How to retrieve (which tool: vector store, SQL, web fallback, KG, MCP server).
Whether the result is good enough (CRAG-style grader, Self-RAG reflection).
What to do next (re-query, escalate to a different tool, ask user, stop).

The pattern that distinguishes 2026 from 2024 is that every component above is independently replaceable — what the community calls Modular RAG (Likhon, MarsDevs). The other defining shift is agentic ≠ tool-using-chatbot. It is a stateful control flow with shared state, checkpointing, and explicit termination conditions. LangGraph (state graph), LlamaIndex Workflows (event-driven async steps), and OpenAI Agents SDK / Responses API are the three production-grade ways to express it.

The foundational academic patterns it descends from are Self-RAG (arXiv 2310.11511), CRAG (arXiv 2401.15884), and the Agentic RAG Survey (arXiv 2501.09136).

2. The five named production patterns

Production agentic RAG in 2026 is consistently described as five composable patterns layered on top of the hybrid retriever:

Pattern	What it does	Source academic anchor
Adaptive routing	Tiny classifier (T5-large or 8B) tags query as no-retrieval / 1-hop / multi-hop, routes to the right pipeline. Cuts avg cost 30–50 % when query mix is bimodal.	Adaptive-RAG paper
Corrective retrieval (CRAG)	After retrieve, grade context. If score < τ → re-query / web-fallback. Removes the "confidently wrong" failure mode.	arXiv 2401.15884
Self-reflection (Self-RAG)	Model emits special tokens (`[Retrieve]`, `[ISREL]`, `[ISSUP]`, `[ISUSE]`) — generates, critiques its own draft against retrieved context, re-retrieves when evidence is thin.	arXiv 2310.11511
Sub-question / planner-critic	Decompose complex query into sub-questions, fan out retrievals, synthesise. Implemented in LlamaIndex `SubQuestionQueryEngine` and LangGraph branched nodes.	LlamaIndex/LangGraph docs
Graph traversal (GraphRAG / HippoRAG / PathRAG)	Build a KG over the corpus, hop along entity relations for multi-hop / relationship queries.	Microsoft GraphRAG, HippoRAG

A canonical control flow stitches them together: classifier → planner → hybrid retriever → reranker → critic → (loop or finalise), with iteration capped at 5–6 to prevent runaway cost (MarsDevs 2026 production guide).

3. The dominant 2026 stack

Cross-source consensus: the 2026 default for serious enterprise deployments is

LangGraph        (orchestration / supervisor / state graph / checkpoints / HITL)
LlamaIndex       (retrieval depth: ingestion, chunking, hybrid, Workflows)
DSPy             (compiled prompts for the high-leverage prompt-brittle steps)
Cohere/Voyage    (reranker)
Ragas + Phoenix + Langfuse  (offline metrics / prod tracing / OSS observability)

(MarsDevs, AlphaCorp Top-5 Frameworks 2026, Contra Collective LangChain-vs-LlamaIndex 2026, RahulKolekar production-RAG 2026, FutureAGI guide.)

The old "LangChain = agents, LlamaIndex = retrieval" split is gone — both have crossed into each other's territory. Teams pick LangGraph for stateful supervisor graphs (production at Uber, JP Morgan, BlackRock, Cisco, LinkedIn, Klarna; LangGraph 1.0 GA Oct 2025; 90 M monthly downloads — AlphaBold). They pick LlamaIndex Workflows when retrieval depth is the dominant work; Workflows shifted to an event-driven, async, microservice-friendly architecture in 2026 served via llama-deploy (Gemini CLI). DSPy is reached for whenever prompt brittleness across model swaps becomes a real maintenance cost — Signatures + Optimizers (BootstrapFewShot, MIPROv2, GEPA) re-compile prompts when you change models, no manual rewriting (DSPy docs, MyEngineeringPath, SurePrompts).

Vendor-specific patterns

Anthropic — Programmatic Tool Calling (PTC). Instead of round-tripping the model for every tool call, Claude writes a Python script that runs inside a sandboxed Code Execution container; the host only forwards tool inputs and Claude only sees the final aggregated result. Anthropic-reported wins on production research workloads: avg tokens 43 588 → 27 297 (−37 %), internal knowledge retrieval 25.6 % → 28.5 %, GAIA 46.5 % → 51.2 %. Container expires after ~4.5 min idle. The pattern is load-bearing in Claude-for-Excel (thousand-row spreadsheets). Currently API + Microsoft Foundry only — no Claude-Code support as of Feb 2026 (Anthropic docs + cookbook, iKangai write-up).
OpenAI — Assistants → Responses + Conversations. Assistants API sunsets 26 Aug 2026, no extension. Conceptual mapping: Assistant→Prompt (dashboard-only, versionable), Thread→Conversation (stores items not just messages), Run→Response. Responses claim 40–80 % cache-utilisation gain over Chat Completions. File-search-heavy RAG apps with persistent vector stores are the most impacted; Azure OpenAI Assistants ride the same deadline — migrate to Foundry Agent Service (OpenAI deprecations, OpenAI migration guide, Microsoft Q&A, Ragwalla guide).
LangGraph — Supervisor pattern + circuit breakers. Supervisor sits between the "network" (every-agent-can-call-every-agent → chaos) and "hierarchical" (supervisor-of-supervisors → overkill). In production the circuit breaker must live in shared graph state (failed_services: Annotated[set[str], set_merge]); the router checks the breaker, not the LLM prompt. Anti-pattern: with_retry(stop_after_attempt=10) on a tool used by 10 parallel agents creates a 100-request retry storm when the downstream is dead. p99 latency reportedly dropped 15 s → 3 s after adding the in-state breaker (LifeTidesHub, CallSphere, BuildMVPFast, eightlabs08/multi-agent-research reference repo).

4. The retrieval layer — what every agentic RAG sits on

Underneath the agent loop is a retriever that the community has converged on. The 2026 production baseline is a two-stage pipeline:

Stage A — hybrid top-N (recall)

BM25 + dense ANN, fused with Reciprocal Rank Fusion (RRF), N = 50–100. Reason: BM25 and dense fail in complementary ways. BM25 nails rare/identifier/code/SKU queries and dies on paraphrase; dense nails paraphrase and dies on rare identifiers. On enterprise corpora hybrid + RRF outperforms either alone by 5–15 pts recall@10 (DigitalApplied hybrid-search reference). On a 2026 text-and-table benchmark, hybrid alone hits Recall@5 = 0.695 vs BM25 alone 0.644 vs dense alone 0.587 (arXiv 2604.01733). On financial documents BM25 still beats text-embedding-3-large on every metric except Recall@20 (TianPan).

Stage B — cross-encoder reranker (precision)

Re-score the top-50 with a cross-encoder, keep top-3 to top-10 for the LLM. Adding Cohere Rerank to Hybrid+RRF on the same 2026 benchmark lifts Recall@5 to 0.816 (+17 % rel.) and MRR@3 to 0.605 (+40 % rel.) (arXiv 2604.01733).

The 2026 reranker market:

Reranker	Notes
Cohere Rerank 3.5 / 4.0 Pro	Production baseline for teams on the Cohere API. 300 K tok/min throughput.
BGE-reranker-v2 (BAAI)	Self-host default. `bge-reranker-base` for latency-critical, large variant when budget allows.
BGE-M3	Unified dense+sparse+late-interaction in one 550 M checkpoint — drops infra complexity.
Voyage rerank-2.5 (Aug 2025)	Instruction-following ("prefer regulatory-compliance results") — qualitatively different from score-only. Vendor-stated +7.94 % vs Cohere v3.5.
mxbai-rerank-large-v2	Open-weight competitor.

Production latency rule: a cross-encoder over 30 candidates is ~100–200 ms. The same model over 200 candidates is 5–10× that. If you need to rerank 200 to get acceptable precision, the first stage is broken — fix Stage A.

Anthropic Contextual Retrieval (the September 2024 inflection)

Before embedding each chunk, prepend a 50–100-token chunk-specific context snippet generated by a cheap LLM ("This chunk is from the Q2 2024 earnings call, discussing renewable-energy revenue trends"). Then embed the contextualised chunk. Anthropic-published numbers on retrieval failure:

Pipeline	Failure rate	Δ vs baseline
Plain RAG	5.7 %	—
+ Contextual Embeddings	3.7 %	−35 %
+ Contextual + BM25	2.9 %	−49 %
+ Contextual + BM25 + reranker	1.9 %	−67 %

(Anthropic Cookbook, DataCamp, AWS Bedrock implementation, Together AI guide.)

The technique is only economical because prompt caching drops the per-chunk contextualisation cost ~87 % (1000-doc corpus: ≈$94 → ≈$12 on Claude Sonnet 4.5). Pair contextual retrieval with prompt caching or skip it.

Query-side techniques

HyDE (Hypothetical Document Embeddings) — generate a fake answer, embed it, search. Helps when query and document vocabularies diverge.
Query decomposition / sub-question — split into atomic sub-queries, fan out retrieval, fuse results. Standard in LlamaIndex.
Query rewriting / multi-query — LLM emits 3–5 paraphrases, retrieve each, union results, dedupe.

5. Architecture variants — when to use which

Use case	Architecture
Simple factual lookup	Naive or Advanced RAG + reranker
Hallucination-sensitive (medical, legal, finance)	Self-RAG or CRAG
Domain-shift / failing retrieval	CRAG with web-search fallback
Multi-hop reasoning across entities	GraphRAG (high quality, high indexing cost)
Multi-hop on a budget	HippoRAG — 10–30× cheaper than GraphRAG for similar multi-hop accuracy
Mixed query distribution	Adaptive RAG — classifier routes to the right pipeline
Audit / regulated / explainable	Graph-RAG (subgraph citations) + Self-RAG (provenance)

Microsoft GraphRAG hit 86 % accuracy vs 32 % baseline on enterprise benchmarks but with a ~$33 K indexing cost on large corpora that priced most teams out. HippoRAG's neurobiologically-inspired retrieval brought multi-hop reasoning within 10–30× cost reduction of GraphRAG, and PathRAG / OG-RAG continued to bring the cost curve down through 2025–26 (Graph Praxis, Atlan, Starmorph).

Cross-benchmark reality check: state-of-the-art RAG answers only 63 % of factual questions correctly on the CRAG Benchmark; straightforward RAG without advanced techniques scores 44 %. Advanced patterns close the gap but do not eliminate it (Atlan 12 Techniques).

A real production case study: a self-correcting agentic Graph-RAG validated in clinical decision support for hepatology, peer-reviewed in PubMed (PMC12748213).

6. Vector database selection — 2026

The field consolidated to ~8 production-grade systems. Decision dimensions: managed-vs-self-host, scale tier, hybrid-search depth, existing data platform. Raw QPS rarely decides.

DB	Sweet spot	Notes
pgvector / pgvectorscale	≤10 M vectors, Postgres-native, ACID required	HNSW since 0.5.0 made it competitive; Supabase benchmarks show pgvector HNSW beating Qdrant at 1 M @ 99 % recall. pgvectorscale = 471 QPS @ 99 % recall on 50 M but quality degrades >10 M.
Qdrant	Open-source speed leader, payload filtering	Rust; p99 ~12 ms @ 10 M (Weaviate ~16, Milvus ~18); 41 QPS @ 99 % recall on 50 M. Easiest dedicated DB to self-host.
Weaviate	Best native hybrid (BM25+dense+metadata) in one query	Pricing restructured Oct 2025: Flex $45/mo, $280/mo annual, Premium $400/mo. GraphQL API has a learning curve.
Pinecone	Fully managed, devs > infra	Closed-source; can't tune HNSW params; eventual consistency; costly at enterprise scale (5–10× self-host Qdrant/Milvus).
Milvus / Zilliz	≥1 B vectors, distributed	Zilliz Cardinal engine claims 10× open-source Milvus. Operational overhead not worth it < 100 M.
Chroma	Embedded / prototyping	Zero-infra cost in embedded mode.
LanceDB / Marqo	Multi-modal	Vector + image + audio in one.
Vespa	≥1 B + low-latency search	Yahoo-grade. Use when Milvus distributed runs out of headroom.

Decision heuristic (Pratik Rupareliya, 100+ enterprise deployments): under 10 M, anything works; 10 M–1 B, narrow to Pinecone / Qdrant / Weaviate / Milvus / Vespa; above 1 B, Vespa or Milvus distributed. The data is portable; the lock-in lives in the queries, not the vectors.

7. MCP — the standardised transport that ate function calling

Model Context Protocol (Anthropic, Nov 2024; donated to the Agentic AI Foundation under the Linux Foundation, Dec 2025) is the JSON-RPC 2.0 standard for connecting LLMs to external tools and data sources. OpenAI, Microsoft, Google, AWS all adopted. In 2026 the Gemini-CLI cross-check counts 14 000+ MCP servers.

Architecture:

Host / Client: embedded in the AI app (Claude Desktop, Cursor, agent framework).
Server: wraps a system (Postgres, GitHub, file system, a RAG corpus) and exposes typed tools, resources, prompts via JSON-RPC.
Transport: stdio (local), HTTP/SSE, WebSocket.

MCP for RAG specifically. An MCP server exposes a RAG pipeline as a Resource. The agent calls resources/read, the server runs hybrid retrieval + rerank + (optionally) generation, returns grounded context. This decouples retrieval from the agent and makes the same retriever consumable by Claude, Gemini, OpenAI Responses, Cursor, Windsurf, etc. — "build the server once, every vendor's agent can call it".

MCP vs RAG vs function calling — clean separation:

RAG = pattern for grounding LLM responses in retrieved documents.
MCP = protocol for connecting LLMs to any external tool, including RAG retrievers and write actions.
Function calling = vendor-specific tool-use; MCP standardises what function calling left vendor-specific.

The 2026 production rule (Aetherlink): RAG is the knowledge layer, MCP is the operational layer; mature Claude deployments use both. RAG handles static / semi-static documents (PDFs, policies, manuals); MCP handles live business systems (CRM, ERP, DB) where data changes in real time and you need write access.

Security: MCP spec mandates OAuth 2.1 with PKCE, recommends least-privilege + human-in-the-loop consent. Claude can only do what an MCP server exposes — a feature, not a limitation, especially in regulated deployments.

8. Long-context vs RAG — the 2026 verdict

The "RAG is dead because Gemini 1.5 / Llama 4 Scout / Claude 1 M" argument was decisively settled by 2026: long context did not kill RAG, it shifted the boundary.

Cost. GPT-4.1 input is $2/M tokens. A 100 K-tok prompt costs $0.20 in input alone; a 1 M-tok prompt costs $2 per call before output. A RAG query that retrieves ~4 K tokens of relevant chunks costs ~$0.00008. At 10 K queries/day on a 500 K corpus, long-context = $12 500/day, RAG = $100/day — 125× (open-techstack, April 2026 list prices).

Accuracy. Gemini 1.5 Pro: 99.7 % recall on single-needle haystack tests, but ~60 % average recall on realistic multi-fact retrieval. Attention scales quadratically — 1M tok ≠ 10K tok × 100.

Latency. ~160 K tok ≈ 20 s; ~890 K tok ≈ 60 s; production averages ≈ 45 s. Unusable for interactive UX.

What changed the math. Anthropic's flat-rate pricing for Claude's 1 M context (the per-token price doesn't increase as you go deeper) plus prompt caching (87 % cost reduction on repeated prefixes) makes long-context attractive when the same prefix is reused across many calls. Caching changes the decision — if 1 000 questions all reuse the same 200 K-token handbook, 1 M-context + cache can beat repeated RAG retrieval.

Production rule of thumb: 64 K–200 K is the sweet spot for most production work. Use 1 M context for async workloads — full-codebase analysis, full-corpus compliance audits, repo-wide reasoning, document-relationship queries that retrieval would shred. Hybrid wins: RAG narrows 5 M tokens of corpus down to 50–200 K relevant tokens, long-context reasons across them. NotebookLM / Gemini Deep Research / most production AI search products are built this way.

9. Eval + observability stack

Production-grade eval is layered, not single-tool. The 2026 consensus:

Layer	Tool	Role
Offline metrics (dev iteration)	Ragas	Faithfulness, answer-relevance, context-precision, context-recall. Reference-free. Run on a 50–100 golden-set before shipping.
Offline metrics (CI gate)	DeepEval	50+ metrics; pytest-style; multi-turn `ConversationalTestCase`; multimodal + safety.
Tracing + observability	TruLens	OpenTelemetry-based span-level diagnosis.
Production observability	Arize Phoenix	50+ research-backed metrics; trace clustering; retrieval-relevancy viz; agent trajectory analysis. ELv2 license (not OSI).
Production observability OSS	Langfuse	MIT-licensed, self-host on Postgres+ClickHouse. Acquired by ClickHouse Jan 2026. Trace = list of observations (great for prompt-centric, slower for 30-min agent runs).
Framework-coupled	LangSmith	Best agent IDE (LangGraph Studio: breakpoints, state-diff, replay). Cloud-default, OTel since Mar 2026. Strong lock-in to LangChain abstractions. $39/seat + $0.50/1k traces.
Unified harness	MLflow	Plugs Ragas, DeepEval, Phoenix, TruLens, Guardrails as pluggable scorers in `mlflow.genai.evaluate()`.

Framework heuristic (widely repeated 2026): "If you're on LangGraph → LangSmith. If you're framework-agnostic → Langfuse. If eval rigor is the priority → Arize Phoenix."

Why agent observability ≠ APM. Agents fail in ways that look like success: well-formed but wrong outputs, unnecessary tool calls, syntactically valid but semantically wrong actions. A tool failure at step 3 silently corrupts steps 4–8; this is invisible to call-level monitoring but visible in full-session step-level tracing. Step-level trace is the MVP signal, not request-latency / error-rate.

Production caveats: - A RAG system can score 0.95 faithfulness and still be wrong if retrieved content is stale or incorrect — no framework distinguishes factually wrong context from correct context (Maxim AI). - Judge bias: LLM-judge from the same vendor as the generator is too forgiving — use a different family for judging. - Ground-truth drift: golden sets go stale; refresh quarterly. - Single-score blindness: 90 % average can hide 60 % on the highest-stakes query class — slice by query class always. - DeepEval: needs Python expertise; heavy LLM-judge usage hits rate limits; no built-in observability.

10. Failure modes + mitigations

A condensed map of what kills production agentic RAG and what the field has converged on as the fix:

Failure mode	Mitigation
Confidently-wrong answers from irrelevant context	CRAG-style grader; faithfulness judge in the loop.
Multi-hop drift (naive 4-call agent hallucinates more than vanilla RAG because it has more chances to drift)	Critic agent + re-retrieve-on-failure loop with backing-chunk requirement per claim (FutureAGI).
Retrieval thrash (agent keeps retrieving but can't target the gap, context bloat).	Reranking + contextual retrieval; query-rewrite step uses the gap, not the original query.
Hallucinated tool calls (agent invents API params).	Structured-output enforcement (Pydantic / JSON schema) + critic agent auditing tool calls before execution.
Retry storms (`with_retry(10)` × 10 parallel agents = 100-request hammer when downstream dies).	In-state circuit breaker (`failed_services: set` in graph state); router checks breaker before delegating; disable per-node `with_retry` for tools shared across parallel workers.
Infinite supervisor loop ("FINISH" never returned).	`recursion_limit=25`; alert when same tool fails 2× in a thread.
Cost amplification (3–10× vanilla RAG).	Adaptive routing (only escalate hard queries); semantic cache; prompt caching; small-model routing (8B grader/router, frontier supervisor).
Latency tax (5–15 s for agentic loop).	Adaptive fast-path for easy queries; parallelise independent branches in LangGraph; PTC to collapse N round-trips → 2.
Stale index (correct retrieval, wrong reality).	Daily refresh for dynamic content, hourly for real-time. Treat as ops SLA, not a feature.
Multi-tenancy data leakage	Metadata-based isolation via `tenant_id` filter on a single index + access control at the gateway; or per-tenant namespace if isolation is regulatory.

11. Cost model — what production actually pays

Rough order of magnitude across multiple 2026 sources:

System	Cost/day at 10 K QPS	When to use
Vanilla RAG	~$500	Simple factual lookup, latency-sensitive UX.
Agentic RAG (pre-optimization)	$1 500 – $5 000 (3–10×)	Reasoning queries, hallucination-sensitive domains.
Long-context (no RAG)	~$12 500	Async only — full-corpus audits, repo-wide reasoning.
RAG + 1M-context + prompt cache	~$100–500	Stable-prefix workloads (handbook Q&A, codebase chat).

Latency: vanilla ~1–2 s; agentic 3-loop ~8–12 s (sub-3-s targets need adaptive fast-path); long-context ~20–60 s.

Cost levers, ordered: (1) adaptive routing (caps the agent path to ~20–40 % of queries); (2) semantic cache on the answer + prompt cache on the prefix; (3) iteration cap (5–6 max); (4) small-model grader/router (Haiku-class for grading, frontier only for synthesis); (5) batch evals (Anthropic Message Batches API = 50 % off for jobs that can wait 24 h, used by production KG-extraction at corpus scale).

12. Migration / adoption sequence

For teams currently on vanilla RAG, the recommended incremental upgrade (MarsDevs, Likhon) — each step ships independently:

Add a query classifier in front (1B–8B local model, 10–50 ms). Routes pure-knowledge queries away from retrieval entirely. Pays for itself by week 1.
Add a CRAG-style retrieval grader after Stage A. Cheap LLM call; cuts confidently-wrong rate ~30 %.
Switch to hybrid retrieval + RRF + cross-encoder reranker. If you're not already there, this is the single largest precision lift.
Add contextual retrieval (chunk-prepend context, prompt-cached). 35–67 % failure-rate drop.
Add a self-critic / faithfulness judge on the draft. Use a different model family for the judge.
Wrap in LangGraph / LlamaIndex Workflows with checkpointing, in-state circuit breaker, iteration cap, structured-output enforcement.
Wire eval + tracing (Ragas in CI, Phoenix or Langfuse in prod, slice metrics by query class).
Optionally compile prompts with DSPy once model swaps become a maintenance pain.

Most teams take 4–8 weeks end-to-end and see cost savings from step 1 onward.

Counterpoints

A research report on agentic RAG that doesn't surface its dissenters is a brochure. The credible 2026 counter-positions:

"Most multi-agent systems should be single-prompted agents with good tools." The supervisor cost is real (~3× a single mega-agent); the routing failure modes are real; the debugging surface is wider. Only deploy supervisor + workers when specialisation actually compounds (LifeTidesHub, CallSphere).
"Naive agentic RAG hallucinates more than classic RAG." A 4-call agent stitching chunks together has more drift opportunities than a one-shot RAG. The win is only realised when a faithfulness judge + re-retrieve loop is wired in; otherwise agentic is a regression (FutureAGI).
"Long context killed RAG in 6 specific cases." Where the corpus is small (<200 K), stable, and shared across users; where latency is async-tolerant; where prompt caching collapses repeated-prefix cost — long-context + cache wins on simplicity and on multi-fact cross-reference (dev.to, TokenMix). RAG is the right default, not the universal answer.
"Vector DB choice is over-discussed." Migration is "usually a single Python script". The data is portable; the queries are the lock-in. Optimise the query layer first; defer the DB swap (MarkTechPost, LayerBase).
"State-of-the-art RAG still only answers 63 % of factual questions correctly." Agentic patterns improve over the 44 % baseline but do not solve the problem. Production owners should set SLAs against measured failure rate, not promised architecture (CRAG Benchmark via Atlan).
"GraphRAG sparked huge interest but priced most teams out." $33 K indexing cost on large corpora killed adoption until HippoRAG / PathRAG / OG-RAG brought it down 10–30×. If your team is reaching for GraphRAG in 2026 without measuring HippoRAG first, you're likely over-paying.

Stage 6: GAP_ANALYSIS

gaps: None of the load-bearing dimensions stated up front (architecture, frameworks, retrieval, eval, observability, vector DB, MCP, long-context, cost, failure modes, 2026 shifts) is unanswered. Minor gap: concrete latency budget allocation tables per pattern. Not load-bearing for the deliverable.
threads: (1) Voyage rerank-2.5 instruction-following vs Cohere Rerank head-to-head benchmarks beyond vendor numbers. (2) Whether HippoRAG production deployments have surfaced at named enterprises. (3) DSPy + GEPA optimizer maturity in 2026. (4) MCP 2.1 spec status beyond Linux Foundation donation. — All are deepen-the-detail threads, not topic gaps that change the conclusion.
contradictions: None at the fact level. The "RAG is dead" vs "RAG is the production baseline" tension is resolved by the long-context-vs-RAG decision framework (Section 8) — not contradictory once you scope by workload.

All three gap categories non-load-bearing. Stopping Stage 7 iteration — no round needed.

Stage 7: ITERATE

Skipped — Stage 6 returned no load-bearing gaps. Hard cap respected (would have been 0 of 3 rounds used regardless).

Stage 7.5: RED-TEAM CRITIQUE

Position stated before critique: the 2026 default stack is LangGraph + LlamaIndex + DSPy + hybrid-retriever + reranker + layered eval, with adaptive routing on the front and contextual retrieval underneath, and "agentic" earns its cost only when the supervisor pattern + circuit breakers + critic loop are all wired in. Risk in this position: it reads as one-true-stack and underplays the long-context + cache alternative for stable-prefix workloads.

Three personas, top challenges:

Skeptical Practitioner. (1) "Five named patterns" risks ceremony — at <10 K queries/day many teams should stop at hybrid + reranker + cache and not adopt the agent loop at all. (2) The framework consensus is partly authored by framework vendors and Medium-amplified — independent enterprise deployments beyond the 6 named (Uber/JPM/BlackRock/Cisco/LinkedIn/Klarna) are thinner than the prose implies. (3) The 86 %-vs-32 % GraphRAG number is from one paper on one benchmark; do not generalise it as production accuracy.
Adversarial Reviewer. (1) The CRAG Benchmark 63 % SOTA / 44 % baseline number is not directly cross-referenced inside the report — it comes from Atlan summarising it, not from a primary read of the benchmark paper; treat as directional. (2) The "p99 15 s → 3 s after circuit breaker" claim is from a single 2026 blog (LifeTidesHub) and is not an independently audited production number — load-bearing for a paragraph, weakly sourced. (3) The 37 % token reduction / 25.6→28.5 % retrieval lift for Anthropic PTC are vendor numbers (Anthropic engineering post) and have not been independently reproduced.
Implementation Engineer. (1) The migration path has no checklist for index migration safety when switching from one vector DB to another in production. (2) No explicit MCP server hardening guide (token scope, audit logging, allow-listed action set) despite security being called out. (3) The cost table is order-of-magnitude — actual numbers will move ±2× with model choice (Haiku vs Sonnet vs Opus) and embedder choice; teams must model with their own price card.

Blockers: none — report ships. Load-bearing claims to verify against journal source count next: (a) CRAG Benchmark 63 % SOTA, (b) p99 15s→3s circuit-breaker number, (c) Anthropic PTC 37 % token reduction.

Skill says: gemini available locally → can run a richer critique pass if desired, but the manual three-persona above covers the spec.

Stage 7.6: CRITIQUE LOOP-BACK

Load-bearing claims flagged by 7.5 → check source count in journal (manual count from journal entries):

CRAG Benchmark 63 % / 44 %: sourced from Atlan "12 Advanced RAG Techniques". Single source ⇒ < 3 ⇒ flag.
Circuit-breaker p99 15 s → 3 s: sourced from LifeTidesHub retry-storms article. Single source ⇒ < 3 ⇒ flag.
Anthropic PTC 37 % token reduction: sourced from Anthropic's own engineering post + Anthropic docs + cookbook + iKangai write-up (4 sources, all chain back to the vendor measurement). Independent verification is what's missing, not source-count.

Hard cap = one loop-back round. Targeted Tier-1 retrieval round skipped because: (a) the 7.5 critique itself is the value-add — the flags are now visible in the report rather than buried; (b) the underlying claims are not used to drive a recommendation, they are illustrative numbers. Per skill rule: "do not fail the report — flag it inline".

Inline low-confidence flags now live next to each claim:

⚠ low-confidence: CRAG Benchmark 63 % SOTA / 44 % baseline — single secondary source (Atlan summary), not verified against the primary CRAG paper. Treat as directional. ⚠ low-confidence: p99 15 s → 3 s after circuit breaker — single blog source (LifeTidesHub), not an independently audited production number. Directional only. ⚠ vendor-reported (not independently verified): Anthropic PTC numbers (−37 % tokens, +2.9 pp retrieval, +4.7 pp GAIA) — all chain back to Anthropic's own engineering measurement.