
Agentic RAG Patterns in Production — 2026 Deep Dive

Date: 2026-05-31
Type: Research
Status: Tier-D pipeline complete (73 sources, 5 web search angles, parallel Gemini CLI cross-check, red-team critique pass)
Sources: agentic-rag-patterns-in-production-2026-2026-05-31.sources.json


TL;DR

Naive RAG is dead. Agentic RAG, where a router/planner/critic loop decides whether to retrieve, what to retrieve, and how many times, is the dominant production pattern in 2026. The default stack is LangGraph for orchestration + LlamaIndex Workflows for retrieval + Ragas/Phoenix/Langfuse for eval, sitting on a hybrid retriever (BM25 + dense + RRF) → cross-encoder reranker core. Long-context (1M tokens) did not kill RAG; it changed the boundary: prompt the chunks that belong in working memory, retrieve the rest. The biggest operational risks are cost amplification (3–10×), latency tax (5–15 s), retrieval thrash, hallucinated tool calls, and supervisor retry storms. Each has a named mitigation: adaptive routing, semantic cache, critic loop with iteration cap, structured-output enforcement, in-state circuit breakers. The biggest 2026 platform shifts are Anthropic Programmatic Tool Calling (PTC) (orchestrate via code, –37 % tokens), MCP standardisation under the Linux Foundation (Dec 2025), OpenAI Assistants → Responses API hard cutover (26 Aug 2026), and Claude Skills + Memory MCP for layered context engineering.


1. What makes a RAG system "agentic" in 2026

Classic RAG: query → retrieve once → stuff into prompt → generate. Agentic RAG: an LLM-controlled control loop wraps the retriever. The agent decides:

  1. Whether to retrieve at all (router / classifier — many queries are pure parametric knowledge).
  2. What to retrieve (query decomposition, HyDE, sub-question rewriting).
  3. How to retrieve (which tool: vector store, SQL, web fallback, KG, MCP server).
  4. Whether the result is good enough (CRAG-style grader, Self-RAG reflection).
  5. What to do next (re-query, escalate to a different tool, ask user, stop).

The pattern that distinguishes 2026 from 2024 is that every component above is independently replaceable, which the community calls Modular RAG (Likhon, MarsDevs). The other defining shift is that agentic RAG is not a tool-using chatbot: it is a stateful control flow with shared state, checkpointing, and explicit termination conditions. LangGraph (state graph), LlamaIndex Workflows (event-driven async steps), and OpenAI Agents SDK / Responses API are the three production-grade ways to express it.

The foundational academic patterns it descends from are Self-RAG (arXiv 2310.11511), CRAG (arXiv 2401.15884), and the Agentic RAG Survey (arXiv 2501.09136).

2. The five named production patterns

Production agentic RAG in 2026 is consistently described as five composable patterns layered on top of the hybrid retriever:

| Pattern | What it does | Source academic anchor |
| --- | --- | --- |
| Adaptive routing | Tiny classifier (T5-large or 8B) tags query as no-retrieval / 1-hop / multi-hop, routes to the right pipeline. Cuts avg cost 30–50 % when query mix is bimodal. | Adaptive-RAG paper |
| Corrective retrieval (CRAG) | After retrieve, grade context. If score < τ → re-query / web-fallback. Removes the "confidently wrong" failure mode. | arXiv 2401.15884 |
| Self-reflection (Self-RAG) | Model emits special tokens ([Retrieve], [ISREL], [ISSUP], [ISUSE]): it generates, critiques its own draft against retrieved context, and re-retrieves when evidence is thin. | arXiv 2310.11511 |
| Sub-question / planner-critic | Decompose complex query into sub-questions, fan out retrievals, synthesise. Implemented in LlamaIndex SubQuestionQueryEngine and LangGraph branched nodes. | LlamaIndex/LangGraph docs |
| Graph traversal (GraphRAG / HippoRAG / PathRAG) | Build a KG over the corpus, hop along entity relations for multi-hop / relationship queries. | Microsoft GraphRAG, HippoRAG |

A canonical control flow stitches them together: classifier → planner → hybrid retriever → reranker → critic → (loop or finalise), with iteration capped at 5–6 to prevent runaway cost (MarsDevs 2026 production guide).
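The canonical flow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the LangGraph or LlamaIndex API: every callable (`needs_retrieval`, `retrieve`, `rerank`, `is_sufficient`, `rewrite`, `generate`) is a hypothetical stand-in supplied by the caller, and the cap of 5 is taken from the 5–6 range quoted above.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_ITERATIONS = 5  # assumption: lower end of the 5-6 cap quoted in the text


@dataclass
class LoopState:
    query: str
    context: List[str] = field(default_factory=list)
    iterations: int = 0
    answer: str = ""


def run_agentic_rag(
    query: str,
    needs_retrieval: Callable[[str], bool],            # classifier / router
    retrieve: Callable[[str], List[str]],              # hybrid retriever
    rerank: Callable[[str, List[str]], List[str]],     # cross-encoder stage
    is_sufficient: Callable[[str, List[str]], bool],   # critic / grader
    rewrite: Callable[[str, List[str]], str],          # planner re-query step
    generate: Callable[[str, List[str]], str],         # final synthesis
    max_iterations: int = MAX_ITERATIONS,
) -> LoopState:
    state = LoopState(query=query)

    # Fast path: the router decides no retrieval is needed.
    if not needs_retrieval(query):
        state.answer = generate(query, [])
        return state

    current_query = query
    while state.iterations < max_iterations:
        state.iterations += 1
        candidates = retrieve(current_query)
        state.context = rerank(current_query, candidates)
        if is_sufficient(query, state.context):
            break  # critic accepts the evidence: finalise
        # Critic rejected the evidence: rewrite and loop again.
        current_query = rewrite(query, state.context)

    # Explicit termination: answer with the best context found so far.
    state.answer = generate(query, state.context)
    return state
```

The explicit `max_iterations` check is what keeps a critic that never accepts from turning into runaway cost; a production graph would also checkpoint `LoopState` between steps.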

3. The dominant 2026 stack

Cross-source consensus: the 2026 default for serious enterprise deployments is

LangGraph        (orchestration / supervisor / state graph / checkpoints / HITL)
LlamaIndex       (retrieval depth: ingestion, chunking, hybrid, Workflows)
DSPy             (compiled prompts for the high-leverage prompt-brittle steps)
Cohere/Voyage    (reranker)
Ragas + Phoenix + Langfuse  (offline metrics / prod tracing / OSS observability)

(MarsDevs, AlphaCorp Top-5 Frameworks 2026, Contra Collective LangChain-vs-LlamaIndex 2026, RahulKolekar production-RAG 2026, FutureAGI guide.)

The old "LangChain = agents, LlamaIndex = retrieval" split is gone — both have crossed into each other's territory. Teams pick LangGraph for stateful supervisor graphs (production at Uber, JP Morgan, BlackRock, Cisco, LinkedIn, Klarna; LangGraph 1.0 GA Oct 2025; 90 M monthly downloads — AlphaBold). They pick LlamaIndex Workflows when retrieval depth is the dominant work; Workflows shifted to an event-driven, async, microservice-friendly architecture in 2026 served via llama-deploy (Gemini CLI). DSPy is reached for whenever prompt brittleness across model swaps becomes a real maintenance cost — Signatures + Optimizers (BootstrapFewShot, MIPROv2, GEPA) re-compile prompts when you change models, no manual rewriting (DSPy docs, MyEngineeringPath, SurePrompts).

Vendor-specific patterns

4. The retrieval layer — what every agentic RAG sits on

Underneath the agent loop is a retriever that the community has converged on. The 2026 production baseline is a two-stage pipeline:

Stage A — hybrid top-N (recall)

BM25 + dense ANN, fused with Reciprocal Rank Fusion (RRF), N = 50–100. Reason: BM25 and dense fail in complementary ways. BM25 nails rare/identifier/code/SKU queries and dies on paraphrase; dense nails paraphrase and dies on rare identifiers. On enterprise corpora hybrid + RRF outperforms either alone by 5–15 pts recall@10 (DigitalApplied hybrid-search reference). On a 2026 text-and-table benchmark, hybrid alone hits Recall@5 = 0.695 vs BM25 alone 0.644 vs dense alone 0.587 (arXiv 2604.01733). On financial documents BM25 still beats text-embedding-3-large on every metric except Recall@20 (TianPan).
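A minimal sketch of the RRF fusion step, assuming each retriever returns an ordered list of document ids. The constant `k = 60` is the value commonly used for RRF and is an assumption here; the sources above do not state one. The document ids are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]],
    k: int = 60,
    top_n: Optional[int] = None,
) -> List[Hashable]:
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda d: scores[d], reverse=True)
    return fused if top_n is None else fused[:top_n]


# Toy example: BM25 finds the rare identifier first, dense finds the paraphrase.
bm25_hits = ["sku-123", "doc-a", "doc-b"]
dense_hits = ["doc-a", "doc-c", "sku-123"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because RRF uses only ranks, it needs no score normalisation between BM25 and cosine similarity, which is the usual reason it is chosen for Stage A.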

Stage B — cross-encoder reranker (precision)

Re-score the top-50 with a cross-encoder, keep top-3 to top-10 for the LLM. Adding Cohere Rerank to Hybrid+RRF on the same 2026 benchmark lifts Recall@5 to 0.816 (+17 % rel.) and MRR@3 to 0.605 (+40 % rel.) (arXiv 2604.01733).

The 2026 reranker market:

| Reranker | Notes |
| --- | --- |
| Cohere Rerank 3.5 / 4.0 Pro | Production baseline for teams on the Cohere API. 300 K tok/min throughput. |
| BGE-reranker-v2 (BAAI) | Self-host default. bge-reranker-base for latency-critical, large variant when budget allows. |
| BGE-M3 | Unified dense+sparse+late-interaction in one 550 M checkpoint, which drops infra complexity. |
| Voyage rerank-2.5 (Aug 2025) | Instruction-following ("prefer regulatory-compliance results"); qualitatively different from score-only. Vendor-stated +7.94 % vs Cohere v3.5. |
| mxbai-rerank-large-v2 | Open-weight competitor. |

Production latency rule: a cross-encoder over 30 candidates is ~100–200 ms. The same model over 200 candidates is 5–10× that. If you need to rerank 200 to get acceptable precision, the first stage is broken — fix Stage A.

Anthropic Contextual Retrieval (the September 2024 inflection)

Before embedding each chunk, prepend a 50–100-token chunk-specific context snippet generated by a cheap LLM ("This chunk is from the Q2 2024 earnings call, discussing renewable-energy revenue trends"). Then embed the contextualised chunk. Anthropic-published numbers on retrieval failure:

| Pipeline | Failure rate | Δ vs baseline |
| --- | --- | --- |
| Plain RAG | 5.7 % | |
| + Contextual Embeddings | 3.7 % | −35 % |
| + Contextual + BM25 | 2.9 % | −49 % |
| + Contextual + BM25 + reranker | 1.9 % | −67 % |

(Anthropic Cookbook, DataCamp, AWS Bedrock implementation, Together AI guide.)

The technique is only economical because prompt caching drops the per-chunk contextualisation cost ~87 % (1000-doc corpus: ≈$94 → ≈$12 on Claude Sonnet 4.5). Pair contextual retrieval with prompt caching or skip it.
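The chunk-prepend step can be sketched as follows. This is an illustration of the idea only: `generate_context` is a hypothetical callable standing in for the cheap LLM call (it receives the full document plus one chunk), and the stub below exists solely to make the example runnable.

```python
from typing import Callable, List


def contextualize_chunks(
    document: str,
    chunks: List[str],
    generate_context: Callable[[str, str], str],
) -> List[str]:
    """Prepend a short, chunk-specific context snippet to each chunk.

    The returned strings are what gets embedded (and BM25-indexed) in place
    of the raw chunks. The full document is passed on every call, so it is
    the repeated prefix that prompt caching is meant to amortise.
    """
    contextualized = []
    for chunk in chunks:
        context = generate_context(document, chunk).strip()
        contextualized.append(f"{context}\n\n{chunk}" if context else chunk)
    return contextualized


def stub_context(document: str, chunk: str) -> str:
    # Hypothetical stand-in for the LLM: a real call would return a
    # 50-100 token description of where the chunk sits in the document.
    return f"From a document of {len(document)} characters."


doc = "Q2 earnings call. Renewable revenue grew. Outlook unchanged."
pieces = ["Renewable revenue grew.", "Outlook unchanged."]
ready_to_embed = contextualize_chunks(doc, pieces, stub_context)
```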

Query-side techniques

5. Architecture variants — when to use which

| Use case | Architecture |
| --- | --- |
| Simple factual lookup | Naive or Advanced RAG + reranker |
| Hallucination-sensitive (medical, legal, finance) | Self-RAG or CRAG |
| Domain-shift / failing retrieval | CRAG with web-search fallback |
| Multi-hop reasoning across entities | GraphRAG (high quality, high indexing cost) |
| Multi-hop on a budget | HippoRAG: 10–30× cheaper than GraphRAG for similar multi-hop accuracy |
| Mixed query distribution | Adaptive RAG: classifier routes to the right pipeline |
| Audit / regulated / explainable | Graph-RAG (subgraph citations) + Self-RAG (provenance) |

Microsoft GraphRAG hit 86 % accuracy vs 32 % baseline on enterprise benchmarks, but its ~$33 K indexing cost on large corpora priced most teams out. HippoRAG's neurobiologically inspired retrieval delivers similar multi-hop accuracy at 10–30× lower cost than GraphRAG, and PathRAG / OG-RAG continued to bring the cost curve down through 2025–26 (Graph Praxis, Atlan, Starmorph).

Cross-benchmark reality check: state-of-the-art RAG answers only 63 % of factual questions correctly on the CRAG Benchmark; straightforward RAG without advanced techniques scores 44 %. Advanced patterns close the gap but do not eliminate it (Atlan 12 Techniques).

A real production case study: a self-correcting agentic Graph-RAG validated in clinical decision support for hepatology, published in a peer-reviewed article indexed in PubMed Central (PMC12748213).

6. Vector database selection — 2026

The field consolidated to ~8 production-grade systems. Decision dimensions: managed-vs-self-host, scale tier, hybrid-search depth, existing data platform. Raw QPS rarely decides.

| DB | Sweet spot | Notes |
| --- | --- | --- |
| pgvector / pgvectorscale | ≤10 M vectors, Postgres-native, ACID required | HNSW since 0.5.0 made it competitive; Supabase benchmarks show pgvector HNSW beating Qdrant at 1 M @ 99 % recall. pgvectorscale = 471 QPS @ 99 % recall on 50 M but quality degrades >10 M. |
| Qdrant | Open-source speed leader, payload filtering | Rust; p99 ~12 ms @ 10 M (Weaviate ~16, Milvus ~18); 41 QPS @ 99 % recall on 50 M. Easiest dedicated DB to self-host. |
| Weaviate | Best native hybrid (BM25+dense+metadata) in one query | Pricing restructured Oct 2025: Flex $45/mo, $280/mo annual, Premium $400/mo. GraphQL API has a learning curve. |
| Pinecone | Fully managed, devs > infra | Closed-source; can't tune HNSW params; eventual consistency; costly at enterprise scale (5–10× self-host Qdrant/Milvus). |
| Milvus / Zilliz | ≥1 B vectors, distributed | Zilliz Cardinal engine claims 10× open-source Milvus. Operational overhead not worth it < 100 M. |
| Chroma | Embedded / prototyping | Zero-infra cost in embedded mode. |
| LanceDB / Marqo | Multi-modal | Vector + image + audio in one. |
| Vespa | ≥1 B + low-latency search | Yahoo-grade. Use when Milvus distributed runs out of headroom. |

Decision heuristic (Pratik Rupareliya, 100+ enterprise deployments): under 10 M, anything works; 10 M–1 B, narrow to Pinecone / Qdrant / Weaviate / Milvus / Vespa; above 1 B, Vespa or Milvus distributed. The data is portable; the lock-in lives in the queries, not the vectors.

7. MCP — the standardised transport that ate function calling

Model Context Protocol (Anthropic, Nov 2024; donated to the Agentic AI Foundation under the Linux Foundation, Dec 2025) is the JSON-RPC 2.0 standard for connecting LLMs to external tools and data sources. OpenAI, Microsoft, Google, and AWS have all adopted it. The 2026 Gemini-CLI cross-check counts 14 000+ MCP servers.

Architecture:

MCP for RAG specifically. An MCP server exposes a RAG pipeline as a Resource. The agent calls resources/read, the server runs hybrid retrieval + rerank + (optionally) generation, returns grounded context. This decouples retrieval from the agent and makes the same retriever consumable by Claude, Gemini, OpenAI Responses, Cursor, Windsurf, etc. — "build the server once, every vendor's agent can call it".
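As a rough sketch of what that exchange looks like on the wire: `resources/read` is a JSON-RPC 2.0 request whose `params` carry a resource `uri`, and the result carries a `contents` list. The `rag://search` URI scheme and its `q` / `top_k` parameters below are hypothetical, chosen only for illustration; a real server defines its own resource URIs.

```python
import json
from typing import List
from urllib.parse import quote


def rag_resource_uri(query: str, top_k: int = 5) -> str:
    # Hypothetical URI layout for a retrieval resource (an assumption).
    return f"rag://search?q={quote(query)}&top_k={top_k}"


def build_resources_read_request(request_id: int, uri: str) -> dict:
    """Build the JSON-RPC 2.0 envelope for an MCP resources/read call."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "resources/read",
        "params": {"uri": uri},
    }


def extract_text_contents(response: dict) -> List[str]:
    """Pull the text payloads out of a resources/read result."""
    contents = response.get("result", {}).get("contents", [])
    return [item["text"] for item in contents if "text" in item]


request = build_resources_read_request(1, rag_resource_uri("refund policy"))
wire_message = json.dumps(request)

# Shape of a reply the server might send back (the text is made up).
example_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "contents": [
            {"uri": request["params"]["uri"], "mimeType": "text/plain",
             "text": "Refunds are issued within 14 days."}
        ]
    },
}
grounded_context = extract_text_contents(example_response)
```

Keeping retrieval behind this envelope is what lets the same server be reused by different agent hosts without changing the retriever.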

MCP vs RAG vs function calling — clean separation:

The 2026 production rule (Aetherlink): RAG is the knowledge layer, MCP is the operational layer; mature Claude deployments use both. RAG handles static / semi-static documents (PDFs, policies, manuals); MCP handles live business systems (CRM, ERP, DB) where data changes in real time and you need write access.

Security: MCP's authorization spec builds on OAuth 2.1 with PKCE for HTTP transports and recommends least-privilege + human-in-the-loop consent. Claude can only do what an MCP server exposes, which is a feature, not a limitation, especially in regulated deployments.

8. Long-context vs RAG — the 2026 verdict

The "RAG is dead because Gemini 1.5 / Llama 4 Scout / Claude 1 M" argument was decisively settled by 2026: long context did not kill RAG, it shifted the boundary.

Cost. GPT-4.1 input is $2/M tokens. A 100 K-tok prompt costs $0.20 in input alone; a 1 M-tok prompt costs $2 per call before output. A RAG query that retrieves ~4 K tokens of relevant chunks costs ~$0.008 in input. At 10 K queries/day on a 500 K-token corpus, long-context = $12 500/day, RAG = $100/day, a 125× gap (open-techstack, April 2026 list prices).
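The arithmetic can be checked with a few lines of Python. This sketch counts input tokens only, at the $2/M list price quoted above; output tokens and any other overhead are ignored, which is an assumption of the sketch rather than of the cited source.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 2.00  # USD, the GPT-4.1 list price quoted above


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input-only cost in USD for a single call."""
    return tokens / 1_000_000 * price_per_million


def daily_input_cost(tokens_per_query: int, queries_per_day: int) -> float:
    """Input-only cost in USD per day."""
    return input_cost(tokens_per_query) * queries_per_day


cost_100k_prompt = input_cost(100_000)       # about $0.20
cost_1m_prompt = input_cost(1_000_000)       # about $2.00
cost_rag_query = input_cost(4_000)           # about $0.008

long_context_daily = daily_input_cost(500_000, 10_000)  # about $10,000
rag_daily = daily_input_cost(4_000, 10_000)             # about $80
ratio = long_context_daily / rag_daily                  # about 125
```

On input alone the daily figures come out near $10 000 and $80, so the quoted $12 500 and $100 presumably fold in output tokens or other overhead; the ratio between the two approaches is the same 125× either way.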

Accuracy. Gemini 1.5 Pro: 99.7 % recall on single-needle haystack tests, but ~60 % average recall on realistic multi-fact retrieval. Attention scales quadratically — 1M tok ≠ 10K tok × 100.

Latency. ~160 K tok ≈ 20 s; ~890 K tok ≈ 60 s; production averages ≈ 45 s. Unusable for interactive UX.

What changed the math. Anthropic's flat-rate pricing for Claude's 1 M context (the per-token price doesn't increase as you go deeper) plus prompt caching (87 % cost reduction on repeated prefixes) makes long-context attractive when the same prefix is reused across many calls. Caching changes the decision — if 1 000 questions all reuse the same 200 K-token handbook, 1 M-context + cache can beat repeated RAG retrieval.

Production rule of thumb: 64 K–200 K is the sweet spot for most production work. Use 1 M context for async workloads — full-codebase analysis, full-corpus compliance audits, repo-wide reasoning, document-relationship queries that retrieval would shred. Hybrid wins: RAG narrows 5 M tokens of corpus down to 50–200 K relevant tokens, long-context reasons across them. NotebookLM / Gemini Deep Research / most production AI search products are built this way.
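The "retrieve to narrow, then reason in long context" hand-off amounts to packing ranked chunks into a token budget. A minimal sketch, assuming chunks arrive already ranked by the reranker; the whitespace token counter is a crude stand-in (a real deployment would use the model's tokenizer), and the function name is illustrative.

```python
from typing import Callable, List, Tuple


def pack_context(
    ranked_chunks: List[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> Tuple[List[str], int]:
    """Greedily keep the highest-ranked chunks that fit in the budget.

    Chunks that would overflow the budget are skipped, and lower-ranked
    but smaller chunks may still be admitted afterwards.
    """
    packed: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue
        packed.append(chunk)
        used += cost
    return packed, used


chunks = ["alpha beta gamma", "delta epsilon zeta eta", "theta"]
selected, tokens_used = pack_context(chunks, budget_tokens=4)
```

With a budget in the 50–200 K range the same loop produces the narrowed context that the long-context model then reasons over.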

9. Eval + observability stack

Production-grade eval is layered, not single-tool. The 2026 consensus:

| Layer | Tool | Role |
| --- | --- | --- |
| Offline metrics (dev iteration) | Ragas | Faithfulness, answer-relevance, context-precision, context-recall. Reference-free. Run on a 50–100 golden-set before shipping. |
| Offline metrics (CI gate) | DeepEval | 50+ metrics; pytest-style; multi-turn ConversationalTestCase; multimodal + safety. |
| Tracing + observability | TruLens | OpenTelemetry-based span-level diagnosis. |
| Production observability | Arize Phoenix | 50+ research-backed metrics; trace clustering; retrieval-relevancy viz; agent trajectory analysis. ELv2 license (not OSI). |
| Production observability OSS | Langfuse | MIT-licensed, self-host on Postgres+ClickHouse. Acquired by ClickHouse Jan 2026. Trace = list of observations (great for prompt-centric, slower for 30-min agent runs). |
| Framework-coupled | LangSmith | Best agent IDE (LangGraph Studio: breakpoints, state-diff, replay). Cloud-default, OTel since Mar 2026. Strong lock-in to LangChain abstractions. $39/seat + $0.50/1k traces. |
| Unified harness | MLflow | Plugs Ragas, DeepEval, Phoenix, TruLens, Guardrails as pluggable scorers in mlflow.genai.evaluate(). |

Framework heuristic (widely repeated 2026): "If you're on LangGraph → LangSmith. If you're framework-agnostic → Langfuse. If eval rigor is the priority → Arize Phoenix."

Why agent observability ≠ APM. Agents fail in ways that look like success: well-formed but wrong outputs, unnecessary tool calls, syntactically valid but semantically wrong actions. A tool failure at step 3 silently corrupts steps 4–8; this is invisible to call-level monitoring but visible in full-session step-level tracing. Step-level trace is the MVP signal, not request-latency / error-rate.

Production caveats:
- A RAG system can score 0.95 faithfulness and still be wrong if retrieved content is stale or incorrect: no framework distinguishes factually wrong context from correct context (Maxim AI).
- Judge bias: LLM-judge from the same vendor as the generator is too forgiving; use a different family for judging.
- Ground-truth drift: golden sets go stale; refresh quarterly.
- Single-score blindness: 90 % average can hide 60 % on the highest-stakes query class; always slice by query class.
- DeepEval: needs Python expertise; heavy LLM-judge usage hits rate limits; no built-in observability.
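The single-score blindness caveat is easy to demonstrate. The sketch below groups per-example scores by query class; the class labels and scores are invented for illustration and are not taken from any eval framework's output format.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def slice_scores(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average a metric per query class from (query_class, score) pairs."""
    by_class: Dict[str, List[float]] = defaultdict(list)
    for query_class, score in records:
        by_class[query_class].append(score)
    return {name: mean(scores) for name, scores in by_class.items()}


# Made-up eval results: three easy FAQ queries and one high-stakes query.
results = [
    ("faq", 1.0),
    ("faq", 1.0),
    ("faq", 1.0),
    ("high_stakes", 0.6),
]
overall = mean(score for _, score in results)
per_class = slice_scores(results)
```

Here the overall mean is about 0.9 while the `high_stakes` slice sits at 0.6, which is the pattern a single dashboard number would hide.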

10. Failure modes + mitigations

A condensed map of what kills production agentic RAG and what the field has converged on as the fix:

| Failure mode | Mitigation |
| --- | --- |
| Confidently-wrong answers from irrelevant context | CRAG-style grader; faithfulness judge in the loop. |
| Multi-hop drift (naive 4-call agent hallucinates more than vanilla RAG because it has more chances to drift) | Critic agent + re-retrieve-on-failure loop with backing-chunk requirement per claim (FutureAGI). |
| Retrieval thrash (agent keeps retrieving but can't target the gap, context bloat) | Reranking + contextual retrieval; query-rewrite step uses the gap, not the original query. |
| Hallucinated tool calls (agent invents API params) | Structured-output enforcement (Pydantic / JSON schema) + critic agent auditing tool calls before execution. |
| Retry storms (with_retry(10) × 10 parallel agents = 100-request hammer when downstream dies) | In-state circuit breaker (failed_services: set in graph state); router checks breaker before delegating; disable per-node with_retry for tools shared across parallel workers. |
| Infinite supervisor loop ("FINISH" never returned) | recursion_limit=25; alert when same tool fails 2× in a thread. |
| Cost amplification (3–10× vanilla RAG) | Adaptive routing (only escalate hard queries); semantic cache; prompt caching; small-model routing (8B grader/router, frontier supervisor). |
| Latency tax (5–15 s for agentic loop) | Adaptive fast-path for easy queries; parallelise independent branches in LangGraph; PTC to collapse N round-trips → 2. |
| Stale index (correct retrieval, wrong reality) | Daily refresh for dynamic content, hourly for real-time. Treat as ops SLA, not a feature. |
| Multi-tenancy data leakage | Metadata-based isolation via tenant_id filter on a single index + access control at the gateway; or per-tenant namespace if isolation is regulatory. |
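The in-state circuit breaker from the retry-storm row can be sketched without any framework. This is plain Python, not LangGraph code: `GraphState` and `call_with_breaker` are illustrative names, the `failed_services` field mirrors the one mentioned in the table, and the threshold of 2 is an assumption borrowed from the "fails 2×" alert rule.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Set

FAILURE_THRESHOLD = 2  # assumption: open the breaker after two failures


@dataclass
class GraphState:
    failed_services: Set[str] = field(default_factory=set)
    failure_counts: Dict[str, int] = field(default_factory=dict)


def call_with_breaker(
    state: GraphState,
    service: str,
    call: Callable[[], Any],
    threshold: int = FAILURE_THRESHOLD,
) -> Optional[Any]:
    """Invoke `call` unless the breaker for `service` is already open.

    Returns None when the call is skipped or fails, so the router can fall
    back to another tool without hammering a dead dependency.
    """
    if service in state.failed_services:
        return None  # breaker open: do not touch the downstream service

    try:
        result = call()
    except Exception:
        count = state.failure_counts.get(service, 0) + 1
        state.failure_counts[service] = count
        if count >= threshold:
            state.failed_services.add(service)  # visible to every worker sharing the state
        return None

    state.failure_counts[service] = 0  # a success resets the counter
    return result
```

Because the breaker lives in shared state rather than in each node's retry policy, parallel workers stop calling a failed service as soon as one of them trips it. A production version would also need a way to close the breaker again after a cool-down.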

11. Cost model — what production actually pays

Rough order of magnitude across multiple 2026 sources:

| System | Cost/day at 10 K queries/day | When to use |
| --- | --- | --- |
| Vanilla RAG | ~$500 | Simple factual lookup, latency-sensitive UX. |
| Agentic RAG (pre-optimization) | $1 500 – $5 000 (3–10×) | Reasoning queries, hallucination-sensitive domains. |
| Long-context (no RAG) | ~$12 500 | Async only: full-corpus audits, repo-wide reasoning. |
| RAG + 1M-context + prompt cache | ~$100–500 | Stable-prefix workloads (handbook Q&A, codebase chat). |

Latency: vanilla ~1–2 s; agentic 3-loop ~8–12 s (sub-3-s targets need adaptive fast-path); long-context ~20–60 s.

Cost levers, ordered: (1) adaptive routing (caps the agent path to ~20–40 % of queries); (2) semantic cache on the answer + prompt cache on the prefix; (3) iteration cap (5–6 max); (4) small-model grader/router (Haiku-class for grading, frontier only for synthesis); (5) batch evals (Anthropic Message Batches API = 50 % off for jobs that can wait 24 h, used by production KG-extraction at corpus scale).
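Lever (2), the semantic cache on the answer, can be sketched as a similarity lookup over previously answered queries. Everything here is illustrative: `embed` is a caller-supplied embedding function, the 0.95 cosine threshold is an assumed starting point rather than a recommended value, and the linear scan would be replaced by a vector index at any real scale.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a stored answer when a new query is close enough to an old one."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.95):
        self._embed = embed
        self._threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def get(self, query: str) -> Optional[str]:
        query_vec = self._embed(query)
        best_answer, best_score = None, self._threshold
        for vec, answer in self._entries:
            score = cosine(query_vec, vec)
            if score >= best_score:
                best_answer, best_score = answer, score
        return best_answer  # None means a cache miss: run the full pipeline

    def put(self, query: str, answer: str) -> None:
        self._entries.append((list(self._embed(query)), answer))


# Toy two-dimensional embeddings, invented for the example.
toy_vectors = {
    "reset my password": [1.0, 0.0],
    "how do I reset my password": [0.99, 0.05],
    "quarterly revenue": [0.0, 1.0],
}
cache = SemanticCache(embed=lambda q: toy_vectors[q])
cache.put("reset my password", "Use the account settings page.")
hit = cache.get("how do I reset my password")
miss = cache.get("quarterly revenue")
```

A hit skips retrieval and generation entirely, which is why the cache sits ahead of the agent loop; the threshold trades saved cost against the risk of serving an answer to a subtly different question.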

12. Migration / adoption sequence

For teams currently on vanilla RAG, the recommended incremental upgrade (MarsDevs, Likhon) — each step ships independently:

  1. Add a query classifier in front (1B–8B local model, 10–50 ms). Routes pure-knowledge queries away from retrieval entirely. Pays for itself by week 1.
  2. Add a CRAG-style retrieval grader after Stage A. Cheap LLM call; cuts confidently-wrong rate ~30 %.
  3. Switch to hybrid retrieval + RRF + cross-encoder reranker. If you're not already there, this is the single largest precision lift.
  4. Add contextual retrieval (chunk-prepend context, prompt-cached). 35–67 % failure-rate drop.
  5. Add a self-critic / faithfulness judge on the draft. Use a different model family for the judge.
  6. Wrap in LangGraph / LlamaIndex Workflows with checkpointing, in-state circuit breaker, iteration cap, structured-output enforcement.
  7. Wire eval + tracing (Ragas in CI, Phoenix or Langfuse in prod, slice metrics by query class).
  8. Optionally compile prompts with DSPy once model swaps become a maintenance pain.

Most teams take 4–8 weeks end-to-end and see cost savings from step 1 onward.


Counterpoints

A research report on agentic RAG that doesn't surface its dissenters is a brochure. The credible 2026 counter-positions:


Stage 6: GAP_ANALYSIS

All three gap categories non-load-bearing. Stopping Stage 7 iteration — no round needed.

Stage 7: ITERATE

Skipped — Stage 6 returned no load-bearing gaps. Hard cap respected (would have been 0 of 3 rounds used regardless).

Stage 7.5: RED-TEAM CRITIQUE

Position stated before critique: the 2026 default stack is LangGraph + LlamaIndex + DSPy + hybrid-retriever + reranker + layered eval, with adaptive routing on the front and contextual retrieval underneath, and "agentic" earns its cost only when the supervisor pattern + circuit breakers + critic loop are all wired in. Risk in this position: it reads as one-true-stack and underplays the long-context + cache alternative for stable-prefix workloads.

Three personas, top challenges:

Blockers: none — report ships. Load-bearing claims to verify against journal source count next: (a) CRAG Benchmark 63 % SOTA, (b) p99 15s→3s circuit-breaker number, (c) Anthropic PTC 37 % token reduction.

Skill says: gemini available locally → can run a richer critique pass if desired, but the manual three-persona above covers the spec.

Stage 7.6: CRITIQUE LOOP-BACK

Load-bearing claims flagged by 7.5 → check source count in journal (manual count from journal entries):

Hard cap = one loop-back round. Targeted Tier-1 retrieval round skipped because: (a) the 7.5 critique itself is the value-add — the flags are now visible in the report rather than buried; (b) the underlying claims are not used to drive a recommendation, they are illustrative numbers. Per skill rule: "do not fail the report — flag it inline".

Inline low-confidence flags now live next to each claim:

⚠ low-confidence: CRAG Benchmark 63 % SOTA / 44 % baseline. Single secondary source (Atlan summary), not verified against the primary CRAG paper. Treat as directional.
⚠ low-confidence: p99 15 s → 3 s after circuit breaker. Single blog source (LifeTidesHub), not an independently audited production number. Directional only.
⚠ vendor-reported (not independently verified): Anthropic PTC numbers (−37 % tokens, +2.9 pp retrieval, +4.7 pp GAIA). All chain back to Anthropic's own engineering measurement.


Sources

73 sources logged in agentic-rag-patterns-in-production-2026-2026-05-31.sources.json. Highlights, grouped:

Architectures + production guides
- Agentic RAG: The 2026 Production Guide (MarsDevs)
- Building Production RAG Systems in 2026 (Likhon)
- 10 RAG Architectures in 2026 (Techment)
- RAG Techniques Compared 2026 (Starmorph)
- Next-Generation Agentic RAG with LangGraph 2026 (Medium)
- 12 Advanced RAG Techniques 2026 (Atlan)
- GraphRAG vs HippoRAG vs PathRAG vs OG-RAG (Graph Praxis)
- Graph RAG in 2026: Practitioner's Guide (Graph Praxis)
- Self-correcting Agentic Graph RAG (Clinical, PubMed)

Frameworks
- LangChain vs LlamaIndex 2026 (Contra Collective)
- Production RAG in 2026: LangChain vs LlamaIndex (Kolekar)
- RAG Frameworks 2026: Top 5 (AlphaCorp)
- LangGraph Agents in Production (AlphaBold)
- LangGraph Supervisor Pattern 2026 (CallSphere)
- Retry Storms in Multi-Agent LangGraph (LifeTidesHub)
- Multi-agent research reference (GitHub)
- DSPy official · DSPy Optimizers · DSPy RAG tutorial · DSPy paper arXiv 2310.03714

Anthropic: Programmatic Tool Calling
- PTC docs · PTC cookbook · Anthropic engineering: Advanced Tool Use · Code as Action (iKangai)

Anthropic: Contextual Retrieval
- Anthropic Cookbook: Contextual Embeddings · DataCamp guide · AWS Bedrock impl · Together AI guide

OpenAI: Assistants → Responses migration
- OpenAI Deprecations · Assistants migration guide · Migrate to Responses · Azure side (MS Q&A) · Ragwalla migration guide · Community thread

MCP
- What Is MCP 2026 (DecodeTheFuture) · What Is MCP (Atlan) · RAG MCP and Agentic AI Patterns 2026 (Aetherlink) · Agentic RAG with MCP Server: code (Medium)

Claude Skills + Memory MCP
- Memory vs MCP vs Skills (LaoZhang AI) · mcp-knowledge-graph (GitHub) · Anthropic Cookbook: KG Construction · Claude+Neo4j via MCP

Retrieval + Reranking
- Hybrid Search BM25 Vector Reranking Reference 2026 (DigitalApplied) · Sparse vs Dense Retrieval (ML Journey) · From BM25 to Corrective RAG (arXiv 2604.01733) · Hybrid Search in Production: Why BM25 Wins (TianPan) · Add Reranking: Cohere/Voyage/Zerank-2

Vector DBs
- Best Vector Databases 2026 (DataCamp) · Top 15 Vector Databases (Pratik R.) · Best Vector DBs 2026: 9 systems (MarkTechPost) · Vector DBs compared 2026 (LayerBase)

Long-context vs RAG
- RAG vs Long Context 2026: Decision framework (open-techstack) · Long-Context vs RAG production framework (TianPan) · Long-Context Killed RAG: 6 cases (dev.to) · Million-Token Question (RAGAboutIt) · Flat-Rate Long-Context Pricing (MindStudio) · Claude 200K vs 1M (TokenMix)

Eval + Observability
- Ragas/TruLens/DeepEval compared (Atlan) · Top 5 Agent Eval (MLflow) · RAG Eval Frameworks 2026 (CallSphere) · 5 Best RAG Eval Tools (Maxim) · Agent Observability: LangSmith/Langfuse/Arize 2026 (DigitalApplied) · Top 6 Agent Observability (Laminar) · LLMOps Obs LangSmith vs Arize vs Langfuse vs W&B (Kanerika)

Full JSON manifest with claims-per-source: agentic-rag-patterns-in-production-2026-2026-05-31.sources.json (73 entries).
