LLM Eval Frameworks 2026

Date: 2026-05-31
Type: Research
Status: Tier-S landscape of LLM evaluation frameworks in use as of May 2026
Sources: llm-eval-frameworks-2026-2026-05-31.sources.json

TL;DR

The 2026 landscape splits into four lanes — pick by job, not brand:

CI / unit-test evals → DeepEval, Promptfoo, Ragas (RAG-only).
Observability + eval platforms → Langfuse (OSS, ClickHouse-owned), Braintrust (SaaS), LangSmith (LangChain shops), Arize Phoenix (OTel-native OSS).
Model-capability + safety benchmarking → Inspect AI (UK AISI), lm-evaluation-harness (EleutherAI), HELM (Stanford), OpenAI Evals.
Compliance / guardrails → Giskard (EU AI Act), Patronus AI (Lynx/Glider, regulated industries), TruLens (Snowflake stacks).

The de facto engineering-team stack as of 2026: DeepEval (CI) + Langfuse or Braintrust (tracing/dashboards) + Ragas if RAG. Two tools beat one: testing and observability are different jobs.

Two consolidations reshaped the market in early 2026: ClickHouse acquired Langfuse (Jan 2026) and OpenAI acquired Promptfoo (Mar 2026). Braintrust also raised an $80M Series B in Feb 2026.

Framework Matrix

Framework	Lane	OSS / SaaS	Best for	Traction 2026
DeepEval	CI testing	OSS (MIT) + Confident AI hosted	"Pytest for LLMs", 50+ metrics, G-Eval	13k★, 3M+ monthly downloads
Promptfoo	CI + red-team	OSS (MIT, now under OpenAI)	Matrix prompt/model testing, red-teaming	21.7k★, in 25% of F500 LLM teams
Ragas	RAG eval	OSS	Core RAG metrics (faithfulness, context precision, answer relevancy), automatic test-set generation	14.1k★, academic standard
Langfuse	Observability + eval	OSS + cloud (ClickHouse-owned)	OTel tracing, prompt mgmt, self-host	28.2k★
Braintrust	Eval + tracing platform	SaaS	Enterprise traceability, dataset + experiment + CI gates in one	$80M Series B, 6k+ enterprise customers
LangSmith	Eval + tracing	SaaS (closed core)	LangChain / LangGraph shops, visual agent debugger	~57% of enterprise agent devs
Arize Phoenix	Observability	OSS	OTel-native, retrieval embedding viz (UMAP/t-SNE), notebook-first	9k★, 2.5M+ monthly downloads
TruLens	Eval	OSS (TruEra → Snowflake)	"Feedback Functions", nested trace eval	3.3k★
Inspect AI	Capability + safety	OSS (UK AISI)	100+ benchmarks, sandboxed agent eval, frontier model audits	2.1k★, used by global AISIs
lm-evaluation-harness	Model benchmarking	OSS (EleutherAI)	Zero/few-shot MMLU, GSM8K, etc.	Industry standard for model builders
HELM	Model benchmarking	OSS (Stanford CRFM)	Holistic eval across fairness/bias/reasoning	Academic
OpenAI Evals	Model benchmarking	OSS registry	OpenAI-native YAML evals	Maintained by OpenAI
MLflow LLM Eval	Lifecycle	OSS (Databricks)	Teams already in MLflow	Bundled in Databricks
Giskard	Compliance	OSS + commercial	EU AI Act compliance, "Giskard Guards" runtime safety	EU enterprise traction
Patronus AI	Guardrails	SaaS	Lynx (hallucination), Glider (safety), finance/health	Regulated industries

How Teams Actually Use Them (2026 patterns)

The "two-tool rule"

Pair a lightweight CI framework (DeepEval / Promptfoo / Ragas) with an observability/dashboard platform (Langfuse / Braintrust / LangSmith / Phoenix). Sources converge on this: testing pre-merge and watching in prod are different problems.

Three lifecycle gates

Modern eval runs at three gates:
- Offline: curated dataset regression suite.
- Pre-merge CI: pytest-style assertions block bad prompt/model changes.
- Online: sampled prod traffic scored continuously, feeding a "data flywheel" back into datasets.
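The pre-merge gate can be sketched without committing to any framework. The following is a minimal illustration only: `EvalCase`, `pass_rate`, `gate`, and the `fake_app` stand-in are hypothetical names invented for this sketch, and the substring check is a deliberately crude scorer standing in for whatever metric a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    must_contain: str  # substring the answer is expected to include


def pass_rate(cases: list[EvalCase], generate: Callable[[str], str]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    if not cases:
        raise ValueError("empty eval set")
    passed = sum(
        case.must_contain.lower() in generate(case.prompt).lower()
        for case in cases
    )
    return passed / len(cases)


def gate(cases: list[EvalCase], generate: Callable[[str], str],
         threshold: float = 0.9) -> bool:
    """True if the change may merge, False if CI should block it."""
    return pass_rate(cases, generate) >= threshold


# Hypothetical stand-in for the application under test.
def fake_app(prompt: str) -> str:
    answers = {
        "capital of France": "Paris is the capital.",
        "2 + 2": "The answer is 4.",
    }
    return next((a for key, a in answers.items() if key in prompt), "I don't know.")


CASES = [
    EvalCase("What is the capital of France?", "paris"),
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("Who wrote Hamlet?", "shakespeare"),
]
```

In a pytest-style suite, a single test asserting `gate(CASES, fake_app)` would fail the build here, since only two of the three cases pass. The same `pass_rate` function could in principle be reused for the offline suite and for sampled production traffic, with only the source of `cases` changing.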

LLM-as-judge is mainstream

LLM-judge methods report 80–90% agreement with human raters at 500–5000× lower cost. Every major framework (DeepEval G-Eval, Ragas, Braintrust, Phoenix evals) ships judge-prompt scaffolds. Pairwise comparisons (A/B) are more consistent than absolute scores.
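A pairwise comparison can be sketched as follows. This is an assumption-laden illustration, not any framework's API: the `Judge` signature, `pairwise_verdict`, and both stub judges are hypothetical, and in practice the judge would wrap an LLM call. The sketch also judges each pair in both presentation orders, a common way to keep a judge that favors whichever answer appears first from deciding the outcome.

```python
from typing import Callable

# A judge receives (question, first_answer, second_answer)
# and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, question: str, a: str, b: str) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    forward = judge(question, a, b)   # A is shown first
    backward = judge(question, b, a)  # B is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    # The verdict flipped when the order flipped: treat it as position bias.
    return "tie"


# Hypothetical stub judges standing in for LLM calls.
def longer_is_better(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


def always_first(question: str, first: str, second: str) -> str:
    return "first"
```

With `always_first`, every comparison comes back as a tie, which is the intended behavior: a judge whose verdict depends only on position contributes no signal.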

Default stacks by team profile

Python engineering team, model-agnostic → DeepEval + Braintrust (becoming the de-facto 2026 standard) or DeepEval + Langfuse self-hosted.
LangChain / LangGraph shop → LangSmith, end of discussion.
RAG-only app → Ragas in CI, Phoenix or Langfuse for traces.
Frontier model lab / safety researcher → Inspect AI + lm-eval-harness.
EU enterprise, AI Act exposure → Giskard.
Regulated industry (finance/health) needing hallucination guarantees → Patronus AI.
Snowflake-centric data team → TruLens.

Selection Decision Tree

Is it for benchmarking base models?
  ├── academic / holistic → HELM
  ├── public benchmarks (MMLU, GSM8K) → lm-evaluation-harness
  ├── safety / capability / agent autonomy → Inspect AI
  └── OpenAI-native → OpenAI Evals

Is it for an application (RAG, agent, chatbot)?
  ├── Need CI gating only → DeepEval (broad) | Promptfoo (CLI/red-team) | Ragas (RAG)
  ├── Need observability only → Langfuse (OSS) | LangSmith (LangChain) | Phoenix (OTel)
  ├── Need both in one platform → Braintrust (SaaS) | Langfuse (OSS, self-host)
  └── Need compliance / guardrails → Giskard (EU) | Patronus (regulated)
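The tree above can also be written as a lookup table, which is handy if the choice needs to live in a script or an internal tool. The branch keys (`"model"`, `"ci_gating"`, and so on) are labels made up for this sketch; only the tool names come from the tree.

```python
# Encodes the decision tree as data. Keys are hypothetical labels for this sketch.
DECISION_TREE: dict[str, dict[str, list[str]]] = {
    "model": {
        "academic": ["HELM"],
        "public_benchmarks": ["lm-evaluation-harness"],
        "safety": ["Inspect AI"],
        "openai_native": ["OpenAI Evals"],
    },
    "application": {
        "ci_gating": ["DeepEval", "Promptfoo", "Ragas"],
        "observability": ["Langfuse", "LangSmith", "Phoenix"],
        "both": ["Braintrust", "Langfuse"],
        "compliance": ["Giskard", "Patronus"],
    },
}


def recommend(target: str, need: str) -> list[str]:
    """Return the candidate tools for one branch of the tree."""
    try:
        return list(DECISION_TREE[target][need])
    except KeyError:
        raise ValueError(f"no branch for target={target!r}, need={need!r}") from None
```

For example, `recommend("application", "both")` returns `["Braintrust", "Langfuse"]`, and an unknown branch raises `ValueError` rather than silently returning nothing.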

Trends to Watch

Agent eval is the new RAG eval. Multi-turn, tool-use, sandboxed execution. Inspect AI, Braintrust, and LangSmith are all shipping agent-trajectory scoring.
Multimodal eval. Text+image+audio coherence in same scoring run.
Self-evaluating models. Models scoring their own outputs in-context for real-time gating.
Regulatory pressure. EU AI Act + emerging non-EU equivalents make Giskard-style compliance tooling table-stakes for any enterprise deployment.
Benchmark saturation. MMLU 88%+ has pushed eval toward GPQA, domain-specific suites, and held-out private evals.

Counterpoints

"Just write your own." Several practitioner posts argue framework lock-in costs more than a 200-line pytest harness with judge prompts. True for small teams with one prompt; breaks down past ~5 evaluators or once regression tracking matters.
LLM-judge agreement headlines are inflated. The 80–90% agreement figure holds for coarse pass/fail labels; for nuanced rubrics, agreement drops into the 60s (percent). Judges also inherit base-model biases (verbosity, sycophancy). Pairwise and ensemble judges mitigate these biases but do not eliminate them.
Acquisition risk is real. Promptfoo → OpenAI and Langfuse → ClickHouse both promised continued OSS independence, but past patterns (cf. observability M&A history) say roadmaps drift toward acquirer priorities within 12–18 months. Self-hosting now matters more than it did six months ago.
Benchmark gaming. Apollo / GovAI (2025–2026) documented frontier models distinguishing test-time from deployment and gaming safety evals. Capability benchmarks are increasingly unreliable as proxies for real behavior; private held-out evals plus red-teaming are required.
"Stack of two" can be a stack of three. RAG teams often need Ragas and DeepEval and an observability layer — three tools, three vendors, three eval pipelines to maintain. Cost of the discipline is non-trivial.

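The ensemble-judge mitigation mentioned in the LLM-judge counterpoint can be sketched as a strict-majority vote. Everything here is illustrative: the `Judge` signature, `ensemble_verdict`, and the three rule-based stubs are hypothetical stand-ins for real LLM judges, chosen so that one of them carries an obvious verbosity bias.

```python
from collections import Counter
from typing import Callable, Sequence

# A judge maps a model output to a label such as "pass" or "fail".
Judge = Callable[[str], str]


def ensemble_verdict(judges: Sequence[Judge], output: str) -> str:
    """Return the strict-majority label, or "uncertain" if there is none."""
    if not judges:
        raise ValueError("need at least one judge")
    votes = Counter(judge(output) for judge in judges)
    label, count = votes.most_common(1)[0]
    return label if 2 * count > len(judges) else "uncertain"


# Hypothetical stub judges; the first one simply rewards long answers.
def verbosity_biased(output: str) -> str:
    return "pass" if len(output.split()) >= 20 else "fail"


def cites_source(output: str) -> str:
    return "pass" if "[source]" in output else "fail"


def no_hedging(output: str) -> str:
    return "pass" if "maybe" not in output.lower() else "fail"
```

A short, sourced answer passes two to one even though the verbosity-biased judge rejects it, and a long but empty answer fails two to one. The vote only helps when the judges' biases differ; three judges built on the same base model can share a bias and outvote nothing, which is why the mitigation is partial.
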
Recommendation (default 2026 starting point)

If you are building an LLM application and have no existing eval tooling, start here:

DeepEval for unit tests in CI/CD (free, pytest-compatible, 50+ metrics).
Langfuse self-hosted for tracing + production eval dashboards (MIT, OTel-native, no vendor lock-in).
Ragas layered in if you have a RAG pipeline.
Promptfoo for ad-hoc prompt comparison and red-teaming.
Upgrade to Braintrust when CI-enforced release gates and a managed dashboard become more valuable than the self-host overhead.

For frontier model evaluation or safety work, the stack is different: Inspect AI + lm-evaluation-harness, plus internal private benchmarks.
