
LLM Eval Frameworks 2026

Date: 2026-05-31
Type: Research
Status: Tier-S landscape of LLM evaluation frameworks in use as of May 2026
Sources: llm-eval-frameworks-2026-2026-05-31.sources.json

TL;DR

The 2026 landscape splits into four lanes — pick by job, not brand:

  1. CI / unit-test evals → DeepEval, Promptfoo, Ragas (RAG-only).
  2. Observability + eval platforms → Langfuse (OSS, ClickHouse-owned), Braintrust (SaaS), LangSmith (LangChain shops), Arize Phoenix (OTel-native OSS).
  3. Model-capability + safety benchmarking → Inspect AI (UK AISI), lm-evaluation-harness (EleutherAI), HELM (Stanford), OpenAI Evals.
  4. Compliance / guardrails → Giskard (EU AI Act), Patronus AI (Lynx/Glider, regulated industries), TruLens (Snowflake stacks).

The de facto engineering-team stack as of 2026: DeepEval (CI) + Langfuse or Braintrust (tracing/dashboards) + Ragas if RAG. Two tools beat one, because testing and observability are different jobs.

Two consolidations reshaped the market in early 2026: ClickHouse acquired Langfuse (Jan 2026) and OpenAI acquired Promptfoo (Mar 2026). Separately, Braintrust raised an $80M Series B in Feb 2026.

Framework Matrix

| Framework | Lane | OSS / SaaS | Best for | Traction 2026 |
|---|---|---|---|---|
| DeepEval | CI testing | OSS (MIT) + Confident AI hosted | "Pytest for LLMs", 50+ metrics, G-Eval | 13k★, 3M+ monthly downloads |
| Promptfoo | CI + red-team | OSS (MIT, now under OpenAI) | Matrix prompt/model testing, red-teaming | 21.7k★, in 25% of F500 LLM teams |
| Ragas | RAG eval | OSS | RAG Triad (faithfulness, context precision, answer relevance), auto testset gen | 14.1k★, academic standard |
| Langfuse | Observability + eval | OSS + cloud (ClickHouse-owned) | OTel tracing, prompt mgmt, self-host | 28.2k★ |
| Braintrust | Eval + tracing platform | SaaS | Enterprise traceability, dataset + experiment + CI gates in one | $80M Series B, 6k+ enterprise customers |
| LangSmith | Eval + tracing | SaaS (closed core) | LangChain / LangGraph shops, visual agent debugger | ~57% of enterprise agent devs |
| Arize Phoenix | Observability | OSS | OTel-native, retrieval embedding viz (UMAP/t-SNE), notebook-first | 9k★, 2.5M+ monthly downloads |
| TruLens | Eval | OSS (TruEra → Snowflake) | "Feedback Functions", nested trace eval | 3.3k★ |
| Inspect AI | Capability + safety | OSS (UK AISI) | 100+ benchmarks, sandboxed agent eval, frontier model audits | 2.1k★, used by global AISIs |
| lm-evaluation-harness | Model benchmarking | OSS (EleutherAI) | Zero/few-shot MMLU, GSM8K, etc. | Industry standard for model builders |
| HELM | Model benchmarking | OSS (Stanford CRFM) | Holistic eval across fairness/bias/reasoning | Academic |
| OpenAI Evals | Model benchmarking | OSS registry | OpenAI-native YAML evals | Maintained by OpenAI |
| MLflow LLM Eval | Lifecycle | OSS (Databricks) | Teams already in MLflow | Bundled in Databricks |
| Giskard | Compliance | OSS + commercial | EU AI Act compliance, "Giskard Guards" runtime safety | EU enterprise traction |
| Patronus AI | Guardrails | SaaS | Lynx (hallucination), Glider (safety), finance/health | Regulated industries |

How Teams Actually Use Them (2026 patterns)

The "two-tool rule"

Pair a lightweight CI framework (DeepEval / Promptfoo / Ragas) with an observability/dashboard platform (Langfuse / Braintrust / LangSmith / Phoenix). Sources converge on this: testing pre-merge and watching in prod are different problems.

Three lifecycle gates

Modern eval runs at:

  - Offline: curated dataset regression suite.
  - Pre-merge CI: pytest-style assertions block bad prompt/model changes.
  - Online: sampled prod traffic scored continuously, feeds a "data flywheel" back into datasets.
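The gates above can be sketched without any particular framework. What follows is a minimal illustration, not any tool's API: `fake_model`, `score`, `should_sample`, and `online_review_queue` are hypothetical names, the keyword-match metric stands in for a real metric or judge, and the 0.9 pass-rate threshold is an arbitrary assumption.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    expected_keyword: str  # hypothetical reference: a substring the answer must contain


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (hypothetical, returns canned answers)."""
    canned = {
        "Capital of France?": "The capital of France is Paris.",
        "2 + 2?": "2 + 2 equals 4.",
    }
    return canned.get(prompt, "I don't know.")


def score(example: Example, output: str) -> float:
    """Toy metric: 1.0 if the expected keyword appears in the output, else 0.0."""
    return 1.0 if example.expected_keyword.lower() in output.lower() else 0.0


def pass_rate(dataset, model) -> float:
    """Offline gate: run the curated dataset and return the mean score."""
    scores = [score(ex, model(ex.prompt)) for ex in dataset]
    return sum(scores) / len(scores)


DATASET = [Example("Capital of France?", "Paris"), Example("2 + 2?", "4")]


def test_regression_gate():
    """Pre-merge gate: pytest collects this and fails the build below the threshold."""
    assert pass_rate(DATASET, fake_model) >= 0.9  # threshold is an assumption


def should_sample(trace_id: str, rate: float) -> bool:
    """Online gate: deterministic hash-based sampling of production traces."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def online_review_queue(traces, rate, scorer, threshold=0.5):
    """Collect prompts from sampled, low-scoring traces for later labeling."""
    queue = []
    for trace_id, prompt, output in traces:
        if should_sample(trace_id, rate) and scorer(output) < threshold:
            queue.append(prompt)
    return queue
```

The offline and pre-merge gates share one dataset and one metric; only the trigger differs. The online gate has no reference answer, so `scorer` must be reference-free, and the prompts it queues are candidates for the curated dataset once labeled, which is the flywheel.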

LLM-as-judge is mainstream

LLM-judge methods report 80–90% agreement with human raters at 500–5000× lower cost. Every major framework (DeepEval G-Eval, Ragas, Braintrust, Phoenix evals) ships judge-prompt scaffolds. Pairwise comparisons (A/B) are more consistent than absolute scores.
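To make the two claims above concrete, here is a small sketch of how judge-human agreement might be measured and how a pairwise verdict might be collected. All names are hypothetical and `length_judge` is a toy stand-in for a real LLM call. Asking twice with the answer order swapped is an assumption added here, not something the sources prescribe; it is a common guard against a judge that favors whichever answer it sees first.

```python
from typing import Callable, Sequence

# A judge takes (question, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items where the judge's label matches the human label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def pairwise_verdict(judge: Judge, question: str, a: str, b: str) -> str:
    """Compare answers A and B in both orders; count only consistent verdicts."""
    a_wins_when_first = judge(question, a, b) == "first"
    a_wins_when_second = judge(question, b, a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"  # verdict flipped with position, so treat it as inconclusive


def length_judge(question: str, first: str, second: str) -> str:
    """Toy judge (hypothetical): prefers the longer answer, ties go to the first."""
    return "first" if len(first) >= len(second) else "second"
```

With five labeled items and one disagreement, `agreement_rate` returns 0.8, the low end of the range quoted above. A judge that always answers "first" produces a tie under `pairwise_verdict`, which exposes the position bias instead of counting it as a win.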

Default stacks by team profile

Selection Decision Tree

Is it for benchmarking base models?
  ├── academic / holistic → HELM
  ├── public benchmarks (MMLU, GSM8K) → lm-evaluation-harness
  ├── safety / capability / agent autonomy → Inspect AI
  └── OpenAI-native → OpenAI Evals

Is it for an application (RAG, agent, chatbot)?
  ├── Need CI gating only → DeepEval (broad) | Promptfoo (CLI/red-team) | Ragas (RAG)
  ├── Need observability only → Langfuse (OSS) | LangSmith (LangChain) | Phoenix (OTel)
  ├── Need both in one platform → Braintrust (SaaS) | Langfuse (OSS, self-host)
  └── Need compliance / guardrails → Giskard (EU) | Patronus (regulated)
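The tree above is small enough to encode as a lookup, which can be handy in an internal tooling guide or a scaffolding script. This is only a sketch: the function name `recommend` and the key strings (`"base_model"`, `"ci_gating"`, and so on) are hypothetical labels for the branches, not terms used by any of the tools.

```python
# Branches of the decision tree, keyed by hypothetical short labels.
BENCHMARKING = {
    "academic": ["HELM"],
    "public_benchmarks": ["lm-evaluation-harness"],
    "safety": ["Inspect AI"],
    "openai_native": ["OpenAI Evals"],
}

APPLICATION = {
    "ci_gating": ["DeepEval", "Promptfoo", "Ragas"],
    "observability": ["Langfuse", "LangSmith", "Phoenix"],
    "both": ["Braintrust", "Langfuse"],
    "compliance": ["Giskard", "Patronus"],
}

TREES = {"base_model": BENCHMARKING, "application": APPLICATION}


def recommend(target: str, need: str) -> list[str]:
    """Return the candidate frameworks for one leaf of the decision tree."""
    try:
        return list(TREES[target][need])  # copy so callers cannot mutate the table
    except KeyError:
        raise ValueError(f"unknown branch: target={target!r}, need={need!r}") from None
```

For example, `recommend("application", "both")` returns `["Braintrust", "Langfuse"]`, matching the "both in one platform" branch.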

Trends to Watch

Counterpoints

Recommendation (default 2026 starting point)

If you are building an LLM application and have no existing eval tooling, start here:

  1. DeepEval for unit tests in CI/CD (free, pytest-compatible, 50+ metrics).
  2. Langfuse self-hosted for tracing + production eval dashboards (MIT, OTel-native, no vendor lock-in).
  3. Ragas layered in if you have a RAG pipeline.
  4. Promptfoo for ad-hoc prompt comparison and red-teaming.
  5. Upgrade to Braintrust when CI-enforced release gates and a managed dashboard become more valuable than the self-host overhead.

For frontier model evaluation or safety work, the stack is different: Inspect AI + lm-evaluation-harness, plus internal private benchmarks.
