Date: 2026-05-31
Type: Research
Status: Tier-S landscape of LLM evaluation frameworks in use as of May 2026
Sources: llm-eval-frameworks-2026-2026-05-31.sources.json
The 2026 landscape splits into four lanes — pick by job, not brand:
The de-facto engineering-team stack as of 2026: DeepEval (CI) + Langfuse or Braintrust (tracing/dashboards) + Ragas if RAG. Two tools beat one: testing and observability are different jobs.
Two consolidations reshaped the market in early 2026: OpenAI acquired Promptfoo (Mar 2026) and ClickHouse acquired Langfuse (Jan 2026). Braintrust raised an $80M Series B in Feb 2026.
| Framework | Lane | OSS / SaaS | Best for | Traction 2026 |
|---|---|---|---|---|
| DeepEval | CI testing | OSS (MIT) + Confident AI hosted | "Pytest for LLMs", 50+ metrics, G-Eval | 13k★, 3M+ monthly downloads |
| Promptfoo | CI + red-team | OSS (MIT, now under OpenAI) | Matrix prompt/model testing, red-teaming | 21.7k★, in 25% of F500 LLM teams |
| Ragas | RAG eval | OSS | RAG Triad (faithfulness, context precision, answer relevance), auto testset gen | 14.1k★, academic standard |
| Langfuse | Observability + eval | OSS + cloud (ClickHouse-owned) | OTel tracing, prompt mgmt, self-host | 28.2k★ |
| Braintrust | Eval + tracing platform | SaaS | Enterprise traceability, dataset + experiment + CI gates in one | $80M Series B, 6k+ enterprise customers |
| LangSmith | Eval + tracing | SaaS (closed core) | LangChain / LangGraph shops, visual agent debugger | ~57% of enterprise agent devs |
| Arize Phoenix | Observability | OSS | OTel-native, retrieval embedding viz (UMAP/t-SNE), notebook-first | 9k★, 2.5M+ monthly downloads |
| TruLens | Eval | OSS (TruEra → Snowflake) | "Feedback Functions", nested trace eval | 3.3k★ |
| Inspect AI | Capability + safety | OSS (UK AISI) | 100+ benchmarks, sandboxed agent eval, frontier model audits | 2.1k★, used by global AISIs |
| lm-evaluation-harness | Model benchmarking | OSS (EleutherAI) | Zero/few-shot MMLU, GSM8K, etc. | Industry standard for model builders |
| HELM | Model benchmarking | OSS (Stanford CRFM) | Holistic eval across fairness/bias/reasoning | Academic |
| OpenAI Evals | Model benchmarking | OSS registry | OpenAI-native YAML evals | Maintained by OpenAI |
| MLflow LLM Eval | Lifecycle | OSS (Databricks) | Teams already in MLflow | Bundled in Databricks |
| Giskard | Compliance | OSS + commercial | EU AI Act compliance, "Giskard Guards" runtime safety | EU enterprise traction |
| Patronus AI | Guardrails | SaaS | Lynx (hallucination), Glider (safety), finance/health | Regulated industries |
The recurring pattern pairs a lightweight CI framework (DeepEval / Promptfoo / Ragas) with an observability/dashboard platform (Langfuse / Braintrust / LangSmith / Phoenix). Sources converge on this: testing pre-merge and watching in prod are different problems.
Modern eval runs at:

- Offline: curated dataset regression suite.
- Pre-merge CI: pytest-style assertions block bad prompt/model changes.
- Online: sampled prod traffic scored continuously, feeds a "data flywheel" back into datasets.
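The three layers can be sketched without any particular framework. The following is a minimal, library-free illustration, not how DeepEval, Promptfoo, or any other tool listed here implements them. Every name in it (`DATASET`, `fake_model`, `pass_rate`, `ci_gate`, `should_sample`) is hypothetical, the substring check stands in for a real metric, and the 0.9 threshold and 10% sample rate are arbitrary assumptions.

```python
import hashlib

# Hypothetical regression dataset: (prompt, text the answer must contain).
DATASET = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: swap in your own client)."""
    answers = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return answers.get(prompt, "I don't know.")


def pass_rate(model, dataset) -> float:
    """Offline layer: fraction of rows whose output contains the expected text."""
    passed = sum(1 for prompt, expected in dataset if expected in model(prompt))
    return passed / len(dataset)


def ci_gate(model, dataset, threshold: float = 0.9) -> None:
    """Pre-merge layer: fail the build when the pass rate drops below threshold."""
    rate = pass_rate(model, dataset)
    assert rate >= threshold, f"pass rate {rate:.2f} below threshold {threshold:.2f}"


def should_sample(trace_id: str, sample_rate: float = 0.1) -> bool:
    """Online layer: deterministic hash-based sampling of production traces."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    return bucket < sample_rate
```

Hashing the trace ID rather than calling a random generator keeps the sampling decision reproducible: the same trace is always in or out of the scored sample, which makes it easier to promote scored traces into the offline dataset later.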
LLM-judge methods report 80–90% agreement with human raters at 500–5000× lower cost. Every major framework (DeepEval G-Eval, Ragas, Braintrust, Phoenix evals) ships judge-prompt scaffolds. Pairwise comparisons (A/B) are more consistent than absolute scores.
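An agreement figure like the one above is usually checked on a small human-labeled sample before a judge is trusted. The sketch below shows one way to compute raw agreement and Cohen's kappa, and to run a pairwise comparison in both orders. It is an illustration under assumptions: the function names and labels are hypothetical, `judge_fn` stands in for a real judge-model call, and the order-swapping step is a common mitigation for position bias rather than something the sources prescribe.

```python
from collections import Counter


def percent_agreement(judge, human) -> float:
    """Fraction of items where the judge label equals the human label."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge, human) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge)
    observed = percent_agreement(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def pairwise_verdict(judge_fn, prompt: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with the answers swapped; keep only consistent wins.

    judge_fn(prompt, first, second) is assumed to return "first" or "second".
    """
    forward = judge_fn(prompt, answer_a, answer_b)
    backward = judge_fn(prompt, answer_b, answer_a)
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # the judge contradicted itself when the order changed
```

Raw agreement can look high when one label dominates, so reporting kappa alongside it gives a more honest picture of how far the judge tracks human raters.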
Is it for benchmarking base models?
├── academic / holistic → HELM
├── public benchmarks (MMLU, GSM8K) → lm-evaluation-harness
├── safety / capability / agent autonomy → Inspect AI
└── OpenAI-native → OpenAI Evals
Is it for an application (RAG, agent, chatbot)?
├── Need CI gating only → DeepEval (broad) | Promptfoo (CLI/red-team) | Ragas (RAG)
├── Need observability only → Langfuse (OSS) | LangSmith (LangChain) | Phoenix (OTel)
├── Need both in one platform → Braintrust (SaaS) | Langfuse (OSS, self-host)
└── Need compliance / guardrails → Giskard (EU) | Patronus (regulated)
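For teams that want to encode the tree above in tooling (for example, an internal onboarding script), it reduces to a lookup table. This is only a sketch: the key strings such as `"base_model"` and `"ci_gating"` and the `recommend` function are hypothetical labels chosen for illustration, and the recommendations are copied from the tree as written.

```python
# Keys are (target, need) pairs; both labels are made up for this sketch.
RECOMMENDATIONS = {
    ("base_model", "academic"): ["HELM"],
    ("base_model", "public_benchmarks"): ["lm-evaluation-harness"],
    ("base_model", "safety"): ["Inspect AI"],
    ("base_model", "openai_native"): ["OpenAI Evals"],
    ("application", "ci_gating"): ["DeepEval", "Promptfoo", "Ragas"],
    ("application", "observability"): ["Langfuse", "LangSmith", "Phoenix"],
    ("application", "both"): ["Braintrust", "Langfuse"],
    ("application", "compliance"): ["Giskard", "Patronus"],
}


def recommend(target: str, need: str) -> list[str]:
    """Return the candidate frameworks for a (target, need) pair from the tree."""
    try:
        return list(RECOMMENDATIONS[(target, need)])
    except KeyError:
        raise ValueError(f"no branch for target={target!r}, need={need!r}") from None


print(recommend("application", "ci_gating"))
```

Raising on an unknown pair, rather than returning an empty list, makes a missing branch visible when the tree is extended.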
If you are building an LLM application and have no existing eval tooling, start here:
For frontier model evaluation or safety work, the stack is different: Inspect AI + lm-evaluation-harness, plus internal private benchmarks.