Evaluation & Testing

Evals (LLM Evaluation)

Structured evaluation frameworks that measure LLM and agent output quality, accuracy, safety, and consistency across a defined test set, enabling systematic comparison across model versions and prompt changes.

Definition

Evals (short for evaluations) are the systematic testing methodology for LLM-powered systems. They provide quantitative answers to the question "is this system good enough?"—measuring accuracy, consistency, safety, and task-specific performance against a defined ground truth. Unlike traditional software tests that check binary pass/fail conditions, evals work with probabilistic outputs using scoring rubrics, similarity thresholds, and LLM-as-judge patterns to handle the inherent variability in generated text.
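The scoring approaches mentioned above can be sketched in a few lines. This is a minimal illustration, not a production scorer: it combines an exact-match fast path with a similarity threshold (using Python's standard-library `difflib` as a stand-in for more robust metrics like ROUGE or an LLM-as-judge call); the function name and the 0.8 threshold are illustrative choices, not a standard.

```python
from difflib import SequenceMatcher

def score_output(expected: str, actual: str, threshold: float = 0.8) -> bool:
    """Score a generated output against a golden answer.

    Exact match passes immediately; otherwise fall back to a
    similarity ratio to tolerate benign variation in phrasing.
    """
    if expected.strip() == actual.strip():
        return True
    ratio = SequenceMatcher(None, expected.lower(), actual.lower()).ratio()
    return ratio >= threshold

# Case differences survive the similarity fallback:
print(score_output("The answer is 42", "the answer is 42"))  # True
```

In practice, string similarity is only a first-pass signal; for open-ended generation tasks it is usually paired with an LLM-as-judge rubric, since two semantically equivalent answers can have low surface similarity.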

Engineering Context

Evals are the AI equivalent of a unit and regression test suite. A production eval suite includes: a golden set of 50-200 representative inputs with verified correct outputs, automated scoring metrics (exact match, ROUGE, LLM-as-judge), and threshold gates wired into CI/CD. Run evals on every prompt template or model change, and block deployment if accuracy drops by more than 2% or the hallucination rate exceeds a set threshold. Common frameworks: OpenAI Evals, LangSmith, Braintrust, Ragas (for RAG pipelines). The most important eval design decision is choosing scoring metrics that correlate with real-world task success—a common failure mode is optimizing metrics that don't reflect actual quality.
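The deployment-gate logic described above can be sketched as follows. This is a hedged illustration under stated assumptions: `run_golden_set`, the model-callable signature, and the 5% hallucination ceiling are hypothetical, and the 2% regression threshold comes from the text.

```python
from typing import Callable, Iterable, Tuple

def run_golden_set(model_fn: Callable[[str], str],
                   golden_set: Iterable[Tuple[str, str]]) -> float:
    """Return accuracy of model_fn over (input, expected_output) pairs.

    Uses exact match for simplicity; a real suite would plug in
    ROUGE or an LLM-as-judge scorer here.
    """
    cases = list(golden_set)
    passed = sum(1 for inp, expected in cases
                 if model_fn(inp).strip() == expected.strip())
    return passed / len(cases)

def should_block_deployment(baseline_accuracy: float,
                            candidate_accuracy: float,
                            hallucination_rate: float,
                            max_regression: float = 0.02,
                            max_hallucination: float = 0.05) -> bool:
    """CI gate: block if accuracy regressed > 2% or hallucinations exceed cap."""
    regressed = (baseline_accuracy - candidate_accuracy) > max_regression
    hallucinating = hallucination_rate > max_hallucination
    return regressed or hallucinating

# Example: a 3-point regression against baseline trips the gate.
print(should_block_deployment(0.90, 0.87, 0.01))  # True
```

The gate runs after each prompt or model change: compute the candidate's accuracy over the golden set, compare it to the stored baseline, and fail the pipeline if either threshold is violated.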

