"How do you test something that's inherently non-deterministic?" It's the question every AI engineering team faces. The answer requires rethinking what "passing a test" means—shifting from exact output matching to statistical assertions and behavior contracts.
The Testing Pyramid for AI Systems
The standard testing pyramid (unit → integration → E2E) maps onto AI systems, but with different layer definitions:
- Unit tests — Test deterministic components: schema validators, guardrails logic, retrieval indexing, tool execution
- Eval tests — LLM-in-the-loop tests against a golden set of inputs and expected outputs (not exact matches—semantic similarity or LLM-as-judge)
- Integration tests — Full agent runs against realistic scenarios with pass/fail criteria based on behavior, not output text
- Shadow testing — Run new versions in parallel with production, comparing outputs before switching
Golden Set Evaluation
Build and maintain a curated golden set: 50-200 real examples with verified correct answers. For each prompt update or model change, run the full golden set and track:
The last two metrics show a trade-off: the new prompt is more accurate but slower and more expensive. Evaluation surfaces these trade-offs explicitly.
LLM-as-Judge Evaluation
For outputs that can't be verified by exact matching (summaries, explanations, analyses), use a separate LLM as an evaluator. Define explicit rubrics:
- Factual accuracy — Does the output contradict source documents? (1-5 score)
- Completeness — Are all required elements present? (checklist)
- Tone compliance — Does it match the required tone/format? (binary)
- Hallucination flag — Does it assert facts not present in the context? (binary)
Adversarial Testing
Test your guardrails specifically by throwing adversarial inputs at your system:
- Prompt injection attempts ("Ignore previous instructions and...")
- Edge case inputs (empty strings, very long inputs, non-standard characters)
- Boundary conditions (inputs just at the confidence threshold)
- Semantically valid but operationally harmful requests
CI/CD Integration
Run eval tests in your CI pipeline on every prompt template change. Set hard gates: if accuracy on the golden set drops more than 2%, or hallucination rate exceeds 3%, block the deployment. Make AI quality a first-class citizen in your deployment process.
Ship AI with confidence.
We implement eval frameworks and testing infrastructure that give engineering teams confidence in AI system changes before they reach production.
Start Assessment