Testing AI Agents: Strategies for Non-Deterministic Systems

"How do you test something that's inherently non-deterministic?" It's the question every AI engineering team faces. The answer requires rethinking what "passing a test" means—shifting from exact output matching to statistical assertions and behavior contracts.

The Testing Pyramid for AI Systems

The standard testing pyramid (unit → integration → E2E) maps onto AI systems, but with different layer definitions:

Unit tests — Test deterministic components: schema validators, guardrails logic, retrieval indexing, tool execution
Eval tests — LLM-in-the-loop tests against a golden set of inputs and expected outputs (not exact matches—semantic similarity or LLM-as-judge)
Integration tests — Full agent runs against realistic scenarios with pass/fail criteria based on behavior, not output text
Shadow testing — Run new versions in parallel with production, comparing outputs before switching

Golden Set Evaluation

Build and maintain a curated golden set: 50-200 real examples with verified correct answers. For each prompt update or model change, run the full golden set and track:

# Golden set evaluation metrics

Accuracy: 87.3% (was 85.1%) ↑ 2.2%

Hallucination: 2.1% (was 3.4%) ↓ 1.3%

Format errors: 0.8% (was 0.8%) → 0%

Latency P95: 1.4s (was 1.2s) ↑ 0.2s

Cost/request: $0.018 (was $0.012) ↑ 50%

The last two metrics show a trade-off: the new prompt is more accurate but slower and more expensive. Evaluation surfaces these trade-offs explicitly.

LLM-as-Judge Evaluation

For outputs that can't be verified by exact matching (summaries, explanations, analyses), use a separate LLM as an evaluator. Define explicit rubrics:

Factual accuracy — Does the output contradict source documents? (1-5 score)
Completeness — Are all required elements present? (checklist)
Tone compliance — Does it match the required tone/format? (binary)
Hallucination flag — Does it assert facts not present in the context? (binary)

Adversarial Testing

Test your guardrails specifically by throwing adversarial inputs at your system:

Prompt injection attempts ("Ignore previous instructions and...")
Edge case inputs (empty strings, very long inputs, non-standard characters)
Boundary conditions (inputs just at the confidence threshold)
Semantically valid but operationally harmful requests

CI/CD Integration

Run eval tests in your CI pipeline on every prompt template change. Set hard gates: if accuracy on the golden set drops more than 2%, or hallucination rate exceeds 3%, block the deployment. Make AI quality a first-class citizen in your deployment process.

Ship AI with confidence.

We implement eval frameworks and testing infrastructure that give engineering teams confidence in AI system changes before they reach production.

Start Assessment