LLMs are inherently probabilistic. Given the same input, they can produce different outputs. For research and creative applications, this is a feature. For production engineering systems, it's a bug. Here's how we achieve consistent, reproducible outputs in our agent deployments.
The Determinism Spectrum
First, let's be precise about what "deterministic" means in LLM contexts:
- Level 1: Exact reproducibility. The same prompt always produces exactly the same tokens. Achievable only with temperature=0 and fixed seeds (not all APIs support this).
- Level 2: Semantic equivalence. Different words, same meaning. "The risk is HIGH" vs. "This poses a HIGH risk." Our usual target.
- Level 3: Structural consistency. Output always follows the same schema, even if content varies. JSON with fixed keys. Minimum viable consistency.
For most production use cases, we target Level 2 (semantic equivalence) with Level 3 (structural consistency) as a hard requirement.
Technique 1: Structured Output Formats
The single most impactful change: always request structured output. Free-form text is impossible to parse reliably.
Modern APIs support JSON mode natively. Use it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=[{
        "role": "system",
        "content": """You analyze contracts for risk.
Return JSON with this exact schema:
{
  "risks": [{"clause": str, "type": str, "severity": "HIGH"|"MEDIUM"|"LOW"}],
  "summary": str,
  "confidence": float
}"""
    }, {
        "role": "user",
        "content": contract_text
    }]
)
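JSON mode guarantees syntactically valid JSON, but not that the keys match your schema, so the returned content still needs to be parsed and checked. A minimal sketch using the standard response shape:

import json

# The model's JSON arrives as a string in the message content.
report = json.loads(response.choices[0].message.content)

# Cheap sanity check before anything downstream touches the result.
assert {"risks", "summary", "confidence"} <= report.keys()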
Technique 2: Constrained Vocabularies
Don't let the model invent terminology. Define exactly what values are acceptable.
CLASSIFICATION_PROMPT = """
Classify this log entry.
ALLOWED VALUES (use EXACTLY these):
- category: "error" | "warning" | "info" | "debug"
- service: "api" | "database" | "cache" | "queue" | "other"
- action_required: true | false
Log entry: {log}
Return JSON with keys: category, service, action_required
"""
Then validate the output programmatically:
from pydantic import BaseModel, validator
from typing import Literal

class LogClassification(BaseModel):
    category: Literal["error", "warning", "info", "debug"]
    service: Literal["api", "database", "cache", "queue", "other"]
    action_required: bool

    @validator("category", "service", pre=True)
    def lowercase(cls, v):
        # Normalize casing before the Literal check runs
        return v.lower() if isinstance(v, str) else v

# Parse and validate
result = LogClassification.parse_raw(llm_response)
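The pre-validator makes parsing tolerant of casing drift while still rejecting values outside the allowed set; a bad value raises pydantic.ValidationError, which Technique 6 turns into a retry. An illustrative input:

# "ERROR" is lowercased by the pre-validator before the Literal check,
# so this parses cleanly instead of failing on a cosmetic difference.
result = LogClassification.parse_raw(
    '{"category": "ERROR", "service": "api", "action_required": false}'
)
assert result.category == "error"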
Technique 3: Temperature and Sampling
For deterministic outputs, set temperature=0, which makes decoding greedy: the highest-probability token is selected at each step. Where the API supports it, pair this with a fixed seed (Level 1 above); even then, providers only promise best-effort reproducibility, so treat it as reducing variance rather than eliminating it.
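A sketch of the relevant request parameters (messages stands for a message list like the one in Technique 1; the seed parameter and system_fingerprint field are part of the OpenAI Chat Completions API):

# temperature=0 makes sampling greedy; a fixed seed requests best-effort
# reproducibility across calls, not a bit-exact guarantee.
response = client.chat.completions.create(
    model="gpt-4-turbo",
    temperature=0,
    seed=42,
    response_format={"type": "json_object"},
    messages=messages,
)

# If system_fingerprint changes between calls, the backend configuration
# changed and identical seeds may still produce different outputs.
print(response.system_fingerprint)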
Technique 4: Chain-of-Thought with Validation
For complex reasoning, use chain-of-thought prompting but validate each step:
ANALYSIS_PROMPT = """
Analyze this incident in steps. After each step, state your confidence.
STEP 1: Identify the affected services
- List services mentioned in the logs
- Confidence: HIGH/MEDIUM/LOW
STEP 2: Determine the timeline
- First error timestamp
- Last error timestamp
- Confidence: HIGH/MEDIUM/LOW
STEP 3: Hypothesize root cause
- Primary hypothesis with evidence
- Alternative hypothesis if confidence < HIGH
- Confidence: HIGH/MEDIUM/LOW
STEP 4: Recommend action
- If all steps are HIGH confidence: recommend specific fix
- If any step is LOW confidence: recommend investigation steps
Return JSON with steps array, each containing: step_name, findings, confidence
"""
This makes the reasoning explicit and auditable. When confidence is low, the system routes to human review instead of guessing.
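A sketch of the routing logic this implies, assuming the steps array requested above (the action names are illustrative):

import json

def route_analysis(llm_response: str) -> dict:
    """Act automatically only when every step reports HIGH confidence."""
    analysis = json.loads(llm_response)
    if all(step["confidence"] == "HIGH" for step in analysis["steps"]):
        return {"action": "auto_remediate", "analysis": analysis}
    # Any MEDIUM or LOW step means the chain of reasoning is not solid
    # enough to act on; hand it to a human with the steps attached.
    return {"action": "human_review", "analysis": analysis}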
Technique 5: Few-Shot Examples
Show the model exactly what you want with examples that cover edge cases:
CLASSIFICATION_PROMPT = """
Classify contract clauses by risk level.
EXAMPLES:
Input: "Payment due within 30 days of invoice"
Output: {"risk": "LOW", "reason": "Standard payment terms"}
Input: "Vendor shall provide reasonable support"
Output: {"risk": "HIGH", "reason": "Ambiguous - 'reasonable' undefined"}
Input: "Client may terminate with 90 days notice"
Output: {"risk": "MEDIUM", "reason": "Long notice period but standard"}
Input: "All liability is unlimited"
Output: {"risk": "CRITICAL", "reason": "Uncapped financial exposure"}
Now classify:
Input: {clause}
Output:
"""
Technique 6: Output Validation and Retry
Never trust LLM output blindly. Always validate and retry on failure:
import json
from pydantic import ValidationError
from tenacity import retry, stop_after_attempt, retry_if_exception_type

class OutputValidationError(Exception):
    pass

@retry(
    stop=stop_after_attempt(3),
    retry=retry_if_exception_type(OutputValidationError)
)
def get_validated_response(prompt: str, schema: type) -> dict:
    response = call_llm(prompt, temperature=0)
    try:
        parsed = json.loads(response)
        validated = schema.parse_obj(parsed)
        return validated.dict()
    except (json.JSONDecodeError, ValidationError) as e:
        # Log the failure for monitoring
        log_validation_failure(prompt, response, e)
        raise OutputValidationError(f"Invalid output: {e}")

# Usage
result = get_validated_response(
    prompt=RISK_ANALYSIS_PROMPT.format(document=doc),
    schema=RiskAnalysisOutput
)
Putting It All Together
Our production prompt templates combine all of these techniques: a fixed JSON schema, constrained vocabularies, few-shot examples, explicit per-step confidence, temperature=0, and validated, retried output.
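A condensed sketch of how the pieces fit together (the field layout, prompt text, and analyze_contract wrapper are illustrative; get_validated_response is the helper from Technique 6, and the few-shot examples are elided):

from typing import List, Literal
from pydantic import BaseModel

# Techniques 1 and 2: structured output with a constrained vocabulary
class Risk(BaseModel):
    clause: str
    severity: Literal["CRITICAL", "HIGH", "MEDIUM", "LOW"]
    reason: str
    confidence: Literal["HIGH", "MEDIUM", "LOW"]

class RiskAnalysisOutput(BaseModel):
    risks: List[Risk]
    summary: str

# Techniques 4 and 5: few-shot examples plus explicit confidence
RISK_ANALYSIS_PROMPT = """
Classify each clause in the document by risk level.
ALLOWED severity values: CRITICAL | HIGH | MEDIUM | LOW
EXAMPLES:
...
Return JSON with keys: risks, summary.
State confidence (HIGH/MEDIUM/LOW) for every risk.
Document: {document}
"""

def analyze_contract(doc: str) -> dict:
    # Techniques 3 and 6: temperature=0 inside call_llm, validate and retry
    result = get_validated_response(
        prompt=RISK_ANALYSIS_PROMPT.format(document=doc),
        schema=RiskAnalysisOutput,
    )
    # Route anything that is not uniformly HIGH confidence to a human
    if any(r["confidence"] != "HIGH" for r in result["risks"]):
        return {"action": "human_review", "result": result}
    return {"action": "auto_approve", "result": result}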
Key Takeaways
- Always use structured output: JSON with fixed schemas
- Constrain vocabularies: define allowed values explicitly
- Set temperature=0: for classification and extraction tasks
- Include confidence scores: route uncertainty to humans
- Validate and retry: never trust raw LLM output
- Use few-shot examples: show, don't just tell
Deterministic LLM outputs aren't about eliminating all variation—they're about making outputs predictable enough to build reliable systems on top of. With these techniques, we've achieved 99.7% output schema compliance across millions of production calls.
Need help building production-grade prompts?
We design deterministic AI agents that meet enterprise reliability requirements.
Start Assessment