When a client's AI system processes 50,000 requests per day, a 60% reduction in token usage can translate to tens of thousands of dollars in monthly savings. These aren't theoretical numbers; they come from six specific techniques we've applied across multiple production deployments.
Technique 1: Prompt Compression
System prompts tend to bloat over time as teams add edge case handling. We've seen system prompts grow to 3,000+ tokens. Apply these rules:
- Remove redundant instructions (the model already knows its own capabilities)
- Replace verbose examples with concise ones—quality beats quantity
- Use token-efficient phrasing: "Respond in JSON" not "Please format your response as valid JSON output"
- Audit every instruction: if it's there "just in case," remove it and test
Typical savings: 20-35% reduction in prompt tokens with no measurable quality loss.
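When comparing phrasings during a trim pass, a rough word-count heuristic is enough to rank candidates before a real tokenizer run. A minimal sketch; `approx_tokens` is a hypothetical helper, and the ~1.3 tokens-per-word ratio is only an approximation (use a real tokenizer such as tiktoken for production measurements):

```python
def approx_tokens(text: str) -> int:
    """Very rough token estimate: ~1.3 tokens per English word.
    Good enough to compare two phrasings, not for billing math."""
    return round(len(text.split()) * 1.3)

VERBOSE = "Please format your response as valid JSON output."
CONCISE = "Respond in JSON."
```

The point is not the exact counts but the habit: measure every rewrite, even crudely, so compression is driven by numbers rather than taste.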
Technique 2: Semantic Caching
Not all requests are unique. For knowledge-heavy agents, we implement semantic caching: embed the user query, check for similar past queries in Redis, and return the cached response if cosine similarity exceeds 0.95.
Typical savings: 30-60% cost reduction for high-volume knowledge retrieval workflows.
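The lookup logic fits in a few lines. This is a minimal in-memory sketch: the `SemanticCache` class is illustrative, a plain list stands in for Redis, and producing the query embedding (via an embedding model) is assumed to happen elsewhere:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """In-memory stand-in for a Redis-backed semantic cache."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_response) pairs

    def get(self, query_embedding):
        """Return a cached response whose query was similar enough, else None."""
        for embedding, response in self.entries:
            if cosine(query_embedding, embedding) >= self.threshold:
                return response
        return None

    def put(self, query_embedding, response):
        self.entries.append((query_embedding, response))
```

A linear scan is fine for a sketch; at production volume you would replace it with Redis's vector search or another approximate-nearest-neighbor index, and add TTLs so stale answers expire.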
Technique 3: Model Routing
Not every query needs GPT-4o or Claude 3.5 Sonnet. Route simple classification and extraction tasks to cheaper, faster models:
| Task Type | Recommended Model | Cost vs Frontier |
|---|---|---|
| Intent classification | Haiku / Flash | 95% cheaper |
| Structured extraction | Haiku / GPT-4o mini | 90% cheaper |
| Summarization | Sonnet / Flash | 70% cheaper |
| Complex reasoning | Opus / GPT-4o | Baseline |
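A routing layer can start as a lookup table keyed by task type. A minimal sketch; the task-type keys and model identifiers are illustrative placeholders, not fixed recommendations:

```python
# Map task types to cheaper models; unknown tasks fall back to frontier.
ROUTES = {
    "intent_classification": "claude-3-5-haiku",   # illustrative model ids
    "structured_extraction": "gpt-4o-mini",
    "summarization": "gemini-1.5-flash",
    "complex_reasoning": "claude-3-opus",
}

def route(task_type: str) -> str:
    """Pick a model for a task type, defaulting to the frontier model."""
    return ROUTES.get(task_type, ROUTES["complex_reasoning"])
```

Defaulting unknown task types to the frontier model keeps the failure mode safe: routing mistakes cost money, not quality.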
Techniques 4–6: Advanced Strategies
- RAG context trimming — Don't inject all retrieved chunks. Rerank and trim to the top 3-5 most relevant. Each additional chunk costs tokens without proportional quality gains.
- Response streaming with early stopping — For validation steps, you don't need the full response. Stream and stop when you've extracted the classification or flag.
- Prompt caching (native) — Anthropic and OpenAI both offer prompt caching for large stable system prompts. Enable this if your system prompt exceeds 1,024 tokens.
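The RAG trimming step above reduces to a sort-and-slice once rerank scores exist. A sketch assuming the scores come from a separate reranker (the `trim_context` function and its default `top_k` are illustrative):

```python
def trim_context(chunks, scores, top_k=4):
    """Keep only the top-k retrieved chunks by rerank score.

    chunks: list of text chunks from retrieval
    scores: parallel list of relevance scores from a reranker
    """
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

Sorting by rerank score rather than retrieval order matters: vector-search order is often noisy, and the cheapest token is the one you never send.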
Combined Impact
Applied together, these techniques consistently deliver 50-65% cost reduction without measurable quality degradation. The key is measuring first: instrument your token usage by step and workflow before optimizing.
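Instrumentation can start as a per-step counter fed from each API response's usage metadata. A minimal sketch; `TokenMeter` and the step names are hypothetical, and real usage fields come from your provider's response objects:

```python
from collections import defaultdict

class TokenMeter:
    """Accumulate input/output token counts per workflow step."""

    def __init__(self):
        self.usage = defaultdict(lambda: {"input": 0, "output": 0})

    def record(self, step, input_tokens, output_tokens):
        """Call once per LLM response, with counts from the API's usage data."""
        self.usage[step]["input"] += input_tokens
        self.usage[step]["output"] += output_tokens

    def report(self):
        return dict(self.usage)
```

Once you can see which step burns the most tokens, the six techniques above stop being a checklist and become a priority list.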
Running up high LLM API bills?
We audit production AI systems and implement cost optimization strategies. Most clients see payback in the first month.
Start Assessment