The Challenge
Our client, a European fintech processing 50,000+ transactions per hour, was drowning in alerts. Their on-call engineers faced:
- 2,000+ alerts per day across 40+ microservices
- 85% noise rate - most alerts were duplicates or low-priority
- 45-minute average MTTR for P1 incidents
- On-call burnout leading to engineer attrition
The SRE team had tried rule-based alert grouping, but the complexity of their distributed system made static rules impossible to maintain.
Our Approach
We deployed an AI-powered triage agent that processes alerts in real time. The agent performs three key functions:
Semantic Clustering
Instead of matching exact strings, the agent understands that "Connection timeout to postgres-primary" and "Database connection failed: read timeout" are the same issue. It groups alerts by semantic similarity, reducing 100 alerts to 5 actionable clusters.
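To make the idea concrete, here is a minimal sketch of this kind of semantic grouping, assuming a general-purpose sentence-embedding model. The model name and the 0.75 similarity threshold are illustrative choices, not the client's configuration:

```python
# Minimal sketch: group alert messages by semantic similarity instead of exact text.
# Model name and threshold are illustrative, not the production configuration.
from sentence_transformers import SentenceTransformer  # assumed dependency
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_alerts(messages, threshold=0.75):
    """Greedy clustering: each alert joins the first cluster whose representative it matches."""
    embeddings = model.encode(messages, normalize_embeddings=True)
    clusters = []  # list of (representative embedding, [member indices])
    for i, emb in enumerate(embeddings):
        for representative, members in clusters:
            if float(np.dot(emb, representative)) >= threshold:  # cosine similarity
                members.append(i)
                break
        else:
            clusters.append((emb, [i]))
    return [[messages[i] for i in members] for _, members in clusters]

print(cluster_alerts([
    "Connection timeout to postgres-primary",
    "Database connection failed: read timeout",
    "High CPU on checkout-service pod 7",
]))
```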
Temporal Correlation
The agent tracks alert timing to identify cascading failures. When the payment service fails, it knows the downstream alerts from order-service and notification-service are symptoms, not root causes.
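A simplified sketch of that idea follows, assuming a hand-maintained service dependency map and a fixed correlation window; both are illustrative stand-ins for the topology the agent actually works from:

```python
# Minimal sketch of temporal correlation: alerts from services downstream of an
# already-alerting dependency, within a short window, are flagged as likely symptoms.
# The dependency map and window are illustrative assumptions.
from datetime import datetime, timedelta

# service -> services it depends on (edges point "upstream")
DEPENDS_ON = {
    "order-service": ["payment-service"],
    "notification-service": ["order-service"],
}

def classify(alerts, window=timedelta(minutes=5)):
    """alerts: list of (timestamp, service). Returns {service: 'root-cause-candidate' | 'symptom'}."""
    verdicts = {}
    by_service = {svc: ts for ts, svc in alerts}
    for ts, svc in alerts:
        upstream_alerting = any(
            dep in by_service and abs(ts - by_service[dep]) <= window
            for dep in DEPENDS_ON.get(svc, [])
        )
        verdicts[svc] = "symptom" if upstream_alerting else "root-cause-candidate"
    return verdicts

now = datetime.now()
print(classify([
    (now, "payment-service"),
    (now + timedelta(seconds=40), "order-service"),
    (now + timedelta(seconds=55), "notification-service"),
]))
```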
Hypothesis Generation
Rather than just presenting data, the agent generates ranked hypotheses with confidence scores. "Most likely: Database connection pool exhaustion (85% confidence). Second: Network partition between AZ-1 and AZ-2 (12% confidence)."
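As an illustration of what that output can look like in code, here is a small sketch of ranked hypotheses; the field names and the tie-breaking rule are assumptions, not the agent's internal format:

```python
# Minimal sketch of the hypothesis output shape: ranked candidates with a confidence
# score and the evidence behind each. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    summary: str
    confidence: float            # 0.0 - 1.0
    evidence: list[str] = field(default_factory=list)

def rank(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Highest-confidence hypothesis first; ties broken by amount of evidence."""
    return sorted(hypotheses, key=lambda h: (h.confidence, len(h.evidence)), reverse=True)

for h in rank([
    Hypothesis("Network partition between AZ-1 and AZ-2", 0.12,
               ["cross-AZ latency spike at 00:00"]),
    Hypothesis("Database connection pool exhaustion", 0.85,
               ["pool usage at 100% on postgres-primary",
                "timeouts clustered on DB-backed services"]),
]):
    print(f"{h.confidence:.0%}  {h.summary}")
```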
Implementation Timeline
1. Connected the agent to their existing stack: Datadog, PagerDuty, and internal deployment tracking.
2. Analyzed 6 months of incident history to learn the system's failure patterns.
3. Ran the agent in shadow mode alongside human triage, taking no actions, to validate its accuracy (a validation sketch follows this list).
4. Rolled out gradually, with a human in the loop for every suggested action.
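For the shadow-mode phase, one way to measure accuracy is to compare how the agent grouped alerts against how on-call engineers grouped them, without requiring incident IDs to match. The pairwise agreement metric below is an illustrative choice, not the validation method used on this engagement:

```python
# Minimal sketch of the shadow-mode validation step: compare the agent's alert-to-incident
# grouping against what on-call engineers actually did, without taking any action.
# The agreement metric and data shapes are illustrative assumptions.
def agreement_rate(agent_groups: dict[str, str], human_groups: dict[str, str]) -> float:
    """Both maps: alert_id -> incident_id. Returns the share of alert pairs that the
    agent and the humans grouped the same way (together vs. apart)."""
    alert_ids = sorted(set(agent_groups) & set(human_groups))
    pairs = [(a, b) for i, a in enumerate(alert_ids) for b in alert_ids[i + 1:]]
    if not pairs:
        return 1.0
    agree = sum(
        (agent_groups[a] == agent_groups[b]) == (human_groups[a] == human_groups[b])
        for a, b in pairs
    )
    return agree / len(pairs)

print(agreement_rate(
    {"a1": "inc-1", "a2": "inc-1", "a3": "inc-2"},
    {"a1": "INC-77", "a2": "INC-77", "a3": "INC-78"},
))  # 1.0: same pairwise grouping even though the incident IDs differ
```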
A Real Incident: Before & After
Here's how the same type of incident played out before and after deploying the triage agent:
Before the agent:
- 00:00 - 47 alerts fire across 8 services
- 00:03 - On-call paged, starts investigating
- 00:15 - Still trying to find root cause
- 00:28 - Identifies database as the issue
- 00:35 - Restarts connection pool
- 00:42 - Services recover

With the agent:
- 00:00 - 47 alerts fire across 8 services
- 00:00 - Agent clusters them into 1 incident
- 00:01 - Hypothesis: DB pool exhaustion (92%)
- 00:02 - On-call paged with root cause
- 00:08 - Restarts connection pool
- 00:15 - Services recover
Key Success Factors
Integration with existing tools
We didn't replace Datadog or PagerDuty. The agent sits between them, enriching alerts before they reach humans.
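A minimal sketch of that in-between pattern, assuming Datadog posts alert webhooks as JSON to the agent (the inbound field names are assumptions about the webhook configuration) and the enriched alert is forwarded through PagerDuty's public Events API v2:

```python
# Minimal sketch of the enrichment layer sitting between Datadog and PagerDuty.
# Inbound payload fields and the enrich() stub are assumptions; the outbound call
# uses PagerDuty's Events API v2.
import os
import requests
from flask import Flask, request

app = Flask(__name__)
PAGERDUTY_ROUTING_KEY = os.environ.get("PAGERDUTY_ROUTING_KEY", "")  # assumed env var

def enrich(alert: dict) -> dict:
    """Placeholder for the agent: attach cluster id, hypothesis, and confidence."""
    return {
        "cluster_id": "inc-demo",                       # from semantic clustering
        "hypothesis": "DB connection pool exhaustion",  # from hypothesis generation
        "confidence": 0.92,
    }

@app.post("/webhooks/datadog")
def handle_alert():
    alert = request.get_json(force=True)  # assumed shape: {"title": ..., "host": ..., "severity": ...}
    context = enrich(alert)
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "trigger",
            "dedup_key": context["cluster_id"],  # duplicate alerts collapse into one incident
            "payload": {
                "summary": f'{alert.get("title", "alert")} | likely: {context["hypothesis"]}',
                "source": alert.get("host", "unknown"),
                "severity": alert.get("severity", "error"),
                "custom_details": context,
            },
        },
        timeout=5,
    )
    return {"status": "forwarded"}, 200
```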
Traceable reasoning
Every hypothesis comes with an explanation. Engineers can see why the agent thinks the database is the issue.
Human-in-the-loop
The agent suggests; humans decide. This built trust with the SRE team.
Continuous learning
When engineers provide feedback on hypotheses, the agent improves over time.
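One simple way such feedback can flow back into scoring, shown purely as an illustration (the blending and smoothing below are assumptions, not the production learning mechanism):

```python
# Minimal sketch of the feedback loop: engineer verdicts on past hypotheses nudge
# future confidence scores for the same failure category.
from collections import defaultdict

class FeedbackStore:
    def __init__(self):
        self._accepted = defaultdict(int)
        self._rejected = defaultdict(int)

    def record(self, category: str, accepted: bool) -> None:
        (self._accepted if accepted else self._rejected)[category] += 1

    def adjusted_confidence(self, category: str, model_confidence: float) -> float:
        """Blend the model's score with the category's historical acceptance rate
        (Laplace-smoothed), weighting history more as feedback accumulates."""
        a, r = self._accepted[category], self._rejected[category]
        history = (a + 1) / (a + r + 2)
        weight = min(0.5, (a + r) / 20)   # cap history's influence at 50%
        return (1 - weight) * model_confidence + weight * history

store = FeedbackStore()
store.record("db-pool-exhaustion", accepted=True)
store.record("db-pool-exhaustion", accepted=True)
print(round(store.adjusted_confidence("db-pool-exhaustion", 0.85), 2))
```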
Results After 3 Months
"The agent doesn't just reduce noise - it gives us a starting point for every investigation. Instead of spending 15 minutes figuring out what's wrong, we spend that time fixing it."
Drowning in alerts?
Let us analyze your incident patterns and show you what's possible.
Start Assessment