Case Study · December 20, 2025 · 6 min read

40% MTTR Reduction: AI-Powered Incident Triage in Practice

How a fintech team reduced mean time to resolution by clustering alerts with AI and automating root cause analysis.

  • 40% MTTR reduction
  • 73% alert noise reduction
  • 2.5x faster first response

The Challenge

Our client, a European fintech processing 50,000+ transactions per hour, was drowning in alerts. Their on-call engineers faced:

  • 2,000+ alerts per day from their monitoring stack
  • first response times averaging 8 minutes
  • a mean time to resolution of 45 minutes
  • an on-call satisfaction score of 2.1/5

The SRE team had tried rule-based alert grouping, but the complexity of their distributed system made static rules impossible to maintain.

Our Approach

We deployed an AI-powered triage agent that processes alerts in real time. The agent performs three key functions:

1. Semantic Clustering

Instead of matching exact strings, the agent understands that "Connection timeout to postgres-primary" and "Database connection failed: read timeout" are the same issue. It groups alerts by semantic similarity, reducing 100 alerts to 5 actionable clusters.
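
To make the clustering concrete, here is a minimal sketch of embedding-based grouping. It assumes the sentence-transformers library and an off-the-shelf model; the client's actual model, similarity threshold, and clustering algorithm are not described in this case study.

```python
# Minimal sketch: embedding-based alert clustering (assumptions noted above).
from sentence_transformers import SentenceTransformer  # assumed dependency
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice


def cluster_alerts(messages: list[str], threshold: float = 0.75) -> list[list[str]]:
    """Greedy clustering: an alert joins the most similar existing cluster
    if its cosine similarity to that cluster's centroid clears the threshold,
    otherwise it starts a new cluster."""
    embeddings = model.encode(messages, normalize_embeddings=True)
    clusters: list[dict] = []  # each entry: {"centroid": vector, "members": [...]}
    for message, vector in zip(messages, embeddings):
        best = None
        for cluster in clusters:
            similarity = float(np.dot(vector, cluster["centroid"]))
            if similarity >= threshold and (best is None or similarity > best[0]):
                best = (similarity, cluster)
        if best is None:
            clusters.append({"centroid": vector, "members": [message]})
        else:
            cluster = best[1]
            cluster["members"].append(message)
            # Recompute the centroid as the (renormalized) mean of the members.
            n = len(cluster["members"])
            centroid = (cluster["centroid"] * (n - 1) + vector) / n
            cluster["centroid"] = centroid / np.linalg.norm(centroid)
    return [c["members"] for c in clusters]


# "Connection timeout to postgres-primary" and
# "Database connection failed: read timeout" land in the same cluster.
```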

2. Temporal Correlation

The agent tracks alert timing to identify cascading failures. When the payment service fails, it knows the downstream alerts from order-service and notification-service are symptoms, not root causes.
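
Here is a sketch of how that temporal correlation might work, assuming a hand-maintained service dependency map and a fixed correlation window; both the topology and the two-minute window below are illustrative, not the client's configuration.

```python
# Sketch: suppress downstream symptoms using alert timing plus a dependency map.
from datetime import datetime, timedelta

# Downstream service -> the upstream services it depends on (assumed topology).
DEPENDS_ON = {
    "order-service": {"payment-service"},
    "notification-service": {"order-service", "payment-service"},
}

WINDOW = timedelta(minutes=2)  # assumed correlation window


def mark_symptoms(alerts: list[dict]) -> list[dict]:
    """Flag an alert as a symptom if one of its upstream dependencies
    alerted shortly before it did."""
    alerts = sorted(alerts, key=lambda a: a["time"])
    first_seen: dict[str, datetime] = {}  # service -> earliest alert time
    for alert in alerts:
        upstream = DEPENDS_ON.get(alert["service"], set())
        alert["symptom"] = any(
            dep in first_seen and alert["time"] - first_seen[dep] <= WINDOW
            for dep in upstream
        )
        first_seen.setdefault(alert["service"], alert["time"])
    return alerts


alerts = [
    {"service": "payment-service", "time": datetime(2025, 12, 1, 0, 0, 0)},
    {"service": "order-service", "time": datetime(2025, 12, 1, 0, 0, 40)},
    {"service": "notification-service", "time": datetime(2025, 12, 1, 0, 1, 5)},
]
# Only the payment-service alert survives as a root-cause candidate.
print([a["service"] for a in mark_symptoms(alerts) if not a["symptom"]])
```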

3. Hypothesis Generation

Rather than just presenting data, the agent generates ranked hypotheses with confidence scores. "Most likely: Database connection pool exhaustion (85% confidence). Second: Network partition between AZ-1 and AZ-2 (12% confidence)."
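
One way to represent and rank such hypotheses is sketched below. The evidence strings and raw scores are illustrative, and normalizing scores into percentages is an assumption about how confidences like "85%" might be produced.

```python
# Sketch: ranked root-cause hypotheses with normalized confidence scores.
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    cause: str
    evidence: list[str] = field(default_factory=list)
    score: float = 0.0  # unnormalized evidence weight


def rank(hypotheses: list[Hypothesis]) -> list[tuple[str, float]]:
    """Turn raw scores into confidences that sum to 1 and sort descending."""
    total = sum(h.score for h in hypotheses) or 1.0
    ranked = sorted(hypotheses, key=lambda h: h.score, reverse=True)
    return [(h.cause, h.score / total) for h in ranked]


candidates = [
    Hypothesis("Database connection pool exhaustion",
               ["pool wait time spiked", "all failing calls touch postgres-primary"], 8.5),
    Hypothesis("Network partition between AZ-1 and AZ-2",
               ["packet loss on one inter-AZ link"], 1.2),
    Hypothesis("Bad deploy of payment-service", [], 0.3),
]

for cause, confidence in rank(candidates):
    print(f"{confidence:.0%}  {cause}")
# 85%  Database connection pool exhaustion
# 12%  Network partition between AZ-1 and AZ-2
# 3%  Bad deploy of payment-service
```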

Implementation Timeline

Weeks 1-2: Data Integration

Connected to their existing stack: Datadog, PagerDuty, and internal deployment tracking (a sketch of the normalization layer this implies follows the timeline).

Weeks 3-4: Pattern Learning

Analyzed 6 months of incident history to learn their system's failure patterns.

Weeks 5-6: Shadow Mode

The agent ran alongside human triage without taking any action, so its clusters and hypotheses could be validated against real incidents.

Weeks 7-8: Production Deployment

Gradual rollout with human-in-the-loop for all suggested actions.
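
As noted for weeks 1-2, the integration work boils down to mapping each source onto a single alert schema the agent can reason over. The sketch below uses simplified, hypothetical payload fields; it does not reflect the actual Datadog or PagerDuty webhook formats.

```python
# Sketch: normalizing monitoring payloads into one schema (fields are placeholders).
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class NormalizedAlert:
    source: str        # "datadog", "pagerduty", "deploy-tracker", ...
    service: str
    message: str
    severity: str
    fired_at: datetime


def normalize(source: str, payload: dict) -> NormalizedAlert:
    """Map an incoming payload onto the schema the triage agent consumes."""
    return NormalizedAlert(
        source=source,
        service=payload.get("service", "unknown"),
        message=payload.get("message", ""),
        severity=payload.get("severity", "warning"),
        fired_at=datetime.fromtimestamp(payload.get("timestamp", 0), tz=timezone.utc),
    )
```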

A Real Incident: Before & After

Here's how the same type of incident played out before and after deploying the triage agent:

Before: Manual Triage
  • 00:00 - 47 alerts fire across 8 services
  • 00:03 - On-call paged, starts investigating
  • 00:15 - Still trying to find root cause
  • 00:28 - Identifies database as the issue
  • 00:35 - Restarts connection pool
  • 00:42 - Services recover
MTTR: 42 minutes

After: AI-Assisted Triage
  • 00:00 - 47 alerts fire across 8 services
  • 00:00 - Agent clusters into 1 incident
  • 00:01 - Hypothesis: DB pool exhaustion (92%)
  • 00:02 - On-call paged with root cause
  • 00:08 - Restarts connection pool
  • 00:15 - Services recover
MTTR: 15 minutes

Key Success Factors

  1. Integration with existing tools

    We didn't replace Datadog or PagerDuty. The agent sits between them, enriching alerts before they reach humans.

  2. Traceable reasoning

    Every hypothesis comes with an explanation. Engineers can see why the agent thinks the database is the issue.

  3. Human-in-the-loop

    The agent suggests; humans decide. This built trust with the SRE team.

  4. Continuous learning

    When engineers confirm or reject hypotheses, that feedback is folded back into the agent's ranking so it improves over time (a sketch of this loop follows the list).
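
Here is a sketch of what that feedback loop could look like: a confirmed hypothesis raises a per-cause prior, a rejected one lowers it, and the prior scales future evidence scores before ranking. The update rule and learning rate are illustrative assumptions, not the production learning algorithm.

```python
# Sketch: folding engineer feedback into hypothesis ranking (assumed update rule).
from collections import defaultdict


class HypothesisFeedback:
    def __init__(self, learning_rate: float = 0.1):
        self.learning_rate = learning_rate
        self.priors: dict[str, float] = defaultdict(lambda: 1.0)

    def record(self, cause: str, confirmed: bool) -> None:
        """An engineer marked a suggested root cause as right or wrong."""
        direction = 1.0 if confirmed else -0.5
        updated = self.priors[cause] * (1 + self.learning_rate * direction)
        self.priors[cause] = max(0.1, updated)  # keep priors bounded away from zero

    def weight(self, cause: str, base_score: float) -> float:
        """Scale a raw evidence score by the learned prior before ranking."""
        return base_score * self.priors[cause]


feedback = HypothesisFeedback()
feedback.record("Database connection pool exhaustion", confirmed=True)
feedback.record("Network partition between AZ-1 and AZ-2", confirmed=False)
```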

Results After 3 Months

  • Mean Time to Resolution: 45 min → 27 min
  • Alerts Per Day: 2,000+ raw → ~540 after clustering
  • First Response Time: 8 min → 3 min
  • On-Call Satisfaction: 2.1/5 → 4.2/5

"The agent doesn't just reduce noise - it gives us a starting point for every investigation. Instead of spending 15 minutes figuring out what's wrong, we spend that time fixing it."
— SRE Lead, Fintech Client

Drowning in alerts?

Let us analyze your incident patterns and show you what's possible.

Start Assessment