MTTR Calculator
How fast do you recover from incidents?
MTTR benchmarks
| Performance | MTTR | Typical profile |
|---|---|---|
| Elite | < 1h | Mature SRE teams, automated remediation |
| High | 1-4h | Good observability, on-call processes |
| Medium | 4-24h | Basic monitoring, manual response |
| Low | > 24h | Limited visibility, reactive only |
Based on DORA research and industry data. Your ideal MTTR depends on your SLA commitments.
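The bands above can be expressed as a small classifier. This is a sketch assuming the thresholds from the table; `dora_mttr_category` is an illustrative name, not a standard API.

```python
from datetime import timedelta

def dora_mttr_category(mttr: timedelta) -> str:
    """Map a mean time to recovery onto the performance bands above."""
    hours = mttr.total_seconds() / 3600
    if hours < 1:
        return "Elite"
    if hours <= 4:
        return "High"
    if hours <= 24:
        return "Medium"
    return "Low"

print(dora_mttr_category(timedelta(minutes=45)))  # Elite
```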
MTTR, MTTF, MTBF, MTTA: what's what?
MTTR (Mean Time To Recovery)
How long it takes to restore service after an incident starts. Clock starts when users are impacted, stops when they're not. This is what most teams track.
MTTA (Mean Time To Acknowledge)
How long until someone starts working on an incident. The gap between alert firing and a human responding. Long MTTA means slow pager response or alert fatigue.
MTTF (Mean Time To Failure)
How long a system runs before failing. Used for non-repairable components. If your SSD has a 1.5M hour MTTF, that's the average lifespan before it dies.
MTBF (Mean Time Between Failures)
For repairable systems: MTBF = MTTF + MTTR. It's the full cycle time from one failure to the next. Higher MTBF means more reliable systems.
The one that matters most: MTTR. You can't prevent all failures, but you can get faster at fixing them. Elite teams focus on reducing MTTR rather than chasing zero incidents.
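The MTBF relationship above can be made concrete with a quick worked example. The numbers here are purely illustrative; the availability ratio (MTTF over MTBF) is the standard way these metrics combine.

```python
# Illustrative numbers: a service that runs ~500 hours between
# failures and takes ~2 hours to restore on average.
mttf_hours = 500.0   # mean time to failure (uptime per cycle)
mttr_hours = 2.0     # mean time to recovery (downtime per cycle)

mtbf_hours = mttf_hours + mttr_hours  # full failure-to-failure cycle

# Fraction of each cycle the service is actually up.
availability = mttf_hours / mtbf_hours
print(f"MTBF: {mtbf_hours} h, availability: {availability:.4f}")
```

Note how shrinking MTTR raises availability even if failures happen just as often, which is why elite teams focus there.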
Breaking down MTTR
MTTR is the sum of four phases. Improving any of them improves your total recovery time:
- Detection: Time from failure occurring to alert firing. Faster checks, better thresholds, synthetic monitoring. Most teams lose minutes here without realizing it.
- Acknowledgment: Time from alert to human engagement. On-call schedules, escalation policies, reducing alert noise. If pagers go unanswered, everything else is moot.
- Diagnosis: Time to understand what's wrong. Logs, traces, dashboards, runbooks. Good observability turns 30-minute investigations into 2-minute ones.
- Remediation: Time to fix and verify. Rollbacks, feature flags, automated remediation. The goal is safe, fast recovery, not necessarily fixing the root cause yet.
Where to start: Detection is often the biggest win. If you're finding out about incidents from customers, you're already behind. Monitoring that catches issues in seconds instead of minutes compounds across every incident.
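The four-phase breakdown can be sketched as a simple sum. The per-phase timings below are hypothetical, just to show how the pieces add up to a single incident's recovery time.

```python
# Hypothetical per-phase timings (minutes) for a single incident.
phases = {
    "detect": 4.0,       # failure occurs -> alert fires
    "acknowledge": 3.0,  # alert fires -> human engaged
    "diagnose": 15.0,    # human engaged -> cause understood
    "remediate": 8.0,    # fix applied and verified
}

recovery_minutes = sum(phases.values())
print(f"Total recovery time: {recovery_minutes} min")  # 30.0 min
```

Logging these phases per incident shows you which one dominates, so you invest where it actually helps.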
How to calculate MTTR: formula and example
MTTR = Total downtime / Number of incidents
Say your team had 4 incidents last quarter:
| Incident | Duration |
|---|---|
| Database failover delay | 45 min |
| Bad deploy (rollback) | 12 min |
| DNS propagation issue | 90 min |
| Certificate expiration | 33 min |
MTTR = (45 + 12 + 90 + 33) / 4 = 180 / 4 = 45 minutes
That puts this team in the "elite" category by DORA standards. But averages can be misleading: that 90-minute DNS incident is doing the heavy lifting. Track individual incident durations alongside your MTTR so outliers don't hide behind the mean.
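The calculation above, including the outlier check, is a one-liner with the standard library. This uses the four incidents from the table; `statistics.mean` and `statistics.median` are stdlib functions.

```python
from statistics import mean, median

# The four incidents from the table above (durations in minutes).
incidents = {
    "Database failover delay": 45,
    "Bad deploy (rollback)": 12,
    "DNS propagation issue": 90,
    "Certificate expiration": 33,
}

durations = list(incidents.values())
mttr = mean(durations)  # 180 / 4 = 45, matching the worked example
print(f"MTTR: {mttr} min, median: {median(durations)} min, "
      f"worst: {max(durations)} min")
```

Reporting the median and worst case alongside the mean is what keeps that 90-minute DNS outlier from hiding behind the average.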
How to reduce MTTR for your infrastructure
Every minute of MTTR is a minute your users are affected. Reducing it compounds over time: fewer SLA breaches, less stress on the team, and faster iteration on features instead of fighting fires. Here's where to focus:
Invest in detection first. The fastest way to reduce MTTR is to shrink the gap between "something broke" and "someone knows about it." If your monitoring checks every 5 minutes, you're losing up to 5 minutes before anyone even starts working on the problem. Sub-minute checks with intelligent alerting can cut detection time to seconds.
Write runbooks for common failures. At 3 AM, nobody thinks clearly. A runbook that says "if this alert fires, check X, then restart Y, then verify Z" turns a 30-minute diagnosis into a 5-minute checklist. Keep them short, link them from your alerts, and update them after every incident.
Make rollbacks trivial. If every deployment can be reverted in under a minute, your worst-case MTTR for deploy-related incidents drops to single digits. Feature flags, blue-green deploys, and canary releases all help here. The goal is making "undo" faster than "debug and fix forward."
Track and review your MTTR monthly. You can't improve what you don't measure. Run a monthly review of incidents, group them by root cause, and look for patterns. Are most incidents DNS-related? Invest in DNS redundancy. Are deploys the usual suspect? Improve your CI/CD pipeline. Let the data drive your reliability roadmap.
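The monthly grouping step can be as simple as a tally over tagged incidents. The incident log below is hypothetical; `collections.Counter` is the stdlib tool for this kind of frequency count.

```python
from collections import Counter

# Hypothetical one-month incident log: (root_cause, duration_minutes).
incident_log = [
    ("dns", 90),
    ("deploy", 12),
    ("deploy", 8),
    ("database", 45),
    ("deploy", 20),
]

# Tally incidents by root cause to see where to invest.
by_cause = Counter(cause for cause, _ in incident_log)
print(by_cause.most_common())  # deploys dominate -> improve CI/CD
```

Tagging every incident with a root cause at review time is what makes this pattern-spotting possible.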
Related Resources
- SLA Uptime Calculator: See how much downtime your SLA actually allows and track your error budget
- Server Alerting & Notifications: Reduce detection time with sub-minute health checks and instant alerts
- Custom Monitoring Dashboards: Visualize incident trends and system health at a glance
- Load Average Interpreter: Diagnose server performance issues faster during incidents
Catch issues before your customers do
Start monitoring with fivenines.io