MTTR Calculator

How fast do you recover from incidents?


MTTR benchmarks

Performance | MTTR  | Typical profile
Elite       | < 1h  | Mature SRE teams, automated remediation
High        | 1-4h  | Good observability, on-call processes
Medium      | 4-24h | Basic monitoring, manual response
Low         | > 24h | Limited visibility, reactive only

Based on DORA research and industry data. Your ideal MTTR depends on your SLA commitments.

MTTR, MTTF, MTBF, MTTA: What's what?

MTTR (Mean Time To Recovery)

How long it takes to restore service after an incident starts. Clock starts when users are impacted, stops when they're not. This is what most teams track.

MTTA (Mean Time To Acknowledge)

How long until someone starts working on an incident. The gap between alert firing and a human responding. Long MTTA means slow pager response or alert fatigue.

MTTF (Mean Time To Failure)

How long a system runs before failing. Used for non-repairable components. If your SSD has a 1.5M hour MTTF, that's the average lifespan before it dies.

MTBF (Mean Time Between Failures)

For repairable systems: MTBF = MTTF + MTTR. It's the full cycle time from one failure to the next. Higher MTBF means more reliable systems.
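The relationship is simple enough to show directly. The numbers below are illustrative assumptions, not figures from any real system:

```python
# Illustrative numbers (assumed): a service that runs 200 hours on
# average before failing and takes 2 hours to restore.
mttf_hours = 200.0  # mean time to failure (run time before a failure)
mttr_hours = 2.0    # mean time to recovery (repair time)

# For a repairable system, one full failure cycle is run time + repair time.
mtbf_hours = mttf_hours + mttr_hours
print(mtbf_hours)  # 202.0
```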

The one that matters most: MTTR. You can't prevent all failures, but you can get faster at fixing them. Elite teams focus on reducing MTTR rather than chasing zero incidents.

Breaking down MTTR

MTTR is the sum of four phases. Improving any of them improves your total recovery time:

Detect

Time from failure occurring to alert firing. Faster checks, better thresholds, synthetic monitoring. Most teams lose minutes here without realizing it.

Respond

Time from alert to human engagement. On-call schedules, escalation policies, reducing alert noise. If pagers go unanswered, everything else is moot.

Diagnose

Time to understand what's wrong. Logs, traces, dashboards, runbooks. Good observability turns 30-minute investigations into 2-minute ones.

Repair

Time to fix and verify. Rollbacks, feature flags, automated remediation. The goal is safe, fast recovery, not necessarily fixing the root cause yet.

Where to start: Detection is often the biggest win. If you're finding out about incidents from customers, you're already behind. Monitoring that catches issues in seconds instead of minutes compounds across every incident.
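The four phases above sum to the total, so improving any one of them shrinks MTTR directly. A minimal sketch, with made-up phase durations:

```python
# Hypothetical incident timeline in minutes; the keys mirror the four
# phases above, the values are illustrative assumptions.
phases = {"detect": 8, "respond": 4, "diagnose": 20, "repair": 13}

total_recovery = sum(phases.values())  # recovery time for this incident
print(f"Total recovery: {total_recovery} min")  # 45 min

# Cutting any single phase reduces the total one-for-one, e.g. cutting
# detection from 8 minutes to 1 with faster checks:
phases["detect"] = 1
print(f"After faster detection: {sum(phases.values())} min")  # 38 min
```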

How to Calculate MTTR: Formula and Example

MTTR = Total downtime / Number of incidents

Say your team had 4 incidents last quarter:

Incident                | Duration
Database failover delay | 45 min
Bad deploy (rollback)   | 12 min
DNS propagation issue   | 90 min
Certificate expiration  | 33 min

MTTR = (45 + 12 + 90 + 33) / 4 = 180 / 4 = 45 minutes

That puts this team in the "elite" category by DORA standards. But averages can be misleading: that 90-minute DNS incident is doing the heavy lifting. Track individual incident durations alongside your MTTR so outliers don't hide behind the mean.
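The same calculation in code, using the durations from the table above. Reporting the median alongside the mean is one simple way to keep outliers like that DNS incident visible:

```python
from statistics import mean, median

# Incident durations in minutes, from the worked example above.
durations = [45, 12, 90, 33]

mttr = mean(durations)
print(f"MTTR: {mttr:.0f} min")          # 45 min, matching the example

# The median is less sensitive to the 90-minute outlier:
print(f"Median: {median(durations):.0f} min")  # 39 min
```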

How to Reduce MTTR for Your Infrastructure

Every minute of MTTR is a minute your users are affected. Reducing it compounds over time: fewer SLA breaches, less stress on the team, and faster iteration on features instead of fighting fires. Here's where to focus:

Invest in detection first. The fastest way to reduce MTTR is to shrink the gap between "something broke" and "someone knows about it." If your monitoring checks every 5 minutes, you're losing up to 5 minutes before anyone even starts working on the problem. Sub-minute checks with intelligent alerting can cut detection time to seconds.
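The "up to 5 minutes" figure follows from the check interval: a failure lands at a random point between checks, so on average it waits half an interval before the next check can see it. A sketch of that arithmetic (the confirmation-count parameter is an assumption, modeling alerts that fire only after several consecutive failed checks):

```python
def avg_detection_delay(check_interval_s: float, confirmations: int = 1) -> float:
    """Average seconds from failure to alert, assuming the failure lands at
    a uniformly random point in the interval and the alert fires after
    `confirmations` consecutive failed checks."""
    return check_interval_s / 2 + (confirmations - 1) * check_interval_s

print(avg_detection_delay(300))    # 5-minute checks -> 150.0 s on average
print(avg_detection_delay(30, 2))  # 30 s checks, 2 confirmations -> 45.0 s
```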

Write runbooks for common failures. At 3 AM, nobody thinks clearly. A runbook that says "if this alert fires, check X, then restart Y, then verify Z" turns a 30-minute diagnosis into a 5-minute checklist. Keep them short, link them from your alerts, and update them after every incident.

Make rollbacks trivial. If every deployment can be reverted in under a minute, your worst-case MTTR for deploy-related incidents drops to single digits. Feature flags, blue-green deploys, and canary releases all help here. The goal is making "undo" faster than "debug and fix forward."

Track and review your MTTR monthly. You can't improve what you don't measure. Run a monthly review of incidents, group them by root cause, and look for patterns. Are most incidents DNS-related? Invest in DNS redundancy. Are deploys the usual suspect? Improve your CI/CD pipeline. Let the data drive your reliability roadmap.
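A monthly review like this can start as a few lines of scripting over an incident log. The log below is hypothetical, purely to show the grouping:

```python
from collections import Counter

# Hypothetical quarterly incident log: (root_cause, duration_minutes).
incidents = [
    ("dns", 90), ("deploy", 12), ("database", 45),
    ("deploy", 8), ("certificate", 33), ("dns", 60),
]

# Count incidents and total downtime per root cause to surface patterns.
counts = Counter(cause for cause, _ in incidents)
downtime = Counter()
for cause, minutes in incidents:
    downtime[cause] += minutes

for cause, n in counts.most_common():
    print(f"{cause}: {n} incidents, {downtime[cause]} min total")
```

Sorted by frequency, the categories that dominate the list are where reliability investment pays off first.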

Related Resources

Catch issues before your customers do

Start monitoring with fivenines.io