MTTR Calculator
How fast do you recover from incidents?
MTTR benchmarks
| Performance | MTTR | Typical profile |
|---|---|---|
| Elite | < 1h | Mature SRE teams, automated remediation |
| High | 1-4h | Good observability, on-call processes |
| Medium | 4-24h | Basic monitoring, manual response |
| Low | > 24h | Limited visibility, reactive only |
Based on DORA research and industry data. Your ideal MTTR depends on your SLA commitments.
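The bands above can be expressed as a small classifier. This is a sketch assuming the thresholds from the table; `dora_mttr_category` is an illustrative name, not a standard API.

```python
from datetime import timedelta

def dora_mttr_category(mttr: timedelta) -> str:
    """Map a mean time to recovery onto the performance bands above."""
    hours = mttr.total_seconds() / 3600
    if hours < 1:
        return "Elite"
    if hours <= 4:
        return "High"
    if hours <= 24:
        return "Medium"
    return "Low"

print(dora_mttr_category(timedelta(minutes=45)))  # Elite
```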
MTTR, MTTF, MTBF, MTTA: what's what?
MTTR (Mean Time To Recovery)
How long it takes to restore service after an incident starts. Clock starts when users are impacted, stops when they're not. This is what most teams track.
MTTA (Mean Time To Acknowledge)
How long until someone starts working on an incident. The gap between alert firing and a human responding. Long MTTA means slow pager response or alert fatigue.
MTTF (Mean Time To Failure)
How long a system runs before failing. Used for non-repairable components. If your SSD has a 1.5M hour MTTF, that's the average lifespan before it dies.
MTBF (Mean Time Between Failures)
For repairable systems: MTBF = MTTF + MTTR. It's the full cycle time from one failure to the next. Higher MTBF means more reliable systems.
The one that matters most: MTTR. You can't prevent all failures, but you can get faster at fixing them. Elite teams focus on reducing MTTR rather than chasing zero incidents.
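The MTBF relationship above can be made concrete with a quick worked example. The numbers here are purely illustrative; the availability ratio (MTTF over MTBF) is the standard way these metrics combine.

```python
# Illustrative numbers: a service that runs ~500 hours between
# failures and takes ~2 hours to restore on average.
mttf_hours = 500.0   # mean time to failure (uptime per cycle)
mttr_hours = 2.0     # mean time to recovery (downtime per cycle)

mtbf_hours = mttf_hours + mttr_hours  # full failure-to-failure cycle

# Fraction of each cycle the service is actually up.
availability = mttf_hours / mtbf_hours
print(f"MTBF: {mtbf_hours} h, availability: {availability:.4f}")
```

Note how shrinking MTTR raises availability even if failures happen just as often, which is why elite teams focus there.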
Breaking down MTTR
MTTR is the sum of four phases. Improving any of them improves your total recovery time:
- Detection: Time from failure occurring to alert firing. Faster checks, better thresholds, synthetic monitoring. Most teams lose minutes here without realizing it.
- Acknowledgment: Time from alert to human engagement. On-call schedules, escalation policies, reducing alert noise. If pagers go unanswered, everything else is moot.
- Diagnosis: Time to understand what's wrong. Logs, traces, dashboards, runbooks. Good observability turns 30-minute investigations into 2-minute ones.
- Remediation: Time to fix and verify. Rollbacks, feature flags, automated remediation. The goal is safe, fast recovery, not necessarily fixing the root cause yet.
Where to start: Detection is often the biggest win. If you're finding out about incidents from customers, you're already behind. Monitoring that catches issues in seconds instead of minutes compounds across every incident.
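The four-phase breakdown can be sketched as a simple sum. The per-phase timings below are hypothetical, just to show how the pieces add up to a single incident's recovery time.

```python
# Hypothetical per-phase timings (minutes) for a single incident.
phases = {
    "detect": 4.0,       # failure occurs -> alert fires
    "acknowledge": 3.0,  # alert fires -> human engaged
    "diagnose": 15.0,    # human engaged -> cause understood
    "remediate": 8.0,    # fix applied and verified
}

recovery_minutes = sum(phases.values())
print(f"Total recovery time: {recovery_minutes} min")  # 30.0 min
```

Logging these phases per incident shows you which one dominates, so you invest where it actually helps.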
How to calculate MTTR: formula and example
MTTR = Total downtime / Number of incidents
Say your team had 4 incidents last quarter:
| Incident | Duration |
|---|---|
| Database failover delay | 45 min |
| Bad deploy (rollback) | 12 min |
| DNS propagation issue | 90 min |
| Certificate expiration | 33 min |
MTTR = (45 + 12 + 90 + 33) / 4 = 180 / 4 = 45 minutes
That puts this team in the "elite" category by DORA standards. But averages can be misleading: that 90-minute DNS incident is doing the heavy lifting. Track individual incident durations alongside your MTTR so outliers don't hide behind the mean.
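The calculation above, including the outlier check, is a one-liner with the standard library. This uses the four incidents from the table; `statistics.mean` and `statistics.median` are stdlib functions.

```python
from statistics import mean, median

# The four incidents from the table above (durations in minutes).
incidents = {
    "Database failover delay": 45,
    "Bad deploy (rollback)": 12,
    "DNS propagation issue": 90,
    "Certificate expiration": 33,
}

durations = list(incidents.values())
mttr = mean(durations)  # 180 / 4 = 45, matching the worked example
print(f"MTTR: {mttr} min, median: {median(durations)} min, "
      f"worst: {max(durations)} min")
```

Reporting the median and worst case alongside the mean is what keeps that 90-minute DNS outlier from hiding behind the average.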
How to reduce MTTR for your infrastructure
Every minute of MTTR is a minute your users are affected. Reducing it compounds over time: fewer SLA breaches, less stress on the team, and faster iteration on features instead of fighting fires. Here's where to focus:
Invest in detection first. The fastest way to reduce MTTR is to shrink the gap between "something broke" and "someone knows about it." If your monitoring checks every 5 minutes, you're losing up to 5 minutes before anyone even starts working on the problem. Sub-minute checks with intelligent alerting can cut detection time to seconds.
Write runbooks for common failures. At 3 AM, nobody thinks clearly. A runbook that says "if this alert fires, check X, then restart Y, then verify Z" turns a 30-minute diagnosis into a 5-minute checklist. Keep them short, link them from your alerts, and update them after every incident.
Make rollbacks trivial. If every deployment can be reverted in under a minute, your worst-case MTTR for deploy-related incidents drops to single digits. Feature flags, blue-green deploys, and canary releases all help here. The goal is making "undo" faster than "debug and fix forward."
Track and review your MTTR monthly. You can't improve what you don't measure. Run a monthly review of incidents, group them by root cause, and look for patterns. Are most incidents DNS-related? Invest in DNS redundancy. Are deploys the usual suspect? Improve your CI/CD pipeline. Let the data drive your reliability roadmap.
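The monthly grouping step can be as simple as a tally over tagged incidents. The incident log below is hypothetical; `collections.Counter` is the stdlib tool for this kind of frequency count.

```python
from collections import Counter

# Hypothetical one-month incident log: (root_cause, duration_minutes).
incident_log = [
    ("dns", 90),
    ("deploy", 12),
    ("deploy", 8),
    ("database", 45),
    ("deploy", 20),
]

# Tally incidents by root cause to see where to invest.
by_cause = Counter(cause for cause, _ in incident_log)
print(by_cause.most_common())  # deploys dominate -> improve CI/CD
```

Tagging every incident with a root cause at review time is what makes this pattern-spotting possible.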
Related Resources
- SLA Uptime Calculator: See how much downtime your SLA actually allows and track your error budget
- Server Alerting & Notifications: Reduce detection time with sub-minute health checks and instant alerts
- Custom Monitoring Dashboards: Visualize incident trends and system health at a glance
- Load Average Interpreter: Diagnose server performance issues faster during incidents
Catch issues before your customers do
Start monitoring with fivenines.io