Mean Time To Recovery Calculator

Turn downtime entries into actionable recovery insights. Benchmark mean, median, and tail behavior quickly, and improve reliability by tracking clear targets every week.

MTTR Inputs

Durations give richer percentiles.
All math uses minutes internally.
Used for availability estimate.
Separate values with spaces, commas, or new lines. You can use hh:mm too.
Ignored for hh:mm entries.
Use with care; track raw too.
Example: 5 removes the lowest 5%.
Example: 95 removes the top 5%.
Used to compute compliance rate.
Convert your SLA into a duration.
Used for target gap reporting.

Example Data Table

Sample incidents for a week-long window. Enter the durations list to reproduce.

Incident   Start       End         Duration (min)   Notes
INC-241    Mon 09:10   Mon 09:42   32               Cache stampede
INC-242    Tue 14:05   Tue 14:50   45               DB connection leak
INC-243    Wed 21:11   Wed 21:29   18               Bad deploy rollback
INC-244    Thu 02:03   Thu 04:03   120              Region networking
INC-245    Sat 10:20   Sat 10:58   38               Queue backlog

Tip: Try percentile trimming if a single extreme incident dominates.
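
As an illustration of the tip above, here is a minimal Python sketch of percentile trimming applied to the sample durations. The function names (`percentile`, `percentile_trim`) and the linear-interpolation percentile method are assumptions made for this example; the calculator's own method may differ.

```python
def percentile(sorted_vals, p):
    """Linearly interpolated percentile (p in 0..100) of an already sorted list."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    rank = (len(sorted_vals) - 1) * p / 100
    lo = int(rank)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (rank - lo)


def percentile_trim(durations, lower=0, upper=100):
    """Keep only values between the lower and upper percentile cut-offs."""
    ordered = sorted(durations)
    low_cut = percentile(ordered, lower)
    high_cut = percentile(ordered, upper)
    return [d for d in durations if low_cut <= d <= high_cut]


# Durations (minutes) from the example table.
sample = [32, 45, 18, 120, 38]
trimmed = percentile_trim(sample, upper=95)  # drops the 120-minute incident

raw_mean = sum(sample) / len(sample)
trimmed_mean = sum(trimmed) / len(trimmed)
print(f"raw mean: {raw_mean:.2f} min, trimmed mean: {trimmed_mean:.2f} min")
```

With these five incidents, trimming above the 95th percentile removes only the 120-minute outage, and the mean falls from 50.6 to 33.25 minutes. Reporting both numbers keeps the outlier visible.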

Formula Used

  • MTTR (Mean): MTTR = (Σ recovery durations) / N
  • Median: 50th percentile of sorted durations
  • Tail (P90/P95): 90th and 95th percentiles
  • Availability estimate: Availability = 1 − (Total Downtime / Window Time)
  • SLA compliance: % = (Count where duration ≤ target) / N × 100

All inputs are converted to minutes first, then formatted to your selected output.
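
The formulas above can be sketched in Python as follows. This is an illustrative sketch, not the calculator's actual code: `mttr_summary`, its parameter names, and the linear-interpolation percentile are assumptions chosen for the example.

```python
from statistics import mean, median


def percentile(sorted_vals, p):
    """Linearly interpolated percentile (p in 0..100) of an already sorted list."""
    rank = (len(sorted_vals) - 1) * p / 100
    lo = int(rank)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (rank - lo)


def mttr_summary(durations_min, window_min=None, sla_target_min=None):
    """Summarize recovery durations given in minutes."""
    if not durations_min:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_min)
    n = len(ordered)
    summary = {
        "count": n,
        "mean": mean(ordered),            # MTTR = sum of durations / N
        "median": median(ordered),        # 50th percentile
        "p90": percentile(ordered, 90),
        "p95": percentile(ordered, 95),
    }
    if window_min:
        # Availability = 1 - (total downtime / window time)
        summary["availability"] = 1 - sum(ordered) / window_min
    if sla_target_min is not None:
        met = sum(1 for d in ordered if d <= sla_target_min)
        summary["sla_compliance_pct"] = met / n * 100
    return summary


# Sample week: five incidents, a 7-day window, and a 60-minute target.
result = mttr_summary([32, 45, 18, 120, 38],
                      window_min=7 * 24 * 60,
                      sla_target_min=60)
for key, value in result.items():
    print(key, value)
```

For the sample week this gives a mean of 50.6 minutes, a median of 38 minutes, 80% compliance with a 60-minute target, and an availability of roughly 97.49%.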

How to Use This Calculator

  1. Choose an input mode: list each incident duration, or use totals.
  2. Select your duration units and preferred output format.
  3. Optionally enable trimming to reduce outlier influence.
  4. Add an SLA target to measure compliance across incidents.
  5. Set a measurement window to estimate availability impact.
  6. Press Calculate to view results above the form.
  7. Use Download buttons to export CSV or PDF.

FAQs

1) What does MTTR measure in practice?

It measures the average time to restore service after incidents. It can include detection, mitigation, rollback, and verification time, depending on how you define “recovery” in your incident process.

2) Why should I look at median and P90 too?

The mean can be distorted by rare long outages. Median shows typical recovery, while P90 highlights slow cases that users remember most and teams should prioritize.

3) When should I trim outliers?

Trim only for exploratory analysis or when comparing similar periods. Keep raw values for reporting. If an extreme event is real, it should inform resilience improvements, not disappear.

4) What is IQR trimming?

It removes values below Q1−1.5×IQR or above Q3+1.5×IQR, where IQR = Q3 − Q1. This is a common robust technique for reducing extreme influence in skewed operational datasets.
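
A minimal Python sketch of this rule is shown here. The quartile convention matters for small samples; this example assumes the "inclusive" method of `statistics.quantiles`, which may not match the calculator's choice, and `iqr_trim` is a hypothetical helper name.

```python
from statistics import quantiles


def iqr_trim(durations, k=1.5):
    """Drop values below Q1 - k*IQR or above Q3 + k*IQR."""
    if len(durations) < 4:
        # Assumption: too few points for meaningful quartiles, so keep all.
        return list(durations)
    q1, _, q3 = quantiles(durations, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [d for d in durations if low_fence <= d <= high_fence]


sample = [32, 45, 18, 120, 38]   # minutes, from the example table
kept = iqr_trim(sample)
print(kept)
```

For the sample data, Q1 is 32 and Q3 is 45, so the fences sit at 12.5 and 64.5 minutes and only the 120-minute incident is removed.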

5) How is availability estimated here?

It uses the chosen window and total downtime: 1 − downtime/window. It’s a quick approximation for a single service; complex systems may need weighted or user-impact modeling.

6) What should I use as an SLA target?

Use the recovery time your team commits to internally or externally, like “restore within 60 minutes.” Then track the compliance percentage and investigate misses with postmortems.

7) Can I paste times like 1:30?

Yes. The calculator accepts hh:mm entries in the durations list. Mixed inputs are allowed; hh:mm values are treated as hours and minutes, regardless of the selected unit.
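
The parsing behavior described above could look roughly like this in Python. It is a sketch under assumptions: the function name `parse_durations`, the unit labels, and the error handling are invented for illustration, while the separators, the hh:mm rule, and the conversion to minutes follow the description on this page.

```python
import re

# Hypothetical unit labels; the factor converts a plain number to minutes.
UNIT_TO_MINUTES = {"seconds": 1 / 60, "minutes": 1, "hours": 60}


def parse_durations(text, unit="minutes"):
    """Parse a durations list into minutes.

    Values may be separated by spaces, commas, or new lines.
    hh:mm tokens are always read as hours and minutes; the selected
    unit applies only to plain numbers.
    """
    factor = UNIT_TO_MINUTES[unit]
    minutes = []
    for token in re.split(r"[\s,]+", text.strip()):
        if not token:
            continue
        if ":" in token:
            hours_part, minutes_part = token.split(":")
            minutes.append(int(hours_part) * 60 + int(minutes_part))
        else:
            minutes.append(float(token) * factor)
    return minutes


mixed = parse_durations("32, 45\n1:30 18")
in_hours = parse_durations("2 1:30", unit="hours")
print(mixed, in_hours)
```

In the second call the plain `2` is scaled from hours to 120 minutes, while `1:30` stays at 90 minutes because hh:mm entries ignore the selected unit.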

8) How can I improve MTTR?

Reduce detection time with alerts, improve diagnosis with runbooks, automate rollback, add feature flags, rehearse incident drills, and prioritize the top recurring failure modes.


Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.