Mean Time To Recovery Calculator

MTTR Inputs

Input mode

Durations give richer percentiles.

Output format

All math uses minutes internally.

Measurement window (hours)

Used for availability estimate.

Incident durations

Separate values with spaces, commas, or new lines. You can use hh:mm too.

Duration unit

Ignored for hh:mm entries.

Outlier handling

Use with care; track raw too.

Trim low percentile

Example: 5 removes the lowest 5%.

Trim high percentile

Example: 95 removes the top 5%.

SLA target (optional)

Used to compute compliance rate.

SLA target unit

Convert your SLA into a duration.

Availability target (%)

Used for target gap reporting.

Example Data Table

Sample incidents for a week-long window. To reproduce the results, enter the values from the Duration (min) column as the durations list.

Incident	Start	End	Duration (min)	Notes
INC-241	Mon 09:10	Mon 09:42	32	Cache stampede
INC-242	Tue 14:05	Tue 14:50	45	DB connection leak
INC-243	Wed 21:11	Wed 21:29	18	Bad deploy rollback
INC-244	Thu 02:03	Thu 04:03	120	Region networking
INC-245	Sat 10:20	Sat 10:58	38	Queue backlog

Tip: Try percentile trimming if a single extreme incident dominates.
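As a rough illustration of the tip above, the sketch below trims the sample durations at the 5th and 95th percentiles. The function name `trim_by_percentile`, the whole-number percentile inputs, and the "inclusive" interpolation method are assumptions for this example; the calculator's own cutoff rule may differ.

```python
from statistics import mean, quantiles

def trim_by_percentile(durations, low_pct=5, high_pct=95):
    """Keep values between the low and high percentile cutoffs (whole-number percentiles)."""
    if len(durations) < 2:
        return list(durations)
    # cuts[k - 1] is the k-th percentile, interpolated over the observed range.
    cuts = quantiles(durations, n=100, method="inclusive")
    low_cut = cuts[low_pct - 1] if low_pct > 0 else min(durations)
    high_cut = cuts[high_pct - 1] if high_pct < 100 else max(durations)
    return [d for d in durations if low_cut <= d <= high_cut]

sample = [32, 45, 18, 120, 38]  # minutes, from the example table
trimmed = trim_by_percentile(sample)
print(trimmed, round(mean(trimmed), 2))
```

With only five incidents, trimming both ends drops the 18-minute incident as well as the 120-minute one, which is one reason to keep the raw figures alongside the trimmed ones.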

Formula Used

MTTR (Mean): MTTR = (Σ recovery durations) / N, where N is the number of incidents counted
Median: 50th percentile of sorted durations
Tail (P90/P95): 90th and 95th percentiles
Availability estimate: Availability = 1 − (Total Downtime / Window Time)
SLA compliance: Compliance (%) = (Count where duration ≤ target) / N × 100

All inputs are converted to minutes first, then formatted to your selected output.
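The mean, median, and tail formulas above can be sketched in a few lines of Python. The helper names `percentile` and `summarize` are hypothetical, and the linear-interpolation percentile rule is an assumption, since the page does not state which percentile method the calculator uses.

```python
import math

def percentile(values, p):
    """Return the p-th percentile (0-100) by linear interpolation between ranks."""
    if not values:
        raise ValueError("need at least one duration")
    ordered = sorted(values)
    rank = (p / 100) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

def summarize(durations_min):
    """Compute MTTR (mean), median, P90, and P95 for durations given in minutes."""
    return {
        "mttr": sum(durations_min) / len(durations_min),
        "median": percentile(durations_min, 50),
        "p90": percentile(durations_min, 90),
        "p95": percentile(durations_min, 95),
    }

sample = [32, 45, 18, 120, 38]  # minutes, from the example table
stats = summarize(sample)
print(stats)
```

For the sample table, the mean is 50.6 minutes while the median is 38 minutes, which shows how the single 120-minute incident pulls the average upward.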

How to Use This Calculator

Choose an input mode: list each incident duration, or use totals.
Select your duration units and preferred output format.
Optionally enable trimming to reduce outlier influence.
Add an SLA target to measure compliance across incidents.
Set a measurement window to estimate availability impact.
Press Calculate to view results above the form.
Use the Download buttons to export the results as CSV or PDF.

FAQs

1) What does MTTR measure in practice?

It measures the average time to restore service after an incident. Depending on how your incident process defines “recovery,” it can include detection, mitigation, rollback, and verification time.

2) Why should I look at median and P90 too?

The mean can be distorted by rare long outages. Median shows typical recovery, while P90 highlights slow cases that users remember most and teams should prioritize.

3) When should I trim outliers?

Trim only for exploratory analysis or when comparing similar periods. Keep raw values for reporting. If an extreme event is real, it should inform resilience improvements, not disappear.

4) What is IQR trimming?

It removes values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR, where IQR = Q3 − Q1 is the interquartile range. This is a common robust technique for reducing the influence of extreme values in skewed operational datasets.
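A minimal sketch of this rule in Python follows. The function name `trim_iqr` is hypothetical, and the quartile method (`method="inclusive"` from the standard library) is an assumption; other quartile conventions can move the fences slightly.

```python
from statistics import quantiles

def trim_iqr(durations, k=1.5):
    """Drop values below Q1 - k*IQR or above Q3 + k*IQR."""
    if len(durations) < 2:
        return list(durations)
    q1, _median, q3 = quantiles(durations, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [d for d in durations if low_fence <= d <= high_fence]

sample = [32, 45, 18, 120, 38]  # minutes, from the example table
print(trim_iqr(sample))
```

On the sample table, Q1 is 32 and Q3 is 45, so the fences sit at 12.5 and 64.5 minutes and only the 120-minute incident is removed.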

5) How is availability estimated here?

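It uses the chosen window and total downtime: Availability = 1 − (total downtime / window time). The window is entered in hours and converted to minutes so both terms share a unit. It’s a quick approximation for a single service; complex systems may need weighted or user-impact modeling.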
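The estimate can be reproduced with the sketch below, using the sample table and a 168-hour (one-week) window. The function name `availability`, the 99.9% target, and the definition of the gap as estimate minus target in percentage points are assumptions for illustration.

```python
def availability(total_downtime_min, window_hours):
    """Return availability as a fraction: 1 - downtime / window, both in minutes."""
    window_min = window_hours * 60
    if window_min <= 0:
        raise ValueError("window must be positive")
    return 1 - total_downtime_min / window_min

sample = [32, 45, 18, 120, 38]  # minutes, from the example table
avail = availability(sum(sample), window_hours=168)

target_pct = 99.9  # hypothetical availability target
gap_points = avail * 100 - target_pct  # negative means below target

print(f"availability: {avail * 100:.2f}%  gap: {gap_points:+.2f} points")
```

Here the 253 minutes of downtime against a 10,080-minute window give roughly 97.49% availability.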

6) What should I use as an SLA target?

Use the recovery time your team commits to internally or externally, like “restore within 60 minutes.” Then track the compliance percentage and investigate misses with postmortems.
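The compliance percentage for such a target can be sketched as follows. The function name `sla_compliance` and the unit names in `UNIT_TO_MINUTES` are assumptions; the page does not list the calculator's actual unit options.

```python
# Hypothetical unit names; the target is converted to minutes before comparing.
UNIT_TO_MINUTES = {"seconds": 1 / 60, "minutes": 1, "hours": 60}

def sla_compliance(durations_min, target, target_unit="minutes"):
    """Return the percentage of incidents recovered within the SLA target."""
    if not durations_min:
        raise ValueError("need at least one duration")
    target_min = target * UNIT_TO_MINUTES[target_unit]
    met = sum(1 for d in durations_min if d <= target_min)
    return met * 100 / len(durations_min)

sample = [32, 45, 18, 120, 38]  # minutes, from the example table
print(sla_compliance(sample, 60))          # target of 60 minutes
print(sla_compliance(sample, 1, "hours"))  # same target expressed in hours
```

With a 60-minute target, four of the five sample incidents comply, so the rate is 80%; the 120-minute incident is the miss to investigate.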

7) Can I paste times like 1:30?

Yes. The calculator accepts hh:mm entries in the durations list. Mixed inputs are allowed; hh:mm values are treated as hours and minutes, regardless of the selected unit.
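One way such parsing could work is sketched below: split on spaces, commas, or new lines, read hh:mm tokens as hours and minutes, and scale everything else by the selected unit. The function name `parse_durations` and the unit names are assumptions, not the calculator's actual code.

```python
import re

# Hypothetical unit names; everything is normalized to minutes.
UNIT_TO_MINUTES = {"seconds": 1 / 60, "minutes": 1, "hours": 60}

def parse_durations(text, unit="minutes"):
    """Parse a pasted durations list into minutes."""
    factor = UNIT_TO_MINUTES[unit]
    minutes = []
    for token in re.split(r"[,\s]+", text.strip()):
        if not token:
            continue  # empty input or stray separators
        if ":" in token:
            # hh:mm entries ignore the selected unit.
            hours, mins = token.split(":")
            minutes.append(int(hours) * 60 + int(mins))
        else:
            minutes.append(float(token) * factor)
    return minutes

print(parse_durations("32, 45 18\n2:00,38"))
print(parse_durations("1.5 1:30", unit="hours"))
```

In the second call, both entries come out as 90 minutes: the plain number is scaled by the hours unit, while the hh:mm entry is read directly.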

8) How can I improve MTTR?

Reduce detection time with alerts, improve diagnosis with runbooks, automate rollback, add feature flags, rehearse incident drills, and prioritize the top recurring failure modes.