MTTR Inputs
Example Data Table
Sample incidents for a one-week window. To reproduce the results, enter the values from the Duration (min) column in the durations list.
| Incident | Start | End | Duration (min) | Notes |
|---|---|---|---|---|
| INC-241 | Mon 09:10 | Mon 09:42 | 32 | Cache stampede |
| INC-242 | Tue 14:05 | Tue 14:50 | 45 | DB connection leak |
| INC-243 | Wed 21:11 | Wed 21:29 | 18 | Bad deploy rollback |
| INC-244 | Thu 02:03 | Thu 04:03 | 120 | Region networking |
| INC-245 | Sat 10:20 | Sat 10:58 | 38 | Queue backlog |
Formula Used
- MTTR (Mean): MTTR = (Σ recovery durations) / N, where N is the number of incidents
- Median: 50th percentile of sorted durations
- Tail (P90/P95): 90th and 95th percentiles
- Availability estimate: Availability = 1 − (Total Downtime / Window Time)
- SLA compliance: % = (Count where duration ≤ target) / N × 100
All inputs are converted to minutes first, then displayed in your selected output format.
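The formulas above can be sketched in a few lines of Python using the five sample durations from the example table. This is a minimal illustration, not the calculator's actual code: the `percentile` helper uses linear interpolation between closest ranks, which is an assumption (the calculator may use a different percentile method), and the 60-minute SLA target is a hypothetical value chosen for the example.

```python
from statistics import mean, median


def percentile(values, p):
    """Return the p-th percentile using linear interpolation (assumed method)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return float(ordered[0])
    rank = (p / 100) * (len(ordered) - 1)  # fractional index into the sorted list
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])


durations_min = [32, 45, 18, 120, 38]  # Duration (min) column of the sample table
window_min = 7 * 24 * 60               # one-week measurement window, in minutes
sla_target_min = 60                    # hypothetical SLA target

mttr = mean(durations_min)
med = median(durations_min)
p90 = percentile(durations_min, 90)
p95 = percentile(durations_min, 95)
availability = 1 - sum(durations_min) / window_min
within_target = sum(1 for d in durations_min if d <= sla_target_min)
sla_compliance = 100 * within_target / len(durations_min)

print(f"MTTR: {mttr:.1f} min, median: {med} min")
print(f"P90: {p90:.1f} min, P95: {p95:.1f} min")
print(f"Availability: {availability:.4%}, SLA compliance: {sla_compliance:.0f}%")
```

For the sample data the mean is 50.6 minutes while the median is 38 minutes, because the single 120-minute incident pulls the mean upward. Tail percentiles on only five values depend heavily on the interpolation method, so treat them as indicative.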
How to Use This Calculator
- Choose an input mode: list each incident duration, or use totals.
- Select your duration units and preferred output format.
- Optionally enable trimming to reduce outlier influence.
- Add an SLA target to measure compliance across incidents.
- Set a measurement window to estimate availability impact.
- Press Calculate to view results above the form.
- Use Download buttons to export CSV or PDF.
FAQs
1) What does MTTR measure in practice?
It measures the average time to restore service after an incident. Depending on how your incident process defines “recovery,” it can include detection, mitigation, rollback, and verification time.
2) Why should I look at median and P90 too?
The mean can be distorted by rare long outages. Median shows typical recovery, while P90 highlights slow cases that users remember most and teams should prioritize.
3) When should I trim outliers?
Trim only for exploratory analysis or when comparing similar periods. Keep raw values for reporting. If an extreme event is real, it should inform resilience improvements, not disappear.
4) What is IQR trimming?
It removes values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR, where Q1 and Q3 are the 25th and 75th percentiles and IQR = Q3 − Q1. This is a common robust technique for limiting the influence of extreme values in skewed operational datasets.
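As a rough sketch of the rule, the Python below trims the sample durations with the 1.5×IQR fences. The `iqr_trim` and `percentile` names are hypothetical, and the quartiles are computed by linear interpolation, which is an assumption; other quartile conventions can move the fences slightly.

```python
def percentile(values, p):
    """Return the p-th percentile using linear interpolation (assumed method)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return float(ordered[0])
    rank = (p / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])


def iqr_trim(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]; input order is preserved."""
    q1 = percentile(values, 25)
    q3 = percentile(values, 75)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if low_fence <= v <= high_fence]


durations_min = [32, 45, 18, 120, 38]  # sample durations in minutes
trimmed = iqr_trim(durations_min)
trimmed_mean = sum(trimmed) / len(trimmed)

print("Kept:", trimmed)
print("Trimmed mean (min):", trimmed_mean)
```

With these quartiles (Q1 = 32, Q3 = 45) the fences are 12.5 and 64.5 minutes, so only the 120-minute incident is dropped and the trimmed mean falls to 33.25 minutes.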
5) How is availability estimated here?
It uses the chosen window and total downtime: 1 − downtime/window. It’s a quick approximation for a single service; complex systems may need weighted or user-impact modeling.
6) What should I use as an SLA target?
Use the recovery time your team commits to internally or externally, like “restore within 60 minutes.” Then track the compliance percentage and investigate misses with postmortems.
7) Can I paste times like 1:30?
Yes. The calculator accepts hh:mm entries in the durations list. Mixed inputs are allowed; hh:mm values are treated as hours and minutes, regardless of the selected unit.
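The parsing behavior described here could look roughly like the following Python sketch. The `to_minutes` function and the unit names in `UNIT_TO_MINUTES` are hypothetical, and the calculator's real parser may handle edge cases (seconds fields, negative values, blank entries) differently.

```python
# Hypothetical unit names mapped to their length in minutes.
UNIT_TO_MINUTES = {"seconds": 1 / 60, "minutes": 1, "hours": 60}


def to_minutes(entry, unit="minutes"):
    """Convert one durations-list entry to minutes.

    Entries containing ':' are read as hh:mm regardless of the selected unit;
    plain numbers are interpreted in the selected unit.
    """
    text = entry.strip()
    if ":" in text:
        hours, minutes = text.split(":")
        return int(hours) * 60 + int(minutes)
    return float(text) * UNIT_TO_MINUTES[unit]


# Mixed input with "hours" selected: the hh:mm entry ignores the unit.
entries = ["1:30", "2", "0:18"]
parsed = [to_minutes(e, unit="hours") for e in entries]
print(parsed)
```

Here `"1:30"` becomes 90 minutes and `"0:18"` becomes 18 minutes even though hours is selected, while the plain `"2"` is read as 2 hours, or 120 minutes.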
8) How can I improve MTTR?
Reduce detection time with alerts, improve diagnosis with runbooks, automate rollback, add feature flags, rehearse incident drills, and prioritize the top recurring failure modes.