The time it takes to restore service after a service incident, rollback, or any other type of production failure.
This metric is essential for measuring your team's disaster control capability and the robustness of the software.
This metric is shown in the DORA dashboard.
MTTR = total incident age (in hours) / number of incidents.
For example, if three incidents happened in the given date range, lasting 1 hour, 2 hours, and 3 hours, your MTTR is (1 + 2 + 3) / 3 = 2 hours.
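The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `mean_time_to_restore` and the choice to return `None` when there are no incidents are assumptions, not part of DevLake.

```python
def mean_time_to_restore(incident_hours):
    """Return the mean incident duration in hours, or None if there are no incidents."""
    if not incident_hours:
        return None  # assumption: no incidents means the metric is undefined
    return sum(incident_hours) / len(incident_hours)


# Three incidents lasting 1, 2 and 3 hours
print(mean_time_to_restore([1, 2, 3]))  # 2.0
```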
Below are the benchmarks for different development teams from Google's report. However, it is difficult to tell which group a team falls into when its median time to restore service is between one week and six months. Therefore, DevLake provides its own benchmarks to address this problem:
Groups | Benchmarks | DevLake Benchmarks |
---|---|---|
Elite performers | Less than one hour | Less than one hour |
High performers | Less than one day | Less than one day |
Medium performers | Between one day and one week | Between one day and one week |
Low performers | More than six months | More than one week |
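To make the gap in the first benchmark column concrete, here is a small Python sketch of both columns as lookup functions. The function names, the handling of exact boundary values, and the approximation of "six months" as 183 days are all assumptions made for illustration; the table itself does not specify them.

```python
HOUR = 1
DAY = 24 * HOUR
WEEK = 7 * DAY
SIX_MONTHS = 183 * DAY  # assumption: "six months" approximated as 183 days


def google_group(hours):
    """Classify by the 'Benchmarks' column; returns None when no group matches."""
    if hours < HOUR:
        return "Elite performers"
    if hours < DAY:
        return "High performers"
    if hours < WEEK:
        return "Medium performers"
    if hours > SIX_MONTHS:
        return "Low performers"
    return None  # between one week and six months: no group applies


def devlake_group(hours):
    """Classify by the 'DevLake Benchmarks' column; every value gets a group."""
    if hours < HOUR:
        return "Elite performers"
    if hours < DAY:
        return "High performers"
    if hours < WEEK:
        return "Medium performers"
    return "Low performers"


one_month = 30 * DAY
print(google_group(one_month))   # None
print(devlake_group(one_month))  # Low performers
```

A median of about one month falls into no group under the first column, while the DevLake column places it with the low performers.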
Data Sources Required
This metric relies on:
Deployments
Incidents
Transformation Rules Required
This metric relies on transformation rules for:
Deployments
Incidents
SQL Queries
To measure the monthly trend of median time to restore service, run the following SQL in Grafana.
    with _incidents as (
      -- get the incident count each month
      SELECT
        date_format(created_date,'%y/%m') as month,
        cast(lead_time_minutes as signed) as lead_time_minutes
      FROM issues
      WHERE type = 'INCIDENT'
    ),
    _find_median_mttr_each_month as (
      SELECT x.*
      from _incidents x join _incidents y on x.month = y.month
      WHERE x.lead_time_minutes is not null and y.lead_time_minutes is not null
      GROUP BY x.month, x.lead_time_minutes
      HAVING SUM(SIGN(1-SIGN(y.lead_time_minutes-x.lead_time_minutes)))/COUNT(*) > 0.5
    ),
    _find_mttr_rank_each_month as (
      SELECT *,
        rank() over(PARTITION BY month ORDER BY lead_time_minutes) as _rank
      FROM _find_median_mttr_each_month
    ),
    _mttr as (
      SELECT month, lead_time_minutes as med_time_to_resolve
      from _find_mttr_rank_each_month
      WHERE _rank = 1
    ),
    _calendar_months as(
      -- deal with the month with no incidents
      SELECT date_format(CAST((SYSDATE()-INTERVAL (month_index) MONTH) AS date), '%y/%m') as month
      FROM (
        SELECT 0 month_index
        UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
        UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
        UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
        UNION ALL SELECT 10 UNION ALL SELECT 11
      ) month_index
      WHERE (SYSDATE()-INTERVAL (month_index) MONTH) > SYSDATE()-INTERVAL 6 MONTH
    )
    SELECT
      cm.month,
      case when m.med_time_to_resolve is null then 0 else m.med_time_to_resolve/60 end as med_time_to_resolve_in_hour
    FROM _calendar_months cm
    left join _mttr m on cm.month = m.month
    ORDER BY 1
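The self-join with the `HAVING SUM(SIGN(1-SIGN(...)))/COUNT(*) > 0.5` clause can be hard to read. For each candidate lead time `x`, the `SIGN` expression counts the incidents in the same month whose lead time is less than or equal to `x`, and the query then keeps the smallest `x` for which that share exceeds one half. The Python sketch below mirrors that logic on in-memory data; the function names and the sample incidents are hypothetical and only serve to show the behaviour.

```python
from collections import defaultdict


def upper_median(values):
    """Smallest value x such that more than half of the values are <= x."""
    n = len(values)
    candidates = [
        x for x in set(values)
        if sum(1 for y in values if y <= x) / n > 0.5
    ]
    return min(candidates) if candidates else None


def monthly_median_hours(incidents, months):
    """incidents: (month, lead_time_minutes) pairs; None lead times are skipped.
    months: calendar months to report; months without incidents get 0."""
    by_month = defaultdict(list)
    for month, minutes in incidents:
        if minutes is not None:
            by_month[month].append(minutes)
    result = {}
    for month in months:
        median_minutes = upper_median(by_month.get(month, []))
        result[month] = 0 if median_minutes is None else median_minutes / 60
    return result


# Hypothetical sample data: (month, lead_time_minutes)
incidents = [
    ("23/01", 60), ("23/01", 120), ("23/01", 180),
    ("23/02", 60), ("23/02", 120), ("23/02", 180), ("23/02", 240),
    ("23/03", None),
]
print(monthly_median_hours(incidents, ["23/01", "23/02", "23/03"]))
# {'23/01': 2.0, '23/02': 3.0, '23/03': 0}
```

Note that with an even number of incidents this approach returns the upper of the two middle values (180 minutes for February above) rather than their average.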
To find out which benchmark category your team falls into, run the following SQL in Grafana.
    with _incidents as (
      -- get the incidents created within the selected time period in the top-right corner
      SELECT
        cast(lead_time_minutes as signed) as lead_time_minutes
      FROM issues
      WHERE type = 'INCIDENT' and $__timeFilter(created_date)
    ),
    _median_mttr as (
      SELECT x.lead_time_minutes as med_time_to_resolve
      from _incidents x, _incidents y
      WHERE x.lead_time_minutes is not null and y.lead_time_minutes is not null
      GROUP BY x.lead_time_minutes
      HAVING SUM(SIGN(1-SIGN(y.lead_time_minutes-x.lead_time_minutes)))/COUNT(*) > 0.5
      LIMIT 1
    )
    SELECT
      case
        WHEN med_time_to_resolve < 60 then "Less than one hour"
        WHEN med_time_to_resolve < 24 * 60 then "Less than one Day"
        WHEN med_time_to_resolve < 7 * 24 * 60 then "Between one day and one week"
        ELSE "More than one week"
      END as med_time_to_resolve
    FROM _median_mttr