| --- |
| title: "DORA - Change Failure Rate" |
| description: > |
| DORA - Change Failure Rate |
| sidebar_position: 29 |
| --- |
| |
| ## What is this metric? |
| The percentage of changes that were made to a code that then resulted in incidents, rollbacks, or any type of production failure. |
| |
| ## Why is it important? |
| Unlike Deployment Frequency and Lead Time for Changes that measure the throughput, Change Failure Rate measures the stability and quality of software delivery. A low CFR reflects a bad end-user experience as the production failure is relatively high. |
| |
| ## Which dashboard(s) does it exist in |
| DORA dashboard. See [live demo](https://grafana-lake.demo.devlake.io/grafana/d/qNo8_0M4z/dora?orgId=1). |
| |
| |
| ## How is it calculated? |
| The number of deployments affected by incidents/total number of deployments. For example, if there are five deployments and two deployments cause one or more incidents, that is a 40% change failure rate. |
| |
|  |
| |
| Below are the benchmarks for different development teams from Google's report. However, it's difficult to tell which group a team falls into when the team's change failure rate is `18%` or `40%`. Therefore, DevLake provides its own benchmarks to address this problem: |
| |
| | Groups | Benchmarks | DevLake Benchmarks | |
| | -----------------| ----------------| -------------------| |
| | Elite performers | 0%-15% | 0%-15% | |
| | High performers | 16%-30% | 16-20% | |
| | Medium performers| 16%-30% | 21%-30% | |
| | Low performers | 16%-30% | > 30% | |
| |
| <p><i>Source: 2021 Accelerate State of DevOps, Google</i></p> |
| |
| <b>Data Sources Required</b> |
| |
| This metric relies on: |
| - `Deployments` collected in one of the following ways: |
| - Open APIs of Jenkins, GitLab, GitHub, etc. |
| - Webhook for general CI tools. |
| - Releases and PR/MRs from GitHub, GitLab APIs, etc. |
| - `Incidents` collected in one of the following ways: |
| - Issue tracking tools such as Jira, TAPD, GitHub, etc. |
| - Incident or Service Monitoring tools such as PagerDuty, ServiceNow, etc. |
| |
| <b>Transformation Rules Required</b> |
| |
| This metric relies on: |
| - Deployment configuration in Jenkins, GitLab, GitHub or BitBucket transformation rules to let DevLake know which CI builds/jobs can be regarded as `Deployments`. |
| - Incident configuration in Jira, GitHub or TAPD transformation rules to let DevLake know which issues can be regarded as `Incidents`. |
| |
| <b>SQL Queries</b> |
| |
| If you want to measure the monthly trend of change failure rate, run the following SQL in Grafana. |
| |
|  |
| |
| ``` |
| with _deployments as ( |
| -- get the deployments in each month |
| SELECT |
| date_format(ct.finished_date,'%y/%m') as month, |
| ct.id AS deployment_id |
| FROM |
| cicd_tasks ct |
| join project_mapping pm on ct.cicd_scope_id = pm.row_id |
| WHERE |
| pm.project_name in ($project) |
| and type = 'DEPLOYMENT' |
| and result = 'SUCCESS' |
| and environment = 'PRODUCTION' |
| ), |
| |
| _incidents as ( |
| -- get the incidents (caused by deployments) that are created within the selected time period in the top-right corner |
| SELECT |
| date_format(i.created_date,'%y/%m') as month, |
| i.id AS incident_id, |
| pim.deployment_id |
| FROM |
| issues i |
| join project_issue_metrics pim on i.id = pim.id |
| WHERE |
| pim.project_name in ($project) and |
| i.type = 'INCIDENT' |
| ), |
| |
| _calendar_months as( |
| -- deal with the month with no incidents |
| SELECT date_format(CAST((SYSDATE()-INTERVAL (month_index) MONTH) AS date), '%y/%m') as month |
| FROM ( SELECT 0 month_index |
| UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 |
| UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 |
| UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 |
| UNION ALL SELECT 10 UNION ALL SELECT 11 |
| ) month_index |
| WHERE (SYSDATE()-INTERVAL (month_index) MONTH) > SYSDATE()-INTERVAL 6 MONTH |
| ), |
| |
| _deployment_failures as ( |
| -- calculate the number of incidents caused by each deployment |
| SELECT |
| distinct |
| cm.month, |
| d.deployment_id, |
| count(distinct i.incident_id) as incident_count |
| FROM |
| _calendar_months cm |
| left join _deployments d on cm.month = d.month |
| left join _incidents i on d.deployment_id = i.deployment_id |
| GROUP BY 1,2 |
| ) |
| |
| SELECT |
| month, |
| case when |
| count(deployment_id) is null then null |
| else count(case when incident_count = 0 then null else incident_count end)/count(deployment_id) end as change_failure_rate |
| FROM _deployment_failures |
| GROUP BY 1 |
| ORDER BY 1 |
| ``` |
| |
| If you want to measure in which category your team falls into, run the following SQL in Grafana. |
| |
|  |
| |
| ``` |
| with _deployments as ( |
| -- get the deployment deployed within the selected time period in the top-right corner |
| SELECT |
| ct.id AS deployment_id, |
| ct.finished_date as deployment_finished_date |
| FROM |
| cicd_tasks ct |
| join project_mapping pm on ct.cicd_scope_id = pm.row_id |
| WHERE |
| pm.project_name in ($project) |
| and type = 'DEPLOYMENT' |
| and result = 'SUCCESS' |
| and environment = 'PRODUCTION' |
| and $__timeFilter(finished_date) |
| ), |
| |
| _incident_caused_by_deployments as ( |
| -- get the incidents (caused by deployments) that are created within the selected time period in the top-right corner |
| SELECT |
| i.id AS incident_id, |
| pim.deployment_id |
| FROM |
| issues i |
| join project_issue_metrics pim on i.id = pim.id |
| WHERE |
| pim.project_name in ($project) and |
| i.type = 'INCIDENT' |
| and $__timeFilter(i.created_date) |
| ), |
| |
| _deployment_failures as ( |
| -- calculate the number of incidents caused by each deployment |
| SELECT |
| distinct |
| d.deployment_id, |
| d.deployment_finished_date, |
| count(distinct i.incident_id) as incident_count |
| FROM |
| _deployments d |
| left join _incident_caused_by_deployments i on d.deployment_id = i.deployment_id |
| GROUP BY 1,2 |
| ), |
| |
| _change_failure_rate as ( |
| SELECT |
| case when count(deployment_id) is null then null |
| else count(case when incident_count = 0 then null else 1 end)/count(deployment_id) end as change_failure_rate |
| FROM |
| _deployment_failures |
| ) |
| |
| SELECT |
| case |
| when change_failure_rate <= .15 then "0-15%" |
| when change_failure_rate <= .20 then "16%-20%" |
| when change_failure_rate <= .30 then "21%-30%" |
| else "> 30%" |
| end as change_failure_rate |
| FROM |
| _change_failure_rate |
| ``` |
| |
| ## How to improve? |
| - Add unit tests for all new feature |
| - "Shift left", start QA early and introduce more automated tests |
| - Enforce code review if it's not strictly executed |