It goes without saying that bugs have a significant impact on company growth. When products do not work as expected, customers get frustrated, and they leave for a competing product or service. Recently, Atlassian's outage reignited the discussion about how long it takes a company to acknowledge issues and meet its Service Level Agreements (SLAs) with clients. Failing to do so can lead to irreversible damage, even for a well-established business.
The right metrics can be decisive in preventing this situation. In this article, we will walk you through how to use data to improve your Mean Time To Restore (MTTR): from identifying and investigating an issue, to digging deeper into its causes, and finally drawing up an action plan and aligning with your team.
Step 1: Identify and Investigate
The first step is to recognize that problems are happening. This realization can come from customers, your engineering team, or your gut feeling telling you something is wrong. As an engineering leader, your job is to quickly identify what is not right, figure out the causes, and draw up solutions.
So, you start by investigating the number of bugs over time and notice that it has increased by 53% in recent weeks.
👉 When classified by priority, you see that high and highest priority bugs are the ones that have increased the most, which is alarming.
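If you want to reproduce this kind of check outside a dashboard, here is a minimal sketch of how it could be done. The bug records, the period labels, and the helper names are all invented for illustration; they are not real data or part of any tool's API.

```python
from collections import Counter

# Hypothetical bug records as (period, priority); made up for illustration.
bugs = [
    ("previous", "high"), ("previous", "medium"), ("previous", "low"),
    ("previous", "medium"),
    ("current", "high"), ("current", "high"), ("current", "highest"),
    ("current", "medium"), ("current", "medium"), ("current", "low"),
]

def count_by_priority(records, period):
    """Count bugs per priority for one period."""
    return Counter(priority for p, priority in records if p == period)

def percent_change(before, after):
    """Relative change in percent; None when there is no baseline to compare to."""
    if before == 0:
        return None
    return (after - before) / before * 100

previous = count_by_priority(bugs, "previous")
current = count_by_priority(bugs, "current")

# Overall change in bug count between the two periods.
total_change = percent_change(sum(previous.values()), sum(current.values()))

# The same change, broken down by priority.
change_by_priority = {
    priority: percent_change(previous[priority], current[priority])
    for priority in sorted(set(previous) | set(current))
}
```

A priority with no bugs in the previous period has no meaningful percentage, so the sketch returns `None` for it rather than guessing.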
Step 2: Dig Deeper
Mean Time To Restore
The number of bugs doesn't tell the whole story, so you check how long it is taking to restore service, that is, your MTTR (Mean Time To Restore). You see it has tripled in recent weeks.
However, to really understand why that is the case, you have to go deeper into the metrics...
💡 MTTR depends on your Mean Time To Acknowledge (MTTA) and Mean Time To Repair (MTTRepair): the time it takes to recognize there is an issue, and the time it takes to actually fix it.
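To make the relationship concrete, here is a small sketch under one simplifying assumption: restore time is measured from when a bug is created to when it is resolved, so it splits cleanly into an acknowledge phase and a repair phase. The incident timestamps are hypothetical, and your own tooling may define these boundaries differently.

```python
from datetime import datetime, timedelta

# Hypothetical incidents as (created, acknowledged, resolved) timestamps.
incidents = [
    (datetime(2022, 5, 2, 9, 0), datetime(2022, 5, 2, 10, 0), datetime(2022, 5, 2, 13, 0)),
    (datetime(2022, 5, 3, 9, 0), datetime(2022, 5, 3, 12, 0), datetime(2022, 5, 3, 17, 0)),
]

def mean_duration(durations):
    """Average a list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)

# Time from creation to acknowledgement.
mtta = mean_duration([ack - created for created, ack, _ in incidents])
# Time from acknowledgement to resolution.
mtt_repair = mean_duration([resolved - ack for _, ack, resolved in incidents])
# Time from creation to resolution.
mttr = mean_duration([resolved - created for created, _, resolved in incidents])
```

Under this assumption, and as long as all three means are taken over the same set of incidents, MTTR equals MTTA plus MTTRepair, which is why a jump in either component shows up directly in MTTR.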
Analysing MTTR, MTTA and MTTRepair
The MTTR is 3x higher than usual, so you look at your MTTA and MTTRepair by bug priority and realize that the time it takes to acknowledge high priority issues is four weeks, which is way above normal and above your SLA. The time to repair high priority bugs is three days, which is also too long.
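A per-priority breakdown like this one could be sketched as follows. The durations and the SLA targets are hypothetical numbers chosen for illustration, not the figures from this scenario or from any real agreement.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical bugs as (priority, hours_to_acknowledge, hours_to_repair).
bugs = [
    ("highest", 2, 10), ("highest", 4, 14),
    ("high", 30, 60), ("high", 50, 84),
    ("medium", 20, 40),
]

# Hypothetical SLA targets for time to acknowledge, in hours.
ack_sla_hours = {"highest": 4, "high": 24, "medium": 72}

def metrics_by_priority(records):
    """Group bugs by priority and average both phases for each group."""
    grouped = defaultdict(list)
    for priority, ack, repair in records:
        grouped[priority].append((ack, repair))
    return {
        priority: {
            "mtta": mean(a for a, _ in rows),
            "mtt_repair": mean(r for _, r in rows),
        }
        for priority, rows in grouped.items()
    }

metrics = metrics_by_priority(bugs)

# Priorities whose average time to acknowledge exceeds the assumed SLA target.
breaches = sorted(p for p, m in metrics.items() if m["mtta"] > ack_sla_hours[p])
```

Comparing each priority against its own target, rather than against a single global threshold, keeps a slow low-priority queue from hiding an SLA breach on the bugs that matter most.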
Step 3: Find the Causes
Now you know what is happening, but you still need to figure out where the root cause is. You need to see if this is linked to a specific team, so you look at a team-by-team comparison of the metrics.
💡 You discover your engineering team is taking longer than usual to acknowledge issues. This is the main cause of your MTTR increase.
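One way to run a team-by-team comparison is to measure each team against its own baseline, so that teams with naturally different workloads remain comparable. The sketch below assumes you can export hours-to-acknowledge samples per team for a baseline period and a recent period; the team names and numbers are hypothetical.

```python
from statistics import mean

# Hypothetical hours-to-acknowledge samples per team, baseline vs recent period.
ack_hours = {
    "team_a": {"baseline": [4, 6, 5], "recent": [5, 6, 4]},
    "team_b": {"baseline": [5, 7, 6], "recent": [20, 28, 24]},
}

def mtta_ratio(samples):
    """How many times slower (or faster) the recent MTTA is than the baseline."""
    return mean(samples["recent"]) / mean(samples["baseline"])

ratios = {team: mtta_ratio(samples) for team, samples in ack_hours.items()}

# The team whose time to acknowledge has degraded the most.
slowest_team = max(ratios, key=ratios.get)
```

A ratio close to 1 means a team is acknowledging issues about as fast as usual; a ratio well above 1 points to where the conversation in the next step should start.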
Step 4: Align, Act and Measure
Now you have all the information you need to gather your team and discuss how to improve.
Make sure you align with your team and set the right goals for solving this issue. These should be:
- Reduce the time to acknowledge high and highest priority issues
- Improve overall MTTRepair
- Bring down MTTR for customer-facing bugs
By getting your MTTR back on track, you will have a less buggy product, positively impacting the end customer. You can use this process to improve continuously, not just when it comes to bugs, but in other aspects of your delivery pipeline. We covered the process for continuous improvement here.
Ready to identify those customer-facing bugs? Find out how Athenian can help your engineering org.