In my last couple of posts, I talked about the 7 Mental Models for engineering leaders and the 5 Pillars that Engineering Leaders should always have on their minds. Both of these come from the talk I recently hosted with Point Nine Capital - where I spoke about what makes great engineering leaders at different stages of the company.
To finish off this series, I will walk you through The Engineering Leader’s Process for Continuous Improvement, from identifying what you need to improve to measuring impact.
This process came about from observing what the best engineering leaders we know are doing. Of course, many of them are going through this process without a definition or a set framework. But we kept seeing a pattern repeatedly with the most effective and high-performing engineering orgs. So we decided to put it on paper.
By the end of this article, I hope you understand that being an engineering leader is about creating a culture of continuous improvement, which means doing an excellent job providing a great developer experience and delivering impact to your end-user. I will also present the process you can use to achieve this goal, with a clear example of solving a common issue in your engineering org.
As an engineering leader, you have two responsibilities: providing a great developer experience and delivering impact to your end-user.
These two are tightly linked, and your job is essentially to create a culture of continuous improvement.
💡 Better developer experience is something you measure internally. Great developer experience leads to a bigger impact on the end-user, which is externally judged.
The mental model that I want to leave you with is a process that I believe every engineering leader would benefit from running regularly. And in many cases, you already are...
💡 This is a cycle, a continuous loop. If you look at your day, are you doing this? Are you identifying what's helping things deliver better? Are you identifying how to create a better developer experience? Where do you need to be better?
Scenario: In the last couple of weeks you had a few outages that impacted your customers. You were slow to fix the issues and it broke your SLA.
This is how you would follow the process to fix this problem:
1.1 By looking at your MTTR chart by bug priority you can clearly see that there was an increase in the time to restore over the last couple of weeks:
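If you don't have this chart yet, the numbers behind it are simple to compute. Here is a minimal sketch, assuming you can export incidents as (priority, opened, restored) records. The `INCIDENTS` data and the `mttr_by_week_and_priority` function are hypothetical and only illustrate the calculation, not any particular tool.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical incident export: (priority, opened_at, restored_at).
INCIDENTS = [
    ("P1", "2024-03-04T09:00", "2024-03-04T11:00"),
    ("P1", "2024-03-11T09:00", "2024-03-11T15:00"),
    ("P2", "2024-03-05T10:00", "2024-03-05T14:00"),
    ("P2", "2024-03-12T10:00", "2024-03-12T22:00"),
]


def mttr_by_week_and_priority(incidents):
    """Return {(iso_year, iso_week, priority): mean hours to restore}."""
    buckets = defaultdict(list)
    for priority, opened, restored in incidents:
        opened_at = datetime.fromisoformat(opened)
        restored_at = datetime.fromisoformat(restored)
        hours = (restored_at - opened_at).total_seconds() / 3600
        # Group each incident by the ISO week in which it was opened.
        year, week, _ = opened_at.isocalendar()
        buckets[(year, week, priority)].append(hours)
    return {key: mean(values) for key, values in buckets.items()}


if __name__ == "__main__":
    for key, hours in sorted(mttr_by_week_and_priority(INCIDENTS).items()):
        print(key, f"{hours:.1f}h")
```

In this made-up data set, the MTTR for both priorities triples from one week to the next, which is the kind of jump the chart should make obvious.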
2.1 Now that you have identified the problem, you need to have a blameless discussion with the different team leads to better understand the root cause.
2.2 In this case, you know that the MTTR is linked to a few different leading indicators. It’s important to look at leading indicators, because they are linked to your KPIs, and will help you make a better diagnosis PLUS determine what you will need to measure as you define an action plan.
In this case, your leading indicators are:
a) Number of bugs created (⚠️ is this a general issue?)
b) Number of P1/P2 bugs created
We seem to be creating more bugs in the last couple of weeks and a high number of P1/P2:
🤔 Why? Is it linked to a new project we just released? Are we investing less effort in quality? Is a legacy app that we’ve been dragging along for years finally biting us in the ass?
2.3 Mean Time To Acknowledge for P1/P2 bugs
We see that a 5-hour MTTA for highest-priority bugs and a 2-day MTTA for high-priority bugs are both above our SLA and definitely something we should improve:
By digging further into our data we can see that it seems to be linked to the Data team that was focused on a big release:
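This kind of drill-down can be sketched in a few lines. The example below assumes bug records carry a team, a priority, and created/acknowledged timestamps. The `BUGS` data, the team names, and the SLA thresholds in `MTTA_SLA_HOURS` are all hypothetical placeholders, so swap in your own.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical bug export: (team, priority, created_at, acknowledged_at).
BUGS = [
    ("data", "P1", "2024-03-11T08:00", "2024-03-11T18:00"),
    ("data", "P2", "2024-03-11T08:00", "2024-03-14T08:00"),
    ("platform", "P1", "2024-03-11T08:00", "2024-03-11T09:00"),
    ("platform", "P2", "2024-03-12T08:00", "2024-03-12T12:00"),
]

# Assumed acknowledgement SLAs in hours (not taken from the scenario).
MTTA_SLA_HOURS = {"P1": 1, "P2": 8}


def mtta_by_team(bugs):
    """Return {(team, priority): mean hours from creation to acknowledgement}."""
    buckets = defaultdict(list)
    for team, priority, created, acknowledged in bugs:
        delta = datetime.fromisoformat(acknowledged) - datetime.fromisoformat(created)
        buckets[(team, priority)].append(delta.total_seconds() / 3600)
    return {key: mean(hours) for key, hours in buckets.items()}


def over_sla(mtta, sla):
    """List the (team, priority) pairs whose MTTA exceeds the SLA."""
    return sorted(key for key, hours in mtta.items() if hours > sla[key[1]])


if __name__ == "__main__":
    print(over_sla(mtta_by_team(BUGS), MTTA_SLA_HOURS))
```

Splitting by team before comparing against the SLA is what surfaces a single team as the outlier instead of blaming the whole org.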
2.4 Mean Time To Repair for P1/P2 bugs
3.1 The team decides that the highest impact/lowest effort initiative is to improve the MTTA for the data team.
3.2 You need to hire more people for the data team.
3.3 Make sure priorities are clear = P1 bugs over the new release.
3.4 The team decides to enforce stricter SLAs.
3.5 A new on-call / incident / paging system will also be deployed in the coming months.
4.1 The Goal, KPI, Target and Metrics are defined:
Goal = Reduce resolution time of customer-facing bugs.
KPI = MTTRestore on P1/P2 bugs.
Target = MTTRestore of 3h for P1 bugs & 6h for P2 bugs.
Metrics = # bugs created; # of P1/P2 bugs created; Mean Time To Acknowledge for P1/P2 bugs; Mean Time To Repair for P1/P2 bugs.
4.2 A quick recap presentation is made to the team
5.1 Follow the steps in the “Decide” stage
6.1 On a weekly/monthly basis, you measure the MTTR and make sure it’s below your set target/SLA
6.2 Make sure you also measure the metrics decided in the Align stage (depending on how big your engineering org is, you might have to check these every day - but weekly or monthly should be enough)
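The recurring check can be as simple as comparing each period's MTTR against the targets set in the Align stage (3h for P1, 6h for P2). Below is a minimal sketch. The weekly figures in `WEEKLY_MTTR_HOURS` are invented for illustration, and `target_report` is a hypothetical helper, not part of any tool.

```python
# Targets from the Align stage, in hours.
MTTR_TARGET_HOURS = {"P1": 3.0, "P2": 6.0}

# Hypothetical weekly MTTR measurements in hours, keyed by ISO week label.
WEEKLY_MTTR_HOURS = {
    "2024-W12": {"P1": 5.5, "P2": 9.0},
    "2024-W13": {"P1": 3.5, "P2": 5.0},
    "2024-W14": {"P1": 2.5, "P2": 4.5},
}


def target_report(weekly, targets):
    """Return {week: sorted list of priorities whose MTTR is above target}."""
    return {
        week: sorted(p for p, hours in values.items() if hours > targets[p])
        for week, values in weekly.items()
    }


if __name__ == "__main__":
    for week, misses in target_report(WEEKLY_MTTR_HOURS, MTTR_TARGET_HOURS).items():
        status = "on target" if not misses else "above target: " + ", ".join(misses)
        print(week, status)
```

An empty list for a week means every priority met its target, so a run of empty weeks is the signal that the action plan is working.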
As you can see, all the steps from the process will lead to a clear, measurable solution. You can also see how both your end-user and your engineering team should be happier by the end of this process.
This is only one in a myriad of examples of using this process. You might want to use it whenever you have a quarterly, monthly, or yearly planning cycle - not just when there is an issue. My goal is really to allow engineering leaders and organizations to make this loop happen flawlessly and effortlessly.