We’ve all been there: you send another git push to a pull request and then wait 30, 40, 60 minutes for all the triggered CI checks to finish. You multitask or read some r/programming meanwhile. So annoying (having to multitask, that is). Can we do better? Let’s investigate.
Below is a typical CI job pipeline:
We may target optimizations at each of those stages. Besides, there is a meta-optimization that I will describe at the end.
Booting the executor is the shadiest stage of all: the user has few levers. One important thing to be aware of is that the hardware may vary from run to run. For example, cloud GitHub Actions run on different CPUs, and some lack popular instruction set extensions like AVX2. IOPS vary, too. I don’t recommend abusing GHA, but it’s possible to re-trigger jobs several times to win a faster machine.
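To find out which machine a run landed on, you can print the CPU details at the start of the job. A minimal diagnostic sketch, assuming a Linux runner:

```shell
# Print the CPU model and check for AVX2 (Linux runners expose /proc/cpuinfo).
grep -m1 'model name' /proc/cpuinfo
if grep -qm1 avx2 /proc/cpuinfo; then
    echo "AVX2 available"
else
    echo "AVX2 missing"
fi
```

Logging this in every job makes hardware-induced flakiness much easier to diagnose after the fact.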
Containerized executors usually start faster than traditional VMs, so if your CI SaaS offers a choice, don’t opt for VMs when unnecessary.
The golden rule of fast project code fetches is to download as little as possible. Some CI systems abstract this stage away, though every config allows setting the clone depth at least. My best shot so far has been a shallow fetch of only the commit under test.
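For illustration, a single-commit shallow fetch might look like this — the repository URL and the `COMMIT_SHA` variable are placeholders for your project and the commit your CI is building (GitHub lets you fetch a reachable commit directly by its hash):

```shell
# Fetch one commit with no history instead of cloning the whole repository.
git init project && cd project
git remote add origin https://github.com/your-org/your-project
git fetch --depth=1 origin "$COMMIT_SHA"
git checkout --detach FETCH_HEAD
```

This is essentially what actions/checkout does under the hood when you leave its default `fetch-depth: 1`.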
The universal advice to speed up the setup is to cache everything.
I know two approaches to caching Docker layers: the easy one and the ultimate one. The easy approach distills to pulling the previously built image and reusing its layers as the build cache.
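A sketch of the easy approach, assuming a hypothetical `registry.example.com/app` image and BuildKit’s inline cache:

```shell
# Pull the image from the previous run; tolerate the failure on the first run.
docker pull registry.example.com/app:latest || true
# Reuse its layers as the build cache. BUILDKIT_INLINE_CACHE=1 embeds cache
# metadata into the pushed image so that the next run can reuse it.
DOCKER_BUILDKIT=1 docker build \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    --cache-from registry.example.com/app:latest \
    -t registry.example.com/app:latest .
docker push registry.example.com/app:latest
```

The pull adds some constant time, which is why this approach only pays off when the cached layers are expensive to rebuild.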
The ultimate approach is pulling from and pushing to a custom container registry. However, one still has to prune old, stale layers from the registry occasionally.
Compile incrementally (C/C++, Rust). But remember the varying CPUs — you may catch dragons if you don’t pin the required CPU features.
Leverage both CPU cores (see “Wicked executor”).
Avoid building everything at once. E.g., factor out independent packages and offload them to the setup stage.
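For Rust, pinning the CPU features can be as simple as setting RUSTFLAGS. The `x86-64-v2` level below is an assumption — pick the lowest common denominator of your executor fleet:

```shell
# Incremental compilation with a pinned target CPU, so that cached build
# artifacts stay valid across executors with different instruction sets.
export CARGO_INCREMENTAL=1
export RUSTFLAGS="-C target-cpu=x86-64-v2"
cargo build
```

Without the pin, an artifact cached on an AVX2-capable machine may misbehave when restored on a machine that lacks it.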
Use incremental test runners that select which tests to execute based on the diff. Con: it becomes harder to measure test coverage. Another con: there is little benefit in Python, where it’s common to import half the world in each module.
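In Python land, pytest-testmon is one such runner: it records which tests depend on which code and re-runs only the affected ones. A sketch — the plugin is real, but how much it helps depends on your import graph:

```shell
pip install pytest-testmon
# The first run builds the dependency database (.testmondata);
# subsequent runs execute only the tests affected by the changes.
pytest --testmon
```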
Execute unit tests in multiple threads. IO-bound test suites should overcommit the threads; e.g., I launch four threads on two cores to great success.
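With pytest, parallelism is typically added via the pytest-xdist plugin; the worker count below mirrors the four-workers-on-two-cores setup mentioned above:

```shell
pip install pytest-xdist
# -n sets the number of worker processes; overcommit for IO-bound suites.
pytest -n 4
```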
If your CI bills you by time and not by parallel executors, or your project is open-source, you can spread the unit tests across several jobs so that instead of executing 100% of the tests in one job, you launch 5 jobs each with ~20%.
Given the job time formula C + k * W — where C is the constant factor (boot, fetch, setup), k is a linear factor, and W is the volume of executed unit tests — splitting W into N evenly sized pieces decreases the CI time to C + k * W / N, while you pay for C * N + k * W. It makes sense to grow N until the billing overhead becomes comparable: C * N ~ k * W, hence N ~ k * W / C. Example: 100% of the tests take 40 min and C is 2 min, so N ~ 20. The new CI time is 4 min, and the new billed time is 80 min. However, some statistical pitfalls suggest a lower N — proceed to the next section.
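A back-of-the-envelope check of the arithmetic above (minutes, integer math):

```shell
# Model: time = C + k*W/N, billed = C*N + k*W.
C=2         # constant overhead per job, minutes
KW=40       # k*W: total test volume, minutes
N=$((KW / C))              # N ~ k*W/C = 20
TIME=$((C + KW / N))       # new CI time: 2 + 40/20 = 4
BILLED=$((C * N + KW))     # new billed time: 2*20 + 40 = 80
echo "N=$N time=${TIME}min billed=${BILLED}min"
```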
There is always one critical CI job that finishes last. The following diagram should look familiar:
Surprisingly, optimizing the average job run time alone is not enough: you should additionally reduce its standard deviation. Here are some notorious examples:
To exaggerate, suppose that we’ve got 20 independent CI jobs, each taking 5 minutes with 90% probability and 10 minutes with 10% probability. The odds of all the jobs finishing in 5 minutes are a vanishing 0.9²⁰ ≈ 12%. So we will wait 10 minutes in the remaining 88% of cases, even though the average job run time is only 5 * 0.9 + 10 * 0.1 = 5.5 minutes. The frustration increases as the number of independent CI jobs grows. The situation is similar to big data processing: there is always a tricky edge case in the data that will crash your Spark job and make you start from scratch. There is a fancy theory underneath, but I won’t overwhelm the reader with math formulas.
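The probability in this example is easy to verify with a one-liner (awk handles the floating-point power):

```shell
# P(all 20 jobs take the fast 5-minute path) = 0.9^20 ~ 0.12
P=$(awk 'BEGIN { printf "%.3f", 0.9 ^ 20 }')
echo "$P"
```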
My recommendations to keep the standard deviation low are:
The last point interests me in particular since my company is building a product to calculate and analyze those metrics. So I’ve come up with the following three: Occupancy, Critical Occupancy, and Imbalance.
The Occupancy metric is the ratio between the sum of job run times and the product of the number of jobs and the maximum job run time. This ratio always evaluates between 0 and 1. A value near zero signals extremely inefficient resource utilization, while a one indicates ideal efficiency.
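As a sketch, Occupancy for a set of hypothetical job run times can be computed like this:

```shell
# Occupancy = sum(times) / (count * max(times)).
TIMES="5 7 3 10 9"   # hypothetical job run times, minutes
OCC=$(echo "$TIMES" | awk '{
    sum = 0; max = 0
    for (i = 1; i <= NF; i++) { sum += $i; if ($i > max) max = $i }
    printf "%.2f", sum / (NF * max)
}')
echo "Occupancy: $OCC"   # 34 / (5 * 10) = 0.68
```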
Occupancy reflects how “dense” the CI suite is overall. Our (hundreds of) clients have an average CI occupancy (95th percentile) of 0.58.
The Critical Occupancy metric is very similar to the regular Occupancy, except that we discard jobs of non-critical types. We call a job type critical if at least one job of that type finished last in its parent CI suite. According to our three examples, the critical job types are the unit tests in different environments. The Docker build is also critical in the second example.
Critical Occupancy attempts to exclude quick, lightweight jobs like linting or building documentation and leave only those that influence the overall duration. It’s probably a fairer estimate of the standard deviation’s severity.
Our clients have an average critical occupancy (95th percentile) of 0.61.
The last metric is the simplest: Imbalance is the difference between the longest and the second-longest job run time.
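Using the same hypothetical run times as before, Imbalance is just the gap between the top two:

```shell
# Imbalance = longest run time minus the second-longest.
TIMES="5 7 3 10 9"   # hypothetical job run times, minutes
IMB=$(echo "$TIMES" | tr ' ' '\n' | sort -rn |
      awk 'NR == 1 { top = $1 } NR == 2 { print top - $1; exit }')
echo "Imbalance: ${IMB}min"
```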
Imbalance reflects how much opportunity there is to reduce the overall CI suite duration by optimizing the critical jobs. If the Imbalance value is only a few seconds, it will be hard to speed up CI with “micro-optimizations.” On the contrary, if the value is a few minutes, the game is worth the candle.
Our clients have an average imbalance (95th percentile) of 3min 48s.
The CI diagrams above are just a simplified model. In reality, there can be jobs starting when other jobs finish, implicit dependency DAGs, etc. Luckily, the vast majority of projects do not configure such complex scenarios.
Like other metrics, Occupancy and Imbalance can be uninformative and useless. Occupancy may fluctuate between 0.4 and 0.7 for no actionable reason, or Imbalance may remain close to zero when there are several identical critical jobs. They are like soft skills: sometimes they work, sometimes they don’t.
I’ve modeled a typical CI job as a sequence of multiple stages: boot the executor, fetch the code, set up the environment, build, execute the payload, submit the artifacts. Then, I proposed a few optimizations for those stages to reduce the run time. Finally, I described the “meta” CI optimization, which lowers the standard deviation of run time, and proposed three related metrics: Occupancy, Critical Occupancy, and Imbalance.
As usual, I will be grateful for any feedback and corrections. Please follow me on Medium to get notified of my next posts. I write about ML/DS applied to software development artifacts, Python, PostgreSQL.