My Continuous Integration Takes Too Much Time. How Do I Fix It?

by
Vadim Markovtsev
March 23, 2022
xkcd comic remixed by Author. CC BY-NC 2.5.

We’ve all been there: send another git push to a pull request, then wait 30, 40, or 60 minutes for all the triggered CI checks to finish. Multitask or read some r/programming meanwhile. So annoying (having to multitask, that is). Can we do better? Let’s investigate.

Below is a typical CI job pipeline:

  1. Boot the executor. Depending on the CI, it can be a container, a VM, or a cloud instance.
  2. Fetch the checked source code. A git clone.
  3. Set up for the payload: install the dependencies, the compilers, the linters, etc.
  4. Build. For interpreted languages like Python, this can be a pip install of the project’s package.
  5. Payload: run the unit tests, verify, ensure the code style, etc.
  6. Submit the artifacts.

We may target optimizations at each of those stages. Besides, there is a meta-optimization that I will describe at the end.

Wicked executor

Booting the executor is the shadiest stage of all. The user has few levers. One important thing to be aware of is that the hardware may vary from run to run. For example, cloud GitHub Actions run on different CPUs; some lack popular instruction sets like AVX2. IOPS vary, too. I don’t recommend abusing GHA, but it’s possible to re-trigger jobs several times to win the fastest machine.

Open-source build engineers often miss that the executor frequently has multiple cores. Travis has two, GHA has two, Azure Pipelines has two. Yet I have rarely seen CI scripts that run make -j2 or execute unit tests in several threads.

Containerized executors habitually start faster than traditional VMs, so if your CI SaaS offers a choice, don’t opt for VMs when unnecessary.

Crispy code fetches

The golden rule of fast project code fetches is to download as little as possible. Some CI-s abstract this stage, though every config allows setting the clone depth at least. My best shot so far has been

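git clone --single-branch --branch <branch> --no-tags --depth 1 <repository-url>

which reduces network transfer by skipping all Git objects except those referenced by the commit we are building, that is, the tip of the branch. Note that --branch takes a branch or tag name, not a commit SHA. GitHub Actions applies equivalent optimizations during git fetch: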

git init .
git remote add origin https://github.com/...
git -c protocol.version=2 fetch --no-tags --prune --progress --no-recurse-submodules --depth=1 origin <sha>

Faster setups

The universal advice to speed up the setup is to cache everything.

  • Cache the installed packages/modules/etc. (Python, Ruby, Node, Golang).
  • I am unfamiliar with Java and C#, but I bet there is something to persist with them, too.
  • Cache the docker layers and harness the BuildX features.

I know two approaches to caching the docker layers: the easy and the ultimate. The easy approach distills to

  1. docker load -i layers.tar
  2. docker build --cache-from <image> .
  3. docker save -o layers.tar <layers>

There are two problems with this approach: it is slow, and it is prone to the snowball effect of layers.tar growing after each build, so <layers> should be chosen with care. BuildX removes the need for docker load / save and is thus faster.

The ultimate approach is pulling from and pushing to a custom container registry. However, one still has to prune stale layers from the registry occasionally.

Build smarter

Compile incrementally (C/C++, Rust). But remember the varying CPUs: you may catch dragons if you don’t pin the required CPU properties.

Leverage both CPU cores (see “Wicked executor”).

Shave off some time with development installations in Python (pip install -e .).

Avoid building everything at once. E.g., factor out independent packages and offload them to the setup stage.

“Impload” the payload

Use incremental test runners that select which tests to execute based on the diff. Con: it becomes harder to measure the test coverage. Another con: there is little benefit in Python, where it’s common to import half of the world in each module.

Execute unit tests in multiple threads. IO-bound test suites should overcommit the threads, e.g., I am launching four threads on two cores to great success.
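A minimal sketch of deriving the number of parallel workers from the executor's core count. The helper name worker_count and the default overcommit factor of 2 are assumptions for illustration; the factor simply mirrors the four-threads-on-two-cores example.

```python
import os
from typing import Optional


def worker_count(io_bound: bool, cores: Optional[int] = None, overcommit: int = 2) -> int:
    """Return how many parallel workers to launch on this executor."""
    if cores is None:
        # os.cpu_count() may return None when the count cannot be determined.
        cores = os.cpu_count() or 1
    # CPU-bound work gains nothing from more workers than cores;
    # IO-bound work mostly waits, so it can overcommit the cores.
    return cores * overcommit if io_bound else cores
```

The result can be passed to make -j or to whatever option your test runner uses for parallelism.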

The following is somewhat obvious yet worth a mention: don’t hardcode sleep()-s. First, because there is always a case where the hardcoded value is not enough, and second, because it’s a pure waste of resources. If you really need to sleep, wrap tiny sleep()-s in a loop together with the exit check.
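The loop with an exit check can be sketched as follows. The helper name wait_until and its default timeout and interval are illustrative assumptions, not part of any library.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 10.0, interval: float = 0.05) -> bool:
    """Poll condition() until it returns True or `timeout` seconds elapse.

    Returns True as soon as the condition holds, False on timeout.
    """
    # time.monotonic() is immune to wall-clock adjustments on the executor.
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)  # a tiny sleep instead of one big hardcoded one
```

A test would then call something like wait_until(lambda: server_is_ready(), timeout=5), where server_is_ready is a hypothetical check, and fail explicitly when the result is False. The test finishes as soon as the condition holds, and the timeout doubles as a strict deadline.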

If your CI bills you by time and not by parallel executors, or your project is open-source, you can spread the unit tests across several jobs so that instead of executing 100% of the tests in one job, you launch 5 jobs each with ~20%.

Given the job time formula C + k * W, where C is the constant factor (boot, fetch, setup), k is a linear factor, and W is the volume of executed unit tests, if you split W into N evenly sized pieces, the CI time will decrease to C + k * W / N, and you’ll pay for C * N + k * W. It makes sense to grow N until the billing overhead rises to a comparable value, C * N ~ k * W, so N ~ k * W / C. Example: 100% of the tests take 40 min, C is 2 min, so N ~ 20. The new CI time is 2 + 40 / 20 = 4 min. The new billed time is 2 * 20 + 40 = 80 min. However, some statistical pitfalls suggest a lower N; proceed to the next section.
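The arithmetic of the example can be checked with a short sketch. The function names are hypothetical, and the model assumes the tests split into perfectly even pieces, which real suites rarely do.

```python
from typing import Tuple


def suggested_jobs(constant_min: float, payload_min: float) -> int:
    """N ~ k * W / C: grow N until the overhead C * N matches the payload k * W."""
    return max(1, round(payload_min / constant_min))


def split_plan(constant_min: float, payload_min: float, n_jobs: int) -> Tuple[float, float]:
    """Return (wall-clock minutes, billed minutes) for an even split into n_jobs."""
    wall = constant_min + payload_min / n_jobs    # C + k * W / N
    billed = constant_min * n_jobs + payload_min  # C * N + k * W
    return wall, billed


n = suggested_jobs(constant_min=2, payload_min=40)  # 20
wall, billed = split_plan(2, 40, n)                 # 4.0 and 80 minutes
```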

Meta considerations

There is always one critical CI job that finishes last. The following diagram should look familiar:

There is always one critical CI job that finishes last.

Surprisingly, optimizing the average job run time alone is not enough. You should also reduce the standard deviation. Here are some notorious examples:

Adverse effects of the high standard deviation of the job run time on the overall CI duration.

To hyperbolize, suppose that we’ve got 20 independent CI jobs, each taking 5 minutes with 90% probability and 10 minutes with 10% probability. The odds of all the jobs finishing in 5 minutes are a mere 0.9²⁰≈12%. So we will wait 10 minutes in the remaining 88% of the runs, even though the average job run time is only 5*0.9+10*0.1=5.5 minutes. The frustration increases as the number of independent CI jobs grows. The situation is similar to big data processing: there is always a tricky edge quirk in the data that will crash your Spark job and make you start from scratch. There is a fancy theory underneath, but I will not overwhelm the reader with math formulas.
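The numbers in this example are easy to reproduce. The sketch below assumes the same two-point distribution, independent jobs, and all jobs starting at once; the function names are made up for illustration.

```python
def p_all_fast(n_jobs: int, p_fast: float) -> float:
    """Probability that every one of n independent jobs takes the fast path."""
    return p_fast ** n_jobs


def expected_suite_minutes(n_jobs: int, p_fast: float, fast: float, slow: float) -> float:
    """Expected duration of the whole suite, which ends with its slowest job."""
    p = p_all_fast(n_jobs, p_fast)
    return p * fast + (1 - p) * slow


mean_job = 0.9 * 5 + 0.1 * 10                      # 5.5 minutes per job on average
p = p_all_fast(20, 0.9)                            # about 0.12
suite = expected_suite_minutes(20, 0.9, 5, 10)     # about 9.4 minutes
```

The expected suite duration sits close to the slow path even though the average job is close to the fast one.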

My recommendations to keep the standard deviation low are:

  • Limit requests to web APIs.
  • Depend on as few third parties as possible. For example, redirect pulls to a private container registry instead of probing the rate limit of a public one.
  • Prefer memory caches over disk IO.
  • Put strict deadlines everywhere.
  • Monitor the metrics.

The last point interests me in particular since my company is building a product to calculate and analyze those metrics. So I’ve come up with the following three: Occupancy, Critical Occupancy, and Imbalance.

Occupancy

The occupancy metric is the ratio of the sum of job run times to the product of the number of jobs and the maximum job run time. This ratio always evaluates between 0 and 1. A value close to zero signals extremely inefficient resource utilization, and a one shows ideal efficiency.
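A minimal sketch of this definition, assuming the run times of one CI suite are given as a plain list of numbers in any consistent unit. The function name is hypothetical.

```python
from typing import Sequence


def occupancy(run_times: Sequence[float]) -> float:
    """Sum of job run times divided by (number of jobs * longest run time)."""
    if not run_times or max(run_times) <= 0:
        raise ValueError("need at least one job with a positive run time")
    return sum(run_times) / (len(run_times) * max(run_times))


dense = occupancy([10, 10, 10])   # 1.0: every job runs as long as the slowest
sparse = occupancy([10, 5, 5])    # 20 / 30: two jobs idle half of the time
```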

Examples of Occupancy calculation.

Occupancy reflects how “dense” the CI suite is overall. Our (hundreds of) clients have an average CI occupancy (95th percentile) of 0.58.

Critical Occupancy

The critical occupancy metric is very similar to the regular Occupancy except that we discard jobs of non-critical types. If at least one job of a given type finished last in its parent CI suite, we call that job type critical. In our three examples, the critical job types are the unit tests in different environments. Also, the docker build is critical in the second example.

Examples of Critical Occupancy calculation.

Critical Occupancy attempts to exclude quick, lightweight jobs like linting or building documentation and leave only those influencing the overall duration. It’s probably a fairer estimate of how severe the standard deviation problem is.
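A rough sketch of the calculation under the simplified model where all jobs of a suite start together, so the job that finishes last is the one with the longest run time. The data layout (a list of (job type, run time) pairs per suite), the function names, and the sample numbers are assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple

Job = Tuple[str, float]  # (job type, run time in minutes)


def critical_types(suites: Iterable[List[Job]]) -> Set[str]:
    """Collect the types of the jobs that finished last in at least one suite."""
    critical: Set[str] = set()
    for jobs in suites:
        longest = max(run_time for _, run_time in jobs)
        critical.update(job_type for job_type, run_time in jobs if run_time == longest)
    return critical


def critical_occupancy(jobs: List[Job], critical: Set[str]) -> float:
    """Occupancy computed over the jobs of critical types only."""
    times = [run_time for job_type, run_time in jobs if job_type in critical]
    if not times or max(times) <= 0:
        raise ValueError("need at least one critical job with a positive run time")
    return sum(times) / (len(times) * max(times))


suites = [
    [("lint", 1), ("unit tests", 10), ("unit tests", 8), ("docker build", 4)],
    [("lint", 1), ("unit tests", 6), ("unit tests", 5), ("docker build", 9)],
]
critical = critical_types(suites)                # {"unit tests", "docker build"}
first = critical_occupancy(suites[0], critical)  # (10 + 8 + 4) / (3 * 10)
```

The lint job never finishes last in this made-up history, so it drops out of the calculation.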

Our clients have an average critical occupancy (95th percentile) of 0.61.

Imbalance

The last metric is the simplest: Imbalance is the difference between the longest and the second-longest job run time.
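As a sketch, assuming run times in seconds and a hypothetical function name. Raising an error for a suite with fewer than two jobs is a choice made here, since the definition does not cover that case.

```python
from typing import Sequence


def imbalance(run_times: Sequence[float]) -> float:
    """Longest job run time minus the second-longest one."""
    if len(run_times) < 2:
        raise ValueError("imbalance needs at least two jobs")
    longest, second = sorted(run_times, reverse=True)[:2]
    return longest - second


gap = imbalance([600, 372, 60])    # 228 seconds
tied = imbalance([300, 300, 10])   # 0: two identical critical jobs
```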

Examples of Imbalance calculation.

Imbalance reflects how much the overall CI suite duration can be reduced by optimizing the critical jobs. If the Imbalance value is only a few seconds, it will be hard to speed up CI by “micro-optimizations.” On the contrary, if the value is a few minutes, the game is worth the candle.

Our clients have an average imbalance (95th percentile) of 3min 48s.

Caveats

The CI diagrams above are just a simplified model. In reality, there can be jobs starting when other jobs finish, implicit dependency DAGs, etc. Luckily, the vast majority of projects do not configure such complex scenarios.

Like other metrics, Occupancy and Imbalance can be uninformative and useless. Occupancy may fluctuate between 0.4 and 0.7 for no actionable reason, or Imbalance may remain close to zero when there are several identical critical jobs. They are like soft skills: sometimes they work, sometimes they don’t.

Summary

I’ve modeled a typical CI job as a sequence of multiple stages: boot the executor, fetch the code, set up the environment, build, execute the payload, submit the artifacts. Then, I proposed a few optimizations for those stages to reduce the run time. Finally, I described the “meta” CI optimization, which lowers the standard deviation of run time, and proposed three related metrics: Occupancy, Critical Occupancy, and Imbalance.

As usual, I will be grateful for any feedback and corrections. Please follow me on Medium to get notified of my next posts. I write about ML/DS applied to software development artifacts, Python, PostgreSQL.
