The empirical laws of how long it takes to review and release a pull request depending on its size. We inferred them from 100,000 pull requests to commercial closed-source repositories belonging to our software development analytics SaaS clients.
I like to read posts that describe software development best practices. It’s fun to compare your own experience with other people’s. I admit that I don’t always follow them since so many authors like to position their findings as flawless and unexceptionable. My 15 years of software engineering have taught me that everything is a compromise, and every decision depends on the context. And the context is often different from another FAANG’s insight. Anyway, there is one specific best practice that is hard to argue with.
💡 Good developers don’t create big PRs.
💡 Developers should review no more than 200 to 400 lines of code (LOC) at a time.
💡 (…)the optimal pull request size is small.
💡 Make smaller pull requests.
Intuitively, the quality and duration of a code review of 10 LOC vary drastically from 10k LOC. The probability of failing the unit tests in the CI grows correspondingly. Big PR size is often a consequence of several features melded together, which hinders the granular reverts. Etc., etc.
The negative influence of PR size on whatever is easy to prove on the public repositories. For example, Tapajit Dey and Audris Mockus studied PRs to NPM packages in their ACM paper. There is much less evidence for proprietary projects, for obvious reasons.
I am working at Athenian. We are building a SaaS product that reveals various metrics of the software development lifecycle. All our clients are mere mortals that have private repositories on GitHub.com, nothing compared to Fortune 100. So I thought that sharing some aggregations over their data relative to pull request size would be interesting for everybody, and launched my Jupyter. The sample size appeared to be around 460,000 PRs, though only a fraction of them was reviewed or released.
I measured two PR properties: the review time and the lead time. Both are not trivial to calculate because of the numerous edge cases.
This is how we measure the PR review time in our product. The end of the review period is easier to define: it is the time of the last review, either a rejection, a neutral comment, or an approval. If this never happened to a PR, we exclude it from the calculation. If a PR is not closed — either rejected or merged — we neither take it into account because there is no guarantee there won’t be new reviews in the future. We discard reviews that happened after the PR was closed and self-reviews — GitHub allows leaving neutral resolutions in both cases. Finally, we ignore reviews from bots.
Finding the time when the reviews began is trickier. There are various behavioral patterns to support:
We take the timestamp of the last commit before the first review in (1).
We take the timestamp of the first review request in (2) and (3).
We take the timestamp of the first comment in (4).
These heuristics are not perfect. For example, we could apply NLP to understand what people are commenting in (4). Some developers never review PRs with non-passing CI, so our logic in (3) can make a wrong assumption. The model tries to abstract human behavior, so it will never be 100% correct but can always be improved. Your suggestions are very welcome!
Below is the scatter plot of PRs in the log scale. As with any other distribution of artifacts of human origin, ours has a long tail and does not fit well in the regular linear coordinates.
The straight horizontal and vertical lines appear because of the quantization: PRs may have only integer changed lines, and GitHub API returns timestamps rounded to seconds.
The vague stripes at the top are more interesting: they indicate the working hours. That is, if you open a PR at time X, you will unlikely receive a review at time X + 12 hours unless you work in a team distributed across distant time zones. Not all Athenian clients do, apparently.
The second property of the distribution of the artifacts of human origin is that there are many outliers. The easiest way to reduce the influence of outliers is to compute robust statistics, such as the median instead of the average. Let’s see how the median review time depends on the PR size.
I cut the size range to 10 intervals with the same number of PRs and obtained 10 points (median of the size interval, median review time on the size interval).
The least squares fit of those 10 points states that we can expect our median review time to be equal to 8 minutes multiplied by the number of changed lines in the PR to the power of ⅔. Particularly, it means that if our PRs have 0 changed lines, — for example, if we rename a few files, —then half of them will be reviewed only once, and fast.
Now, pull request lead time is not a well-known metric. We define it as the time between when the work on the PR began and when the PR was released.
The work on the PR begins when the developer starts coding or thinking about coding. It’s not possible to figure out the exact timestamp, of course. We approximate it by the minimum of two timestamps:
If the client enabled JIRA integration, we can add the third timestamp when the mapped JIRA issue first appeared in progress.
Releasing a PR is more complex to get right. I’ve spent numerous days investigating edge cases, fixing bugs, and rewriting the logic of deciding which GitHub release contains which pull requests. Basically, the product supports three release definitions:
All three variants involve traversing the Git DAG and scanning for PR merge commits. GitHub allows three different merge strategies, including squashes, but that’s fine. What’s not fine, unfortunately, is that following rebases and otherwise force pushes erase the original merge hashes. We’ve coded clever heuristics to match PRs by merge commit titles, they work 99.9% of the time according to our extensive evaluation. Nevertheless, if we formulate the general task of matching rebased Git commits, then I submit. Do you know any solutions?
Below is the scatter plot of released PRs on the log scale.
We notice the same striped pattern in the middle that I attribute to the working hours, too. Besides, there are a few anomalies at the bottom; they belong to various bot commits that we don’t exclude at this time.
We also see that some PRs take more than a year to release! They are mostly errors of the release matching algorithm. One known problem that we haven’t solved yet is what happens when somebody creates the first tag in a repository that has lived for a long time already: all the PRs get suddenly released in that tag.
Let’s check out how our median-median chart looks like.
The robust fit of the curve gives another law: we expect the median PR lead time to be equal to 4½h + 4h × square root of the number of changed lines. Getting the exact square root is an amazing coincidence.
Even if PRs have zero changed lines, it takes at least 4 and a half hours to release half of them. This is the constant factor, the additive value that does not depend on the PR size and sets the minimum lead time expectation.
The median lead time law indicates that the bigger the PR grows, the more time it takes to merge and release. The size correlates with the amount of work done in the PR, which correlates with the amount of time spent, no surprise. Yet, the dependency is the square root, not the linear. Is it really worth splitting a PR in two or mixing changes together? The math will judge.
This is quite the opposite of the best practice! In fact, this inequation will hold for any power ≤1. What’s going on?
There are several answers. First, if the changes are independent and you submit them in parallel, you’ll reverse:
Second, similarly, the average lead time always decreases with smaller PRs:
In other words, if you need to trick the metric, split your work into smaller pieces: the net impact will be the same, you’ll spend more time overall, but the average lead time will improve and the manager will be happy.
Third, the consequences matter. Larger PRs are more likely to be lower quality, contain more bugs, etc. Writing the code is just the tip of the iceberg. The lead time fails to capture that.
Fourth, your development processes matter. If the incoming PRs are not going through the CI or the unit tests are poor, if the reviews are weak, formal, or nonexistent, then pushing bigger chunks of code is obviously faster. Why mess with PRs at all then?
Fifth, there is a certain bias in the source data. Our clients have very few PRs that exceed 1k LOC (~6%). We don’t know how many of them were really written from scratch. So there is no guarantee that a + b√x will stay true at scale.
We’ve seen how PR review and lead times depend on its size. Our sample size was about 100,000, PRs belong to private companies and are not open source. Both times tend to grow as the size increases, but the power is different: the median review time raises faster. The dependencies are sub-linear, and one cannot talk about slowing down the development by two times on twice as big pull requests. This contradicts with the best practices at first glance, but has an explanation and we don’t recommend enlarging your PRs.
If you liked this post, please check out Athenian: we provide best in class analytics over GitHub and JIRA. And we don’t encourage the managers to blindly follow the metrics and to compare individuals!