At my new job we’re reigniting an effort to move to continuous delivery for our software releases. We figured that we could learn a thing or two from Facebook, so we reached out to Chuck Rossi, Facebook’s first release engineer and the head of their release engineering team. He generously gave us an hour of his time, offering insights into how Facebook releases software, as well as specific improvements we could make to our existing practice. This post describes several highlights of that conversation.
What’s so good about Facebook release engineering?
The core capability my company wants to reproduce is Facebook’s ability to release its frontend web UI on demand, invisibly and with high levels of control and quality. In fact Facebook does a traditional-style large weekly release each Tuesday, as well as not-so-traditional two daily pushes on all other weekdays. They are also able to release on demand as needed. This capability is impressive in any context; it’s all the more impressive when you consider Facebook’s incredible scale:
- Over 1B users worldwide
- About 700 developers committing against their frontend source code repo
- Single frontend code binary about 1.5GB in size
- Pushed out to many thousands of servers (the number is not public)
- Changes can go from check-in to end users in as quickly as 40 minutes
- Release process almost entirely invisible to the users
While the release engineering problem for my company is considerably smaller than the one confronting Facebook, it’s not by any means small. (Facebook is so massive that user bases orders of magnitude smaller than Facebook can still have nontrivial scale.) We don’t have to contend with the 1B users, 700 developers, 1.5GB binary or many thousands of servers. But we do want to be able to release on demand, quickly, reliably and invisibly to our users.
How Facebook pushes twice daily to over 1B users
The common thread running through the practices below is that they reject the supposed tradeoff between speed and quality. Releases are going to happen twice a day, and this needs to occur without sacrificing quality. Indeed, the quality requirements are very high. So any approach to quality incompatible with the always-be-pushing requirement is a non-starter.
Here are some of the key themes and techniques.
Empower your release engineers
Chuck mentioned early on that the whole thing rides on having an empowered release engineering team. Ultimately release engineers have to strike a balance between development’s desire to ship software and operations’ desire to keep everything running smoothly. Release engineers therefore need access to the information that tells them whether a given change is a good risk for some upcoming push, as well as the authority to reject changes that aren’t in fact good risks.
At the same time, we want release engineers that “get it” when it comes to software development. We don’t want them blocking changes just because they don’t understand them, or just because they can. Facebook’s release engineers are all programmers, so they understand the importance of shipping software, and they know how to look at test plans, stack traces and the code itself should the need arise.
Empowerment is part cultural, part process and part tool-related.
On the cultural side, Chuck introduces new hires to the release process, and makes it clear that the release engineering team makes the decision.
As part of that presentation, he explains how the development, test and review processes generate data about the risk associated with a change. The highly integrated toolset, based largely around Facebook’s open source Phabricator suite, provides visibility into that change risk data.
Just to give you an idea of the expectation on the developers, there are a number of factors that determine whether a change will go through:
- The size of the diff. Bigger = more risky.
- The quality of the test plan.
- The amount of back-and-forth that occurred in the code review (see below). The more back-and-forth, the more rejections, the more requests for change—the more risk.
- The developer’s “push karma”. Developers with a history of pushing garbage through get more scrutiny. They track this, though any given developer’s push karma isn’t public.
- The day of the week. Mondays are for small, not-risky changes because they don’t want to wreck Tuesday’s bigger weekly release. Wednesdays allow the bigger changes that were blocked for Monday. Thursdays allow normal changes. Changes for Friday can’t be too risky, partly because weekend traffic tends to be heavier than Friday traffic (so they don’t want any nasty weekend surprises), and partly because developers can be harder to reach on weekends.
The release engineers evaluate every change against these criteria, and then decide accordingly. They process 30-300 changes per day.
Test suite should take no longer than the slowest test
When you’re releasing code twice a day, you have to take testing very seriously. Part of this is making sure that developers write tests, and part of this is running the full test suite—including integration and acceptance tests—against every change before pushing it.
In some development organizations, one major challenge with doing this is that integration tests are slow, and so running a full regression against every change becomes impractical. Such organizations—especially those that practice a lot of manual regression testing—often handle this by postponing full regression testing until late in the release cycle. This makes regression testing more cost-feasible because it happens only once per release.
But if we’re trying to push twice daily, the run-regression-at-the-end-of-the-release-cycle approach doesn’t work. And neither does truncating the test suite. We can’t give up the quality.
Facebook’s alternative is simple: apply extreme parallelization such that it’s the slowest integration test that limits the performance of the overall suite. Buy as many machines as are required to make this real.
Now we can run the full battery of tests quickly against every single change. No more speed/quality tradeoff.
Code review EVERYTHING
Chuck was at Google before he joined Facebook, and apparently at both Google and Facebook they review every code change, no matter how small. Whereas some development shops either practice code review only in limited contexts or else not at all, pre-push code reviews are fundamental to Facebook’s development and release process. The process flat out doesn’t work without them.
As the session progressed, I came to understand some reasons why. One key reason is that it promotes the right-sizing of changes so they can be developed, tested, understood and cherry-picked appropriately. Since Facebook releases are based on sets of cherry picks, commits need to be smallish and coherent in a way that reviews promote. And (as noted above) the release engineers depend upon the review process to generate data as to any given change’s riskiness so they can decide whether to perform the cherry pick.
Another important benefit is that pre-push code reviews can make it feasible to pursue a single monolithic code repo strategy (often favored for frontend applications involving multiple components that must be tested together), because breaking changes are much less likely to make it into the central, upstream repo. Facebook has about 700 developers committing against a single source repository, so they can’t afford to have broken builds.
Facebook uses Phabricator (specifically, Differential and Arcanist) for code reviews.
Practice canary releases
Testing and pre-push reviews are critical, but they aren’t the entire quality strategy. The problem is that testing and reviews don’t (and can’t) catch everything. So there has to be a way to detect and limit the impact of problems that make their way into the production environment.
Facebook handles this using “canary releases”. The name comes from the practice of using canaries to test coal mines for the presence of poisonous gases. Facebook starts by pushing to six internal servers that their employees see. If no problems surface, they push to 2% of their overall server fleet and once again watch closely to see how it goes. If that passes, they release to 100% of the fleet. There’s a bunch of instrumentation in place to make sure that no fatal errors, performance issues and other such undesirables occur during the phased releases.
Chuck made a number of suggestions that I consider to fall under the general category “decouple stuff”. Whereas many of the previous suggestions were more about process, the ones below are more architectural in nature.
Decouple the user from the web server. Sessions are stateless, so there’s no server affinity. This makes it much easier to push without impacting users (e.g., downtime, forcing them to reauthenticate, etc.). It also spreads the pain of a canary-test-gone-wrong across the entire user population, thus thinning it out. Users who run into a glitch can generally refresh their browser to get another server.
Decouple the UI from the service. Facebook’s operational environment is extremely large and dynamic. Because of this, the environment is never homogeneous with respect to which versions of services and UI are running on the servers. Even though pushes are fast, they’re not instantaneous, so there has to be an accommodation for that reality.
It becomes very important for engineers to design with backward and forward compatibility in mind. Contracts can evolve over time, but the evolution has to occur in a way that avoids strong assumptions about which exact software versions are operating across the contract.
Decouple pushes from feature activation. Facebook uses dark launches and feature flags to decouple binary pushes from the activation of features. The general concept is for the features to exist in latent form in the production environment, with a means to activate and deactivate them at will.
Dark launches and feature flags further erode the speed/quality tradeoff. You can release code without activating it, giving you a way to get it out the door without impacting users. And when you do activate it, you have a way to turn it off immediately should a problem arise. These techniques also simplify source code management because you can just manage everything on trunk instead of having a bunch of branches sitting around waiting to be merged.
Facebook uses an internally-developed tool called Gatekeeper to manage feature flags. Gatekeeper allows Facebook to turn feature flags on and off, and to do that in a flexibly segmented fashion.
Recap and concluding thoughts
I mentioned earlier that Facebook rejects the apparent tradeoff between speed and quality. At their core, the practices above amount to ways to maintain quality in the face of rapid fire releases. As the overall release practice and infrastructure matures, opportunities for further speedups and quality enhancements emerge.
As you can see, our one hour conversation was packed with a lot of outstanding information. I hope that others might benefit from this material in the way that I know my company will. Thanks Chuck!
Additional resources for Facebook release engineering
Facebook publishes a great deal of useful information about their release engineering processes. Here are some good resources to learn more, mostly directly from Chuck himself.