At my new job we’re reigniting an effort to move to continuous delivery for our software releases. We figured that we could learn a thing or two from Facebook, so we reached out to Chuck Rossi, Facebook’s first release engineer and the head of their release engineering team. He generously gave us an hour of his time, offering insights into how Facebook releases software, as well as specific improvements we could make to our existing practice. This post describes several highlights of that conversation.
What’s so good about Facebook release engineering?
The core capability my company wants to reproduce is Facebook's ability to release its frontend web UI on demand, invisibly and with high levels of control and quality. In fact, Facebook does a traditional-style large weekly release each Tuesday, as well as two not-so-traditional daily pushes on all other weekdays. They are also able to release on demand as needed. This capability is impressive in any context; it's all the more impressive when you consider Facebook's incredible scale:
- Over 1B users worldwide
- About 700 developers committing against their frontend source code repo
- Single frontend code binary about 1.5GB in size
- Pushed out to many thousands of servers (the number is not public)
- Changes can go from check-in to end users in as little as 40 minutes
- Release process almost entirely invisible to the users
Holy cow.
While the release engineering problem for my company is considerably smaller than the one confronting Facebook, it's not by any means small. (Facebook is so massive that user bases orders of magnitude smaller than its own can still have nontrivial scale.) We don't have to contend with the 1B users, 700 developers, 1.5GB binary or many thousands of servers. But we do want to be able to release on demand, quickly, reliably and invisibly to our users.
How Facebook pushes twice daily to over 1B users
The common thread running through the practices below is that they reject the supposed tradeoff between speed and quality. Releases are going to happen twice a day, and this needs to occur without sacrificing quality. Indeed, the quality requirements are very high. So any approach to quality incompatible with the always-be-pushing requirement is a non-starter.
Here are some of the key themes and techniques.
Empower your release engineers
Chuck mentioned early on that the whole thing rides on having an empowered release engineering team. Ultimately release engineers have to strike a balance between development’s desire to ship software and operations’ desire to keep everything running smoothly. Release engineers therefore need access to the information that tells them whether a given change is a good risk for some upcoming push, as well as the authority to reject changes that aren’t in fact good risks.
At the same time, we want release engineers who “get it” when it comes to software development. We don’t want them blocking changes just because they don’t understand them, or just because they can. Facebook’s release engineers are all programmers, so they understand the importance of shipping software, and they know how to look at test plans, stack traces and the code itself should the need arise.
Empowerment is part cultural, part process and part tool-related.
On the cultural side, Chuck introduces new hires to the release process, and makes it clear that the release engineering team makes the final call on what ships.
As part of that presentation, he explains how the development, test and review processes generate data about the risk associated with a change. The highly integrated toolset, based largely around Facebook’s open source Phabricator suite, provides visibility into that change risk data.
Just to give you an idea of the expectation on the developers, there are a number of factors that determine whether a change will go through:
- The size of the diff. Bigger = more risky.
- The quality of the test plan.
- The amount of back-and-forth that occurred in the code review (see below). The more back-and-forth, the more rejections, the more requests for change—the more risk.
- The developer’s “push karma”. Developers with a history of pushing garbage through get more scrutiny. They track this, though any given developer’s push karma isn’t public.
- The day of the week. Mondays are for small, not-risky changes because they don’t want to wreck Tuesday’s bigger weekly release. Wednesdays allow the bigger changes that were blocked for Monday. Thursdays allow normal changes. Changes for Friday can’t be too risky, partly because weekend traffic tends to be heavier than Friday traffic (so they don’t want any nasty weekend surprises), and partly because developers can be harder to reach on weekends.
The release engineers evaluate every change against these criteria, and then decide accordingly. They process 30-300 changes per day.
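To make those criteria concrete, here is a minimal sketch of what a risk evaluation along these lines could look like in code. The factor scales, weights and thresholds are my own assumptions for illustration; they are not Facebook's actual Phabricator logic.

```python
from dataclasses import dataclass

@dataclass
class Change:
    diff_size: int           # lines changed
    test_plan_quality: int   # 0 (no plan) .. 3 (thorough), as judged in review
    review_iterations: int   # rounds of back-and-forth during code review
    push_karma: int          # 0 (poor push history) .. 4 (consistently clean)
    day_of_week: str         # "mon", "tue", ...

def risk_score(change: Change) -> int:
    """Combine the factors into a single number; weights are invented."""
    score = 0
    score += change.diff_size // 100          # bigger diff, more risk
    score += 3 - change.test_plan_quality     # weaker test plan, more risk
    score += change.review_iterations         # more review churn, more risk
    score += 4 - change.push_karma            # worse push history, more risk
    return score

def accepted_for_push(change: Change) -> bool:
    # In this sketch, Mondays and Fridays only accept low-risk changes.
    threshold = 3 if change.day_of_week in ("mon", "fri") else 6
    return risk_score(change) <= threshold

print(accepted_for_push(Change(250, 3, 1, 4, "wed")))  # True: smallish, well tested
```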
Test suite should take no longer than the slowest test
When you’re releasing code twice a day, you have to take testing very seriously. Part of this is making sure that developers write tests, and part of this is running the full test suite—including integration and acceptance tests—against every change before pushing it.
In some development organizations, one major challenge with doing this is that integration tests are slow, and so running a full regression against every change becomes impractical. Such organizations—especially those that practice a lot of manual regression testing—often handle this by postponing full regression testing until late in the release cycle. This makes regression testing more cost-feasible because it happens only once per release.
But if we’re trying to push twice daily, the run-regression-at-the-end-of-the-release-cycle approach doesn’t work. And neither does truncating the test suite. We can’t give up the quality.
Facebook’s alternative is simple: apply extreme parallelization such that it’s the slowest integration test that limits the performance of the overall suite. Buy as many machines as are required to make this real.
Now we can run the full battery of tests quickly against every single change. No more speed/quality tradeoff.
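To illustrate why the wall-clock time collapses to roughly the duration of the slowest test, here is a small sketch using Python's standard library. The test functions, sleep durations and worker count are made up; a real system at Facebook's scale would shard tests across many machines rather than threads in a single process.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for real test cases; durations are arbitrary.
def fast_unit_test():
    time.sleep(0.1)

def slow_integration_test():
    time.sleep(2.0)

TESTS = [fast_unit_test] * 50 + [slow_integration_test] * 5

def run_suite(tests, workers):
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda t: t(), tests))
    return time.time() - start

if __name__ == "__main__":
    # Run serially, this suite takes about 15 seconds; with one worker per
    # test, total time approaches the slowest single test (about 2 seconds).
    print(f"parallel wall time: {run_suite(TESTS, workers=len(TESTS)):.1f}s")
```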
Code review EVERYTHING
Chuck was at Google before he joined Facebook, and apparently both Google and Facebook review every code change, no matter how small. Some development shops practice code review only in limited contexts, or not at all; at Facebook, pre-push code reviews are fundamental to the development and release process. The process flat out doesn’t work without them.
As the session progressed, I came to understand some reasons why. One key reason is that it promotes the right-sizing of changes so they can be developed, tested, understood and cherry-picked appropriately. Since Facebook releases are based on sets of cherry picks, commits need to be smallish and coherent in a way that reviews promote. And (as noted above) the release engineers depend upon the review process to generate data as to any given change’s riskiness so they can decide whether to perform the cherry pick.
Another important benefit is that pre-push code reviews can make it feasible to pursue a single monolithic code repo strategy (often favored for frontend applications involving multiple components that must be tested together), because breaking changes are much less likely to make it into the central, upstream repo. Facebook has about 700 developers committing against a single source repository, so they can’t afford to have broken builds.
Facebook uses Phabricator (specifically, Differential and Arcanist) for code reviews.
Practice canary releases
Testing and pre-push reviews are critical, but they aren’t the entire quality strategy. The problem is that testing and reviews don’t (and can’t) catch everything. So there has to be a way to detect and limit the impact of problems that make their way into the production environment.
Facebook handles this using “canary releases”. The name comes from the practice of using canaries to test coal mines for the presence of poisonous gases. Facebook starts by pushing to six internal servers that their employees see. If no problems surface, they push to 2% of their overall server fleet and once again watch closely to see how it goes. If that passes, they release to 100% of the fleet. There’s a bunch of instrumentation in place to make sure that no fatal errors, performance issues or other such undesirables occur during the phased releases.
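A staged rollout like this can be thought of as a loop over expanding deployment rings, with a health check gating each expansion. The stage sizes, health check and rollback hook below are hypothetical stand-ins, meant only to show the shape of the process.

```python
import random

# Hypothetical rollout rings: (label, fraction of the fleet).
STAGES = [("internal", 0.0005), ("canary", 0.02), ("full", 1.0)]

def deploy(fraction: float) -> None:
    print(f"deploying new binary to {fraction:.2%} of servers")

def healthy() -> bool:
    # Stand-in for real instrumentation: fatal errors, latency, error rates.
    return random.random() > 0.05

def rollback() -> None:
    print("rolling back to previous binary")

def staged_release() -> bool:
    for label, fraction in STAGES:
        deploy(fraction)
        if not healthy():
            print(f"problem detected at stage '{label}'")
            rollback()
            return False
    print("release complete")
    return True

if __name__ == "__main__":
    staged_release()
```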
Decouple stuff
Chuck made a number of suggestions that I consider to fall under the general category “decouple stuff”. Whereas many of the previous suggestions were more about process, the ones below are more architectural in nature.
Decouple the user from the web server. Sessions are stateless, so there’s no server affinity. This makes it much easier to push without impacting users (e.g., downtime, forcing them to reauthenticate, etc.). It also spreads the pain of a canary-test-gone-wrong across the entire user population, thus thinning it out. Users who run into a glitch can generally refresh their browser to get another server.
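One common way to get this kind of statelessness (not necessarily how Facebook does it) is to make the session self-contained and signed, so any web server can validate it and no server affinity is needed. The secret, cookie format and field names below are invented for illustration.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"shared-by-all-web-servers"  # hypothetical shared signing key

def make_session_cookie(user_id: int) -> str:
    """Build a self-contained, signed session token."""
    body = base64.urlsafe_b64encode(json.dumps({"uid": user_id}).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"

def read_session_cookie(cookie: str):
    """Any server holding the secret can validate the session; returns None
    if the signature does not check out."""
    body, sig = cookie.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(body))

cookie = make_session_cookie(42)
assert read_session_cookie(cookie) == {"uid": 42}
```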
Decouple the UI from the service. Facebook’s operational environment is extremely large and dynamic. Because of this, the environment is never homogeneous with respect to which versions of services and UI are running on the servers. Even though pushes are fast, they’re not instantaneous, so there has to be an accommodation for that reality.
It becomes very important for engineers to design with backward and forward compatibility in mind. Contracts can evolve over time, but the evolution has to occur in a way that avoids strong assumptions about which exact software versions are operating across the contract.
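A typical way to honor that constraint is the "tolerant reader" style sketched below: the consumer reads only the fields it needs, supplies defaults for anything missing, and ignores fields it does not recognize. The payload and field names are invented; this is just one way to keep a contract resilient to version skew.

```python
def parse_profile(payload: dict) -> dict:
    """Read a user-profile response without assuming a particular
    service version: ignore unknown fields, default missing ones."""
    return {
        "user_id": payload["user_id"],                    # present in every version
        "display_name": payload.get("display_name", ""),  # added in a later version
        "locale": payload.get("locale", "en_US"),         # optional, defaulted
    }

# An older service that doesn't send display_name still parses cleanly,
# and a newer one that adds extra fields is handled without code changes.
old_response = {"user_id": 42}
new_response = {"user_id": 42, "display_name": "Alice", "theme": "dark"}
assert parse_profile(old_response)["display_name"] == ""
assert parse_profile(new_response)["user_id"] == 42
```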
Decouple pushes from feature activation. Facebook uses dark launches and feature flags to decouple binary pushes from the activation of features. The general concept is for the features to exist in latent form in the production environment, with a means to activate and deactivate them at will.
Dark launches and feature flags further erode the speed/quality tradeoff. You can release code without activating it, giving you a way to get it out the door without impacting users. And when you do activate it, you have a way to turn it off immediately should a problem arise. These techniques also simplify source code management because you can just manage everything on trunk instead of having a bunch of branches sitting around waiting to be merged.
Facebook uses an internally-developed tool called Gatekeeper to manage feature flags. Gatekeeper allows Facebook to turn feature flags on and off, and to do that in a flexibly segmented fashion.
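Gatekeeper itself is internal to Facebook, so its API isn't something I can show; the sketch below is a generic, hypothetical feature-flag check with an employee dark launch and a percentage rollout, just to make the concept concrete. The flag names, segmentation rules and percentages are assumptions.

```python
import hashlib

# Hypothetical flag configuration: employee dark launch plus a percentage rollout.
FLAGS = {
    "new_timeline": {"employees": True, "everyone_pct": 5},
}

def _bucket(user_id: int, flag: str) -> int:
    """Deterministically map a user to a 0-99 bucket for this flag."""
    digest = hashlib.sha1(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag: str, user_id: int, is_employee: bool = False) -> bool:
    config = FLAGS.get(flag)
    if config is None:
        return False            # unknown flags default to off
    if is_employee and config.get("employees"):
        return True             # dark-launch to employees first
    return _bucket(user_id, flag) < config.get("everyone_pct", 0)

# The code path stays latent until the flag is flipped; turning it off again
# is a configuration change, not a new binary push.
if is_enabled("new_timeline", user_id=1234):
    pass  # render the new timeline
```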
Recap and concluding thoughts
I mentioned earlier that Facebook rejects the apparent tradeoff between speed and quality. At their core, the practices above amount to ways to maintain quality in the face of rapid fire releases. As the overall release practice and infrastructure matures, opportunities for further speedups and quality enhancements emerge.
As you can see, our one hour conversation was packed with a lot of outstanding information. I hope that others might benefit from this material in the way that I know my company will. Thanks Chuck!
Additional resources for Facebook release engineering
Facebook publishes a great deal of useful information about their release engineering processes. Here are some good resources to learn more, mostly directly from Chuck himself.
- Push: Tech Talk – May 26, 2011 (video): This is a class that Chuck gives to new developers when they join Facebook. It’s just slightly out of date as Facebook now does two daily pushes instead of one. Outstanding information about release schedule, branching strategy, cultural norms, tools and more. Just under an hour but well worth the watch.
- Release engineering and push karma: Chuck Rossi: Interview covering some highlights of the Facebook release process and its supporting culture.
- Ship early and ship twice as often: Chuck explains how Facebook moved from a once-per-day push schedule to a twice-per-day schedule.
- Release Engineering at Facebook: Secondary source with highlights on the Facebook release process.
- Hammering Usernames: Facebook explains how they use dark launches to mitigate risk.
- Girish Patangay keynote Velocity Europe 2012 “Move Fast and Ship Things” (video) – Keynote by Facebook’s Girish Patangay describing some additional elements of the Facebook release process, including its use of a BitTorrent-based system to push a large binary very quickly out to many thousands of servers.
By Amit Behal December 3, 2012 - 2:38 pm
Willie, thanks for putting this article together to give us an insight into the RM processes at Facebook. The content of the article does not spell out anything new or earth-shattering, as it is either common knowledge or part of best practices in use across the industry, so that brings to the forefront the question: what is the exact nature of the problem that you and/or your team are trying to solve?
By Willie Wheeler December 4, 2012 - 6:29 pm
Amit–great to see you here.
Fair question. The answer is that we’re trying to get high quality weekly and daily releases working for a large web app. Nothing too special in the goal there.
We do have a monolithic source repo, and Facebook does as well. So we wanted some pointers on how they manage to avoid broken builds with 700 people committing against it.
While I agree that the concepts above are largely well-known, there are some interesting points and subtleties to consider.
1) It’s interesting that Facebook is able to use more or less standard prescriptions to manage releases at massive scale. A best practice could fall down well before 1B users without losing its claim to being a best practice. Facebook’s experience says that these practices–layered appropriately–scale just fine to releasing to at least 1B users.
2) Another interesting point is the foundational role of cherry picking and code review in Facebook’s process. The usual advice doesn’t mention cherry picking at all. Instead it says to keep the trunk releasable so you can deploy when you like.
Facebook would agree that the trunk should be kept in good working order, but they have a strong definition of “releasable” that requires an approach that goes beyond simply keeping the build green. Whenever Facebook pushes out a change, the engineer responsible for that change needs to be hanging around to make sure that he’s ready to help if something goes wrong.
If you simply keep the trunk green, then anytime you want to do a daily release, you have to have the whole team around, because presumably everybody’s been committing code. That’s hard. So there’s this speed/quality tradeoff: you either wait for the end of a sprint to do your releases (speed hit), or else you drop the requirement that every engineer has to be hanging out to shepherd their change through to production (quality hit).
3) Test parallelization is clearly a well-known concept, but the level of aggressiveness described in the post is not as well-known. (Happy to be educated on that point.) In a lot of the reading I’ve done, people talk about primarily running unit tests as part of the commit stage of the CI build, with a handful of strategically chosen integration and/or acceptance tests in the mix to detect known risks. The Continuous Delivery book, for example, presents that approach. And while I think it’s a fine approach, the Facebook extreme parallelization approach means that you don’t have to give up either speed or quality when you commit: you can run the entire test suite (unit, integration, acceptance) on every single commit. I think this idea may be cost prohibitive for some orgs, so that could be why it doesn’t seem to come up.
By Amit Behal December 7, 2012 - 10:58 am
Hi Willie
The fact that your team has to deal with a monolithic source repository, with scores of developers checking in their code, does make it a complicated and therefore challenging proposition, and I see the need to reach out to Chuck Rossi to see how Facebook solved their problem.
1. Yes, it is interesting indeed that Facebook is able to manage releases at massive scale, BUT it is also true that they are able to do so because of the confidence they have built in both their processes and infrastructure design. Facebook’s release team releases code utilizing release trains (from reading what’s publicly available) – so on days that have major releases scheduled, come rain or hail, the releases go through, and the minor releases (and any code that missed the release-train deadline) go out Monday, Wednesday and Thursday – Fridays being the no-surprises day, of course.
Facebook deploys and tests code – major or minor – internally before sending it out to end users. Since not much is known about how Facebook handles hot-fixes, I would assume the same process may or may not be followed to patch an issue that does make it into production.
2. For the trunk to comply with Facebook’s definition of ‘release-able’, I agree with your assessment that they have tight controls to maintain it. CI will only take you so far; a battery of tests (kept updated) run on the CI build (on a defined schedule, I would assume, and/or on demand by the RE team) in a staging environment will flag any issues it uncovers, on top of the use of the ‘karma rating’ of the developer(s) checking in their code.
At one of the organizations where I managed the release processes, my team (RM) would review each change with a fine-toothed comb, having the authority to approve or disapprove a change request. Thus, come morning, when the changes needed supporting, we already had reviewed and approved changes AND a list of reviewed but unapproved changes that would either get updates from the development teams overnight OR during the course of the day before their releases were to be deployed; the unapproved changes would then be re-reviewed by my onsite team and either approved or rejected with comments. Following this process led to 0.08% of changes requiring back-out in production (on an average of 8,000-10,000 releases per year, a combination of releases across all environments). While I did not have a ‘karma rating’ system in place, my daily interaction with the developers made my team and me well aware of the teams/developers delivering bad-quality code.
I have in my previous engagements required that the developer or a developer-delegate be available before a production deployment, and only when they had completed a test plan on the newly deployed code would the code actually be released to live traffic (load balancers help here). Should a developer not show up for their code deployment, the code would be held back, with the developer/team required to give a good justification / business reason for being unavailable for the deployment/release. So the fact that Facebook also requires the developer(s) responsible for checked-in code to be available does not surprise me at all. With regards to the speed/quality tradeoff – that problem could be solved easily in an onsite/offshore environment, wherein the team at one of the locations is available to cover the release.
3. Performance and capacity planning are well-known, but not many organizations have them implemented as part of their release process – I know, because I have worked at a few that did and some that did not. It’s organizations that have their code highly distributed (to make it easy to develop/maintain/release/support) that have to cherry-pick which application(s) should go through the P&C tests. Only when the P&C team signs off on the release should it be made available to be deployed to production and released to live traffic.
Infrastructure does not come cheap, and not many organizations can afford to invest in elaborate staging/test environments like Facebook does, BUT if most of your business, if not all of it, comes from your website/application, then having HA and performing well within the SLAs is something the business teams should not hold back on. All that is needed, simply put, is to show the business team the benefits they stand to reap on their investment.
By Willie Wheeler December 7, 2012 - 11:16 pm
Hi Amit. Thanks for the thoughtful comments. My responses:
1) I agree–Facebook’s capability is based on their processes and infrastructure, among other things (e.g. culture, tools). Hopefully that came out in the post.
2) One of the things Chuck mentioned on push karma was that for a while he knew (like you) which developers pushed risky code. But as the org grew, it became impossible to track mentally. Facebook’s approach is also a little different from what you describe. The FB release engineering team has five members at the moment, and they process up to 300 changes every single day. So they can’t really review each change with a fine-toothed comb. Instead they rely on Phabricator to aggregate the data that I mention in the article and display a simple bar chart that gives a quick understanding of the risk. As long as the risk is low and the developer is on IRC, they cherry-pick the change. Of course, the detailed information is at hand should they need to drill down for whatever reason.
3) I agree with you that while the cost of the test infrastructure might be high, the cost of not having sufficient capacity on hand might be higher still. It’s up to each org to do the math.
Thanks again for your insights.