Top ten pops and drops with Splunk, part 2

In our previous post, we looked at how to use Splunk to identify the most volatile event codes over the past day. In this post we’ll revisit that topic in light of some good feedback, and then look at how to get the detailed hour-by-hour breakdown.

The summary query, revisited

Deb Dey noted in response to the previous post that the hourly change metric, being a ratio, is undefined when the denominator is zero. At the time I wasn’t sure whether this concern was real or purely academic, but experience with the query quickly showed that it’s very much real. One scenario we want to detect is a release breaking something that was formerly working, and that scenario actually occurred while we were experimenting with the metric, exposing the need for a fix.

After considering different alternatives, we eventually settled on using the ratio (today’s count + 1) / (yesterday’s count + 1). This solves the zero denominator problem without unduly skewing the ratio.
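
To see why the +1 matters, here’s a tiny Python sketch with made-up counts:

def change(today, yesterday):
    """Smoothed pop ratio: (today's count + 1) / (yesterday's count + 1)."""
    return (today + 1) / (yesterday + 1)

print(change(500, 0))     # 501.0: a release broke a formerly quiet event code, no divide-by-zero
print(change(0, 500))     # ~0.002: a formerly noisy event code went silent
print(change(1000, 500))  # ~2.0: ordinary counts are barely affected by the +1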

We’ve made a few other improvements as well:

  • For monitoring purposes, we care more about pops than drops here, so we’ve modified the query to focus on pops instead of pops and drops.
  • The original query looked at today’s events, Bellevue time. That gives a sharp view of recent events at the start of the Bellevue day but an increasingly washed-out view of them by the end of it, so we changed to a rolling window.
  • We’re now looking at an 8-hour window, since this gives us a view into more recent events.

Here’s the updated query:

index=summary reporttype=eventcount reporttime=hourly [email protected] latest=now
  | eval currHour = strftime(now(), "%H")
  | eval prevHour = currHour - 1
  | where (prevHour - date_hour) % 24 < 8
  | eval near = if(_time >= relative_time(now(), "-8h@h"), 1, 0)
  | stats sum(count) as jfreq by date_hour EventCode near
  | eval tmp_nearFreq = if(near == 1, jfreq, 0)
  | eval tmp_farFreq = if(near == 0, jfreq, 0)
  | stats max(tmp_farFreq) as farFreq, max(tmp_nearFreq) as nearFreq by date_hour EventCode
  | where nearFreq > farFreq
  | eval change = (nearFreq + 1) / (farFreq + 1)
  | stats sumsq(change) as ssc by EventCode
  | sort ssc desc
  | head 10
  | eval magnitude = log(ssc)

And here’s what it generates (simulated data):

EventCode magnitude
3232 6.1
12 5.1
8819 4.7
10329 4.6
1442 4.5
18 4.0
309 3.7
6167 3.5
10002 3.3
17 3.3

Note that the “volatility” score is now called “magnitude”, since we’re looking only at upward behavior here. Magnitude is log_10, just like the Richter scale.

The details query

Now we want to see the behavior of the events that generated the top pops over the past eight hours. Though we could observe the behavior over an arbitrary time period, we’ll use eight hours just to keep things simple. Also, note that we want to see both upward and downward behavior here. In other words, even though we considered only pops for selecting the interesting events, we want to visualize the events holistically.

Here’s the query:

index=summary reporttype=eventcount reporttime=hourly
    [email protected] latest=now
  | join EventCode
      [search index=summary reporttype=eventcount reporttime=hourly
          [email protected] latest=now
        | eval currHour = strftime(now(), "%H")
        | eval prevHour = currHour - 1
        | where (prevHour - date_hour) % 24 < 8
        | eval near = if(_time >= relative_time(now(), "-8h@h"), 1, 0)
        | stats sum(count) as jfreq by date_hour EventCode near
        | eval tmp_nearFreq = if(near == 1, jfreq, 0)
        | eval tmp_farFreq = if(near == 0, jfreq, 0)
        | stats max(tmp_farFreq) as farFreq, max(tmp_nearFreq) as nearFreq
            by date_hour EventCode
        | where nearFreq > farFreq
        | eval change = (nearFreq + 1) / (farFreq + 1)
        | stats sumsq(change) as ssc by EventCode
        | sort ssc desc
        | head 10]
  | eval currHour = strftime(now(), "%H")
  | eval prevHour = currHour - 1
  | where (prevHour - date_hour) % 24 < 8
  | eval near = if(_time >= relative_time(now(), "-8h@h"), 1, 0)
  | stats sum(count) as jfreq by date_hour EventCode near
  | eval tmp_nearFreq = if(near == 1, jfreq, 0)
  | eval tmp_farFreq = if(near == 0, jfreq, 0)
  | stats max(tmp_farFreq) as farFreq, max(tmp_nearFreq) as nearFreq
      by date_hour EventCode
  | eval change = if(nearFreq >= farFreq,
      (nearFreq + 1) / (farFreq + 1), -((farFreq + 1) / (nearFreq + 1)))
  | chart median(change) by date_hour, EventCode

This query uses a slightly simplified version of the summary query as a subsearch, which does the initial identification of the events we care about. We then join on those events to get their hour-by-hour breakdown, and plot the result on a chart (actual event codes suppressed in the legend):

The x-axis is the hour of the day and the y-axis is the change score on a log scale; for pops this is (roughly) the ratio of today’s count to yesterday’s.

To see a specific event, we just hover over its code in the legend (again, actual event codes suppressed in the legend):

This interactive chart gives us a nice way to better understand what’s happening with problematic event codes. We start with the summary view (the table that lists the problem events in descending order by magnitude), and then highlight individual events in the details view to see whether an issue has been resolved or is still in progress.

Until next time, happy Splunking!

Acknowledgments

Thanks to Aman Bhutani, Malcolm Jamal Anderson, Deb Dey and Karan Shah for their contributions to the work here.


Top ten pops and drops with Splunk, part 1

In real-time operational contexts, it’s important to monitor system metrics for major swings in value. Such swings, whether upward (“pops”) or downward (“drops”), often indicate a problem either on the horizon or else in progress.

This can be a little tricky in cases where we have a vector of closely related metrics that compete for attention, because we have to decide how to resolve that competition. Consider, for example, application events. We may have hundreds or even thousands of event codes. If one of them is having huge day-over-day pops in volume all day long, that’s easy: the event code makes the top ten list. If another event code is stable, that’s easy too: it doesn’t make the list. But what happens when we have an event code that has largish pops all day long and another event code that has a single huge spike in volume? Which wins?

This post is the first in a two-part series that explains how to identify the most volatile components in a vector and then visualize their respective time series. In this first post we’ll focus on the identification part, and in the second one we’ll do the time series visualization. Ultimately we want a single chart that shows for each of the top ten its corresponding pop/drop time series (e.g., against the hour of the day), like so:

We also want to be able to zero in on any given event:

(In the charts above I’ve suppressed the labels from the legend.)

Let’s dive in.

Measuring change

Ultimately we want to measure volatility, which is roughly aggregate change, whether upward or downward, over some given time period. But to get there we need to start by measuring change in its unit form.

To keep things concrete we’ll look at day-over-day changes broken down by hour. But the same concepts apply independently of the timescales and timeframes involved.

Also for concreteness, we’ll focus on a vector of event codes. But again the concepts are much more general. Here are some examples:

  • the view counts for a set of web pages
  • the response times for a set of service calls
  • the lead submission counts for a set of lead vendors

We can measure change for really any numeric quantity we like. To keep things exciting (not that this isn’t already exciting enough) we’ll walk through the various attempts I made here, since the reasons for rejecting the earlier attempts motivate the final approach.

Attempt #1: Ratios. Ratios are a good starting point for measuring change: what’s the ratio of events (of some given type) today during the 2am hour as compared to yesterday, for instance? Ratios are better than raw counts here because they have a normalizing effect.

The challenge with ratios is that they’re asymmetric about r = 1.0. Pops start at r > 1.0, and they can grow without bound. Drops on the other hand are bounded by the interval (0, 1). So when you see r = 23.0 and r = 0.0435, it’s not immediately clear that the changes involved differ in direction but not magnitude.

Attempt #2: Log ratios. One potential solution is to use log ratios, say log base 2 since we’re computer people here. Log ratios fix the asymmetry: log_2(23) = 4.5236, whereas log_2(1/23) = -4.5236.

Log ratios aren’t perfect either though. Most of us, despite being computer people, are probably more comfortable reasoning with ratios than we are with log ratios. If you tell me that the log ratio for some event code is 5, I can probably translate that to a ratio of 32/1 easily enough. But if you tell me the log ratio is 13.2, I’m not personally able to arrive at the actual ratio as quickly as I’d like.

Using base 10 might make it a little easier, but the bottom line is that it’s more intuitive to talk about the actual ratios themselves.

Final solution: Modified ratios. Our approach will be to use signed, unbounded ratios. The numerator is always the larger quantity. For instance, if yesterday there were 100 events of a type and today there were 500, then the change score is 5.0 (a pop). If instead there were 500 events yesterday and only 100 today, then the change score is -5.0 (a drop).

This approach is pretty intuitive. If I tell you that the change score was -238.5, it’s immediately clear what that means: yesterday there were 238.5 times more events during some given hour than there were today. Pretty big drop, ratio-wise.
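
Here’s a minimal sketch of this change score in Python (made-up counts; the real computation happens inside the Splunk query later in the post):

def change_score(today, yesterday):
    """Signed, unbounded ratio: the larger count always goes in the numerator."""
    if today >= yesterday:
        return today / yesterday   # pop (or no change): score >= 1.0
    return -(yesterday / today)    # drop: score <= -1.0

print(change_score(500, 100))  #  5.0, a five-fold pop
print(change_score(100, 500))  # -5.0, a five-fold drop
# Note: a zero count still blows this up; that's the issue revisited
# (and fixed with +1 smoothing) in part 2 of this series.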

Now we have a change score. In the next section we’ll see how to aggregate hour-by-hour change scores into an overall measure of volatility.

Measuring volatility

Recall that we have a vector of change scores here, and it might be pretty large. We need to know which ten or so of the hundreds or thousands of event codes are most deserving of our attention.

The approach again is to score each event code. We’ll create a volatility score here since we want to see the event codes with “more change” in some to-be-defined sense. In some cases we may care more about pops (or drops) than we do about volatility in general. Response times are a good example. That’s fine; it’s easy to adapt the approach below to such cases.

For volatility, we’re going to use a sum of squares. The approach is straightforward: for any given event code, square each of its hourly change scores, and then add them all up. That’s volatility.
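
In code form the aggregation is just a sum of squared hourly change scores; here’s a minimal sketch in Python with invented hourly scores:

def volatility(hourly_change_scores):
    """Sum of squared hourly change scores for a single event code."""
    return sum(score ** 2 for score in hourly_change_scores)

steady_popper = [3.0] * 24            # largish pops all day long
one_big_spike = [1.0] * 23 + [20.0]   # flat except for a single huge spike
print(volatility(steady_popper))      # 216.0
print(volatility(one_big_spike))      # 423.0: the squaring lets the single spike win

With these particular made-up scores, the single huge spike actually outranks the all-day popper, which answers the “which wins?” question from earlier.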

The sum of squares is nice for a couple of reasons:

  • Pops and drops contribute equally to overall volatility, since change scores for pops and drops are symmetric, and since the square of a change score is always positive.
  • Squaring the change scores amplifies the effect of large change scores. (If we had change scores in the interval (-1.0, 1.0) then it would also diminish their effect, which is sometimes useful. But in our case there are no such change scores.) So a huge pop that happened only once is still likely to make it into the top ten.

That’s the approach. It’s time to see the Splunk query and the result.

Splunk it out

Here’s a Splunk search query that yields today’s top ten most volatile event codes, measuring changes day-over-day:

index=summary reporttype=eventcount reporttime=hourly earliest=-1d@d latest=now
    | eval marker = if(_time < relative_time(now(), "@d"), "yesterday", "today")
    | stats sum(count) as jfreq by date_hour EventCode marker
    | eval tmp_yesterday = if(marker == "yesterday", jfreq, 0)
    | eval tmp_today = if(marker == "today", jfreq, 0)
    | stats max(tmp_yesterday) as yesterday, max(tmp_today) as today by date_hour EventCode
    | eval change = if(today >= yesterday, today / yesterday, -(yesterday / today))
    | stats sumsq(change) as volatility by EventCode
    | sort volatility desc
    | head 10

Note that we’re using a Splunk summary index here. This is important since otherwise the query will take too long.

Technically, here’s what the query does:

  1. Grab the (trivariate) joint frequency distribution over the variables (1) hour of day, (2) event code and (3) today-vs-yesterday.
  2. Extract change scores against hour of day and event code, considered jointly. Recall that this is our modified ratio involving today’s count and yesterday’s.
  3. For each event code, aggregate change scores into a volatility score as we described above.
  4. Show the ten event codes having the highest volatility.

You can see the effect of the individual commands by peeling them off from the end of the query. See the Splunk Search Reference for more information on what’s happening above.

Looking at the chart view yields the following:

Along the x-axis we have a bunch of event codes (I’ve suppressed the labels), and on the y-axis we have volatility. This chart makes it immediately clear which ten of the hundreds of event codes are seeing the most action.

Conclusion

Even though this isn’t the whole story, what we have so far is already pretty useful. It tells us at any given point in time which event codes are most deserving of our attention from a volatility perspective. This is important for real-time operations.

Recall that we can generalize this to other sorts of metric as well, and also to different timescales and timeframes.

Geek aside: There’s a relationship between what we’re doing here and linear regression from statistics. With linear regression, we start with a bunch of data points and look for a linear model that minimizes the sum of squared errors. We’re doing something loosely analogous here, but in the opposite direction. We start with a linear baseline (change score = 1, representing no change; this isn’t precisely correct, but it’s the basic idea we’re after here), and then treat each event code as a set of data points (change score vs. hour of day) against that baseline. The event codes that maximize the sum of squares are the worst-fit event codes, and thus the ones most worthy of our attention.

In the next post we’ll see how to visualize the actual pops and drops time series for the top ten.

Acknowledgements

Thanks to Aman Bhutani for suggesting the pops and drops approach in the first place, and to both Aman Bhutani and Malcolm Jamal Anderson for helpful discussions and code leading to the work above.


Pushing twice daily: our conversation with Facebook’s Chuck Rossi

At my new job we’re reigniting an effort to move to continuous delivery for our software releases. We figured that we could learn a thing or two from Facebook, so we reached out to Chuck Rossi, Facebook’s first release engineer and the head of their release engineering team. He generously gave us an hour of his time, offering insights into how Facebook releases software, as well as specific improvements we could make to our existing practice. This post describes several highlights of that conversation.

What’s so good about Facebook release engineering?

The core capability my company wants to reproduce is Facebook’s ability to release its frontend web UI on demand, invisibly and with high levels of control and quality. In fact Facebook does a traditional-style large weekly release each Tuesday, as well as two not-so-traditional daily pushes on all other weekdays. They are also able to release on demand as needed. This capability is impressive in any context; it’s all the more impressive when you consider Facebook’s incredible scale:

  • Over 1B users worldwide
  • About 700 developers committing against their frontend source code repo
  • Single frontend code binary about 1.5GB in size
  • Pushed out to many thousands of servers (the number is not public)
  • Changes can go from check-in to end users in as little as 40 minutes
  • Release process almost entirely invisible to the users

Holy cow.

While the release engineering problem for my company is considerably smaller than the one confronting Facebook, it’s not by any means small. (Facebook is so massive that user bases orders of magnitude smaller than Facebook can still have nontrivial scale.) We don’t have to contend with the 1B users, 700 developers, 1.5GB binary or many thousands of servers. But we do want to be able to release on demand, quickly, reliably and invisibly to our users.

How Facebook pushes twice daily to over 1B users

The common thread running through the practices below is that they reject the supposed tradeoff between speed and quality. Releases are going to happen twice a day, and this needs to occur without sacrificing quality. Indeed, the quality requirements are very high. So any approach to quality incompatible with the always-be-pushing requirement is a non-starter.

Here are some of the key themes and techniques.

Empower your release engineers

Chuck mentioned early on that the whole thing rides on having an empowered release engineering team. Ultimately release engineers have to strike a balance between development’s desire to ship software and operations’ desire to keep everything running smoothly. Release engineers therefore need access to the information that tells them whether a given change is a good risk for some upcoming push, as well as the authority to reject changes that aren’t in fact good risks.

At the same time, we want release engineers that “get it” when it comes to software development. We don’t want them blocking changes just because they don’t understand them, or just because they can. Facebook’s release engineers are all programmers, so they understand the importance of shipping software, and they know how to look at test plans, stack traces and the code itself should the need arise.

Empowerment is part cultural, part process and part tool-related.

On the cultural side, Chuck introduces new hires to the release process, and makes it clear that the release engineering team has the final say on what ships.

As part of that presentation, he explains how the development, test and review processes generate data about the risk associated with a change. The highly integrated toolset, based largely around Facebook’s open source Phabricator suite, provides visibility into that change risk data.

Just to give you an idea of the expectation on the developers, there are a number of factors that determine whether a change will go through:

  • The size of the diff. Bigger = more risky.
  • The quality of the test plan.
  • The amount of back-and-forth that occurred in the code review (see below). The more back-and-forth, the more rejections, the more requests for change—the more risk.
  • The developer’s “push karma”. Developers with a history of pushing garbage through get more scrutiny. They track this, though any given developer’s push karma isn’t public.
  • The day of the week. Mondays are for small, not-risky changes because they don’t want to wreck Tuesday’s bigger weekly release. Wednesdays allow the bigger changes that were blocked for Monday. Thursdays allow normal changes. Changes for Friday can’t be too risky, partly because weekend traffic tends to be heavier than Friday traffic (so they don’t want any nasty weekend surprises), and partly because developers can be harder to reach on weekends.

The release engineers evaluate every change against these criteria, and then decide accordingly. They process 30-300 changes per day.

Test suite should take no longer than the slowest test

When you’re releasing code twice a day, you have to take testing very seriously. Part of this is making sure that developers write tests, and part of this is running the full test suite—including integration and acceptance tests—against every change before pushing it.

In some development organizations, one major challenge with doing this is that integration tests are slow, and so running a full regression against every change becomes impractical. Such organizations—especially those that practice a lot of manual regression testing—often handle this by postponing full regression testing until late in the release cycle. This makes regression testing more cost-feasible because it happens only once per release.

But if we’re trying to push twice daily, the run-regression-at-the-end-of-the-release-cycle approach doesn’t work. And neither does truncating the test suite. We can’t give up the quality.

Facebook’s alternative is simple: apply extreme parallelization such that it’s the slowest integration test that limits the performance of the overall suite. Buy as many machines as are required to make this real.
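
Here’s a rough sketch of the idea in Python (the tests and their durations are stand-ins, not anything Facebook-specific): with one worker per test, the suite’s wall-clock time tracks the slowest test rather than the sum of all test durations.

from concurrent.futures import ThreadPoolExecutor
import time

def run_test(test):
    # Stand-in for kicking off a real integration or acceptance test.
    time.sleep(test["duration_secs"])
    return test["name"], "PASS"

tests = [{"name": "test_%d" % i, "duration_secs": d}
         for i, d in enumerate([1, 2, 5, 3, 2, 4, 8, 1])]

start = time.time()
with ThreadPoolExecutor(max_workers=len(tests)) as pool:
    results = list(pool.map(run_test, tests))
# Finishes in about 8 seconds (the slowest test), not the 26 seconds the
# tests would take if run one after another.
print("%d tests finished in %.1fs" % (len(results), time.time() - start))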

Now we can run the full battery of tests quickly against every single change. No more speed/quality tradeoff.

Code review EVERYTHING

Chuck was at Google before he joined Facebook, and apparently at both Google and Facebook they review every code change, no matter how small. Whereas some development shops either practice code review only in limited contexts or else not at all, pre-push code reviews are fundamental to Facebook’s development and release process. The process flat out doesn’t work without them.

As the session progressed, I came to understand some reasons why. One key reason is that it promotes the right-sizing of changes so they can be developed, tested, understood and cherry-picked appropriately. Since Facebook releases are based on sets of cherry picks, commits need to be smallish and coherent in a way that reviews promote. And (as noted above) the release engineers depend upon the review process to generate data as to any given change’s riskiness so they can decide whether to perform the cherry pick.

Another important benefit is that pre-push code reviews can make it feasible to pursue a single monolithic code repo strategy (often favored for frontend applications involving multiple components that must be tested together), because breaking changes are much less likely to make it into the central, upstream repo. Facebook has about 700 developers committing against a single source repository, so they can’t afford to have broken builds.

Facebook uses Phabricator (specifically, Differential and Arcanist) for code reviews.

Practice canary releases

Testing and pre-push reviews are critical, but they aren’t the entire quality strategy. The problem is that testing and reviews don’t (and can’t) catch everything. So there has to be a way to detect and limit the impact of problems that make their way into the production environment.

Facebook handles this using “canary releases”. The name comes from the practice of using canaries to test coal mines for the presence of poisonous gases. Facebook starts by pushing to six internal servers that their employees see. If no problems surface, they push to 2% of their overall server fleet and once again watch closely to see how it goes. If that passes, they release to 100% of the fleet. There’s a bunch of instrumentation in place to make sure that no fatal errors, performance issues and other such undesirables occur during the phased releases.
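
Here’s a hypothetical sketch of that phased-rollout logic; the phase sizes mirror the description above, but the deploy and health-check functions are placeholders rather than Facebook’s actual tooling:

def canary_release(fleet, deploy, healthy):
    """Widen the rollout in phases, but only while health checks stay green."""
    phases = [
        fleet[:6],                          # internal servers that employees see
        fleet[:max(6, len(fleet) // 50)],   # roughly 2% of the overall fleet
        fleet,                              # the full fleet
    ]
    for phase in phases:
        deploy(phase)
        if not healthy(phase):
            return False                    # stop here; most users never saw the bad build
    return True

# Toy usage with placeholder deploy and health-check functions.
servers = ["web%05d" % i for i in range(10000)]
print(canary_release(servers, deploy=lambda hosts: None, healthy=lambda hosts: True))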

Decouple stuff

Chuck made a number of suggestions that I consider to fall under the general category “decouple stuff”. Whereas many of the previous suggestions were more about process, the ones below are more architectural in nature.

Decouple the user from the web server. Sessions are stateless, so there’s no server affinity. This makes it much easier to push without impacting users (e.g., downtime, forcing them to reauthenticate, etc.). It also spreads the pain of a canary-test-gone-wrong across the entire user population, thus thinning it out. Users who run into a glitch can generally refresh their browser to get another server.

Decouple the UI from the service. Facebook’s operational environment is extremely large and dynamic. Because of this, the environment is never homogeneous with respect to which versions of services and UI are running on the servers. Even though pushes are fast, they’re not instantaneous, so there has to be an accommodation for that reality.

It becomes very important for engineers to design with backward and forward compatibility in mind. Contracts can evolve over time, but the evolution has to occur in a way that avoids strong assumptions about which exact software versions are operating across the contract.

Decouple pushes from feature activation. Facebook uses dark launches and feature flags to decouple binary pushes from the activation of features. The general concept is for the features to exist in latent form in the production environment, with a means to activate and deactivate them at will.
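
Conceptually the mechanism looks something like the toy sketch below; the flag names and percentages are invented, and Gatekeeper itself is far more sophisticated (segmenting by user attributes, locale and so on):

# Toy flag store. A real system backs this with a service that can be
# flipped at runtime, not a dict hard-coded into the binary.
FLAGS = {
    "new_timeline": {"enabled": True,  "rollout_percent": 5},
    "chat_v2":      {"enabled": False, "rollout_percent": 0},
}

def feature_enabled(flag_name, user_id):
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    # Deterministic bucketing: a given user gets a stable on/off experience.
    return user_id % 100 < flag["rollout_percent"]

# The dark-launched code ships in the push but only runs when flagged on.
if feature_enabled("new_timeline", user_id=12345):
    print("render the new timeline")
else:
    print("render the old timeline")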

Dark launches and feature flags further erode the speed/quality tradeoff. You can release code without activating it, giving you a way to get it out the door without impacting users. And when you do activate it, you have a way to turn it off immediately should a problem arise. These techniques also simplify source code management because you can just manage everything on trunk instead of having a bunch of branches sitting around waiting to be merged.

Facebook uses an internally-developed tool called Gatekeeper to manage feature flags. Gatekeeper allows Facebook to turn feature flags on and off, and to do that in a flexibly segmented fashion.

Recap and concluding thoughts

I mentioned earlier that Facebook rejects the apparent tradeoff between speed and quality. At their core, the practices above amount to ways to maintain quality in the face of rapid fire releases. As the overall release practice and infrastructure matures, opportunities for further speedups and quality enhancements emerge.

As you can see, our one hour conversation was packed with a lot of outstanding information. I hope that others might benefit from this material in the way that I know my company will. Thanks Chuck!

Additional resources for Facebook release engineering

Facebook publishes a great deal of useful information about their release engineering processes. Here are some good resources to learn more, mostly directly from Chuck himself.


How I got my kids into programming

[Not a devops post, but I wanted to share this somewhere.]

I’m the father of four wild and crazy kids, and the two older ones are at the age now (4th grade and 1st grade) that I’ve started trying to get them interested in programming.

At first I wasn’t having much luck. It was always me asking kid #1 whether he wanted to program, and the answer was pretty much always “no”. I started wondering whether he was really mine at all since no spawn-o-mine would answer in that way.

But I eventually changed my approach. Soon my kids were asking me if we could go program. The 4th grader was up early on Saturday morning programming, still programming when it was time for dinner, and so on. One day, I was surprised to see one of his pushes show up on my GitHub wall, completely without my involvement. The new approach was working.

Yesterday a friend asked me to share some tips, as she has a fourth-grader. Here are some ideas that have been working for me and my kids.

Relate programming to something your kids already care about. It’s the rare child who will find variable assignment, for loops, arrays, etc. inherently interesting, at least at first. But writing a Minecraft mod—now you’re talking. Games in general are a favorite here. But whatever you choose, present programming as a means to an end instead of as an end in itself.

[Note: I'm not saying that a Minecraft mod is an appropriate first project. I've never written one. But the idea that they could write a Minecraft mod was very exciting to my kids. I ended up writing a simple game engine and then having the kids help out with that. You can easily find existing games if you don't want to write your own.]

Start by tweaking rather than writing from scratch. This gives the reward of immediate feedback and offers a gentle and incremental introduction. Learning new concepts, syntax, problem-solving approaches (e.g., divide and conquer)—these are hard enough for adults, let alone a child who hasn’t even taken Algebra I yet.

Introduce simple foundational, not-directly-programming skills. There are lots of examples here and I want to share several.

  • Show your kids how to use the basic functions of the IDE (editing, saving, etc.). Just using the tool they see you use (assuming you’re a programmer) can be exciting to them.
  • Put their code up on GitHub (or whatever), and show them how to push and pull code from central. The kids learn foundational concepts like “repo-out-there” (networking), that they can share code with siblings on different computers, that they should coordinate if they want to work on the same thing (teamwork), writing commit messages, etc. And with GitHub in particular, the social aspects are fun.
  • Content can be initially more engaging than code. So if you have a game, show the kids that they can modify the maps, character definitions and so on by modifying content files. In my kids’ game, there are a bunch of ASCII files that serve as 2D game maps. The game engine requires that each line have exactly the same length. Sometimes when there’s a mistake in the map, the game breaks. The kids learn a ton from working with these maps: the idea that symbols in the map represent tiles in the game, that the computer is really picky about things like whether the line has 119 or 120 characters, how to look at a stack trace for ArrayIndexOutOfBounds to know that there’s a mistake in the line length, etc. All foundational stuff.
  • Configuration is another area that can be fun. Normally when the characters traverse the game map, they can’t walk through walls, over water, etc. They were having to build bridges between all the islands just to see what things looked like. I added a “god mode” flag that lets them walk anywhere they want. Again, good foundational stuff here: the idea of configuration itself (kind of like game cheats), variable assignment and minor coding if your configuration is actually in the code.

Obviously not a complete list (happy to hear additional thoughts), but those have been helpful for my kids.


Designing configuration management schemas

One important issue that comes up when undertaking a configuration management effort is how to design “the schema” for configuration management data. Obviously there’s no one-size-fits-all answer here. But there are a couple of general and complementary approaches you need to know about if you’re working on this. In this post we’re going to look at those.

Some background: configuration and state management

First, let’s take a 50,000-foot view of a fairly generic management architecture for an operational environment:

There are three major components in the diagram above:

  • an environment we want to manage
  • an infrastructure for managing the environment’s configuration: farm and instance definitions (including cardinality), middleware deployments, app deployments, etc.
  • an infrastructure for managing the environment’s state: availability, performance, etc.

Configuration management has a couple of important responsibilities. One is that it has to offer a way to realize desired configurations in the managed environment. For example, it would provide machinery for provisioning, deployment and rollbacks. Another responsibility is to maintain that configuration over time until somebody pushes a new desired configuration through. Both responsibilities are blueprint-driven: a blueprint describes our expectation, and there are mechanisms in place to establish and maintain the configuration against the blueprint.

State management has a similar maintenance responsibility: it has to know what counts as healthy functioning (usually defined by SLAs), and it has to manage state by detecting, diagnosing and remediating excursions. In real life this is usually a combination of automation and manual work. Monitoring is usually automated, whereas it’s pretty common to see tool-assisted manual effort on the diagnosis and remediation side. But automation or no, the job is to maintain healthy state.

Note that configuration management includes an as-is configuration repository, which describes the configuration that’s actually in the environment. Configuration management uses this to find deltas between actual and desired configuration. State management uses it largely for diagnostic purposes, like tracing state issues down through a chain of dependencies.

(A brief aside: In the diagram we have the deployment engine populating the as-is config repo, but that’s only one way to do it, and anyway it’s incomplete. Sometimes there’s an automated discovery process that finds servers and devices in the environment and records them somewhere. Other times, the instance provisioning process installs monitoring agents that phone home to a central server, which effectively becomes an as-is config repo. There can also be security agents on the machine that check files against known checksums and complain into a database when there are changes. These aren’t mutually exclusive, which incidentally can make it hard to get a good read on as-is configuration. But I digress.)

Schema design for configuration management

We said above that configuration management has to establish and maintain desired configurations in the managed environment. This isn’t simply about limiting configuration drift. We want to make it impossible (or at least hard) for wrong configurations to appear in the environment in the first place. For example, we probably never want to see a server farm with three Ubuntu 12.04 instances and one Ubuntu 11.10 instance. But we want to eliminate drift too.

One powerful technique falls out of the blueprint-driven approach. Recall:

On this design, the blueprint states intent, and the deployment engine makes it real. So if we don’t want to see bogus configurations appear in the environment, one approach is to make them impossible to represent with our blueprints in the first place. And if we don’t want to see bogus configurations passing our periodic audits (e.g., agents in the environments just watching over stuff), we can once again adopt the approach of making such configurations impossible to express with our blueprints.

To see how this works, consider the following schema for describing a server farm.

This is a pretty natural way to see the world, and hence a natural way to design a schema. The farm has a bunch of instances. Each instance has a type, an image (OS + burnt-in packages) and some number of add-on packages.
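
Sketched as data structures (the field names are illustrative, not from any particular tool), the first schema looks roughly like this:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Instance:
    instance_type: str                  # e.g. "m1.large"
    image: str                          # OS plus burnt-in packages
    packages: List[str] = field(default_factory=list)   # add-on packages

@dataclass
class Farm:
    name: str
    instances: List[Instance] = field(default_factory=list)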

But it’s not the only way to see the world. Consider the following alternative:

On this second schema, the farm has an “instance build”, which is just a definition that combines an instance type, image and set of packages together in a single group. The instances still have types, images and packages, but only indirectly through the farm and build entities.
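
And here’s the second schema as a rough sketch, again with illustrative names:

from dataclasses import dataclass, field
from typing import List

@dataclass
class InstanceBuild:
    # Type, image and packages travel together as one named definition.
    instance_type: str
    image: str
    packages: List[str] = field(default_factory=list)

@dataclass
class Instance:
    hostname: str                       # nothing build-related lives here

@dataclass
class Farm:
    name: str
    build: InstanceBuild                # a single build shared by the whole farm
    instances: List[Instance] = field(default_factory=list)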

The second schema is superior for blueprinting. Why? The second schema makes certain unwanted configurations impossible to express, and hence impossible to propagate to the managed environment:

  • It’s impossible on the second schema to describe a farm with three Ubuntu 12.04 instances and one Ubuntu 11.10 instance. Any instances in the farm get the farm’s single configuration. The first schema, on the other hand, allows this wrong configuration.
  • If you want a catalog of standard (or at least defined) builds, the second schema allows you to enforce its use. Of course, if you don’t want a catalog, then you can just drop the instance build entity from the schema and associate its child entities directly with the farm.

The first schema is superior for representing as-is configuration. In the real world, bad configurations occur (e.g., half-completed deployments, rogue sysadmins, security vulnerabilities, human error). With the blueprinting schema, we’re trying to make it impossible to express bad configurations. So by definition it’s not going to be up to the task of representing the bad configurations that actually occur.

Wrap-up

The takeaway is that there’s not just one configuration management schema to design. There are two: a blueprinting schema for expressing desired configuration, and an as-is schema for expressing actual configuration.

We want to embed key domain constraints in the blueprinting schema to the extent that it’s feasible to do so without over-complicating the schema. (In practice we want to identify the high-risk misconfigurations and focus on handling those.) The approach is more abstract: we say things like “it’s really the farm that has an instance build, not the instance”.

In the as-is schema we want to be more unconstrained and concrete so as to represent the wide variety of incorrect configurations that can actually happen.


A fatal impedance mismatch for continuous delivery

Most of the time, when organizations pursue a continuous delivery capability, they’re doing so in pursuit of increased agility. They want to be able to release software at will, with as little delay as possible between the decision to implement a feature and the feature’s availability to end users.

I’m a big fan of agility, and agree with the idea that agility and continuous delivery go hand in hand. There are unfortunately ill-conceived approaches to implementing agility that can prove fatal to a continuous delivery program. In this post we’re going to take a look at one that occurs in larger organizations. We’ll see one reason why it can be challenging to implement continuous delivery in such environments.

Software development in the enterprise

One fairly common configuration in large enterprises is for there to be a shared production environment and multiple development groups creating software to be released into that environment. Sometimes the development groups have the ability to push their own changes into production. But often there’s some central release team, whether on the software side of the house or on the infrastructure/operations (I’ll call them IT in this post) side, that controls the change going into the production environment. Here’s what the red-flag—but common—configuration looks like:

Let’s see what tends to happen in such enterprises.

The quest for agility leads to development siloing

When there are multiple development groups, they usually want to be able to do things their own way. They set up their own source repos, configuration repos, continuous integration infrastructure, artifact repos, test infrastructure (tools, environments) and deployment infrastructure. They have their own approaches to using source control (including branching strategies), architectural standards, software versioning schemes and so on. They see themselves as being third-party software development shops, or at least analogous to them. Releasing and operationalizing the software is largely somebody else’s concern. They certainly don’t want some central team telling them how to do their jobs.

There’s a reason the development groups want things this way: agility. The central release team is either seen to be a barrier to agility, or in many cases, actually is a barrier to agility. There are tons of reasons for both the perception and the reality here. If the central team lives in the IT organization instead of living in a software organization, the chance for misalignment is very high. Common challenges include:

  • IT doesn’t understand best practices around software development (e.g., continuous integration, unit testing, etc.).
  • IT takes on a broad ITIL/ITSM scope when the development groups would prefer that they focus on infrastructure like IaaS providers do.
  • IT chooses big enterprise toolsets that aren’t designed around continuous delivery, integration with development-centric tools and so forth.
  • IT prioritizes concerns differently than the development groups do. In many cases IT is trying to throttle change whereas development is trying to increase change velocity.

But even if the central team manages to escape the challenges above, fundamentally shared services balance competing concerns across multiple customers, and they’re therefore usually suboptimal for any one customer. All it takes is for one or two developers to say, “I can do better” (a pretty common refrain from developers), and suddenly we end up with a bunch of development teams doing things their own way.

This is really bad for continuous delivery. Let’s see why.

Development siloing creates a fatal impedance mismatch

Let’s start with a little background.

Because continuous delivery aims to support increased deployment rates into production, it becomes especially important to test the deployment mechanism itself (including rollbacks). The challenge, though, is that any given production deployment is a one-time, high-stakes activity. So we need a way to know that the deployment is going to work.

Continuous delivery solves this through something called the deployment pipeline. This is a metaphorical pipeline carrying work from the developer’s machine all the way through to the production environment. The key insight from a deployment testing perspective is that earlier stages of the pipeline (development, continuous integration) involve high-volume, low-risk deployment activity, whereas later stages (systems integration testing, production) involve low-volume, high-risk deployment activity. If we make earlier stages as production-like as possible and we use the same deployment automation throughout the pipeline, then we have a pretty good way to ensure that production deployments will work. Here’s an example of such a pipeline; your environments may be different depending on the needs of your organization:

The area of any given environment in the pyramid represents the volume of deployment activity occurring in that environment. For development, the volume is large indeed since it happens across entire development teams. But notice that everything other than the production deployment itself helps test the production deployment.

As you can see, even the developer’s local development environment (e.g., the developer’s workstation or laptop) should be a part of the pipeline if feasible, since that’s where the greatest deployment volume occurs. One way to do this, for instance, is to run local production-like VMs (say via Vagrant and VirtualBox), and then use configuration management tools like Chef or Puppet along with common app deployment tools or scripts throughout the pipeline.
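
Here’s a hedged sketch of the “same automation everywhere” idea; the host names and repo labels are hypothetical, and a real version would call into Chef/Puppet and an artifact repository, but the point is that only the environment parameters vary, never the deployment code itself:

ENVIRONMENTS = {
    # Same deploy code path everywhere; only the targets and sizes differ.
    "dev":  {"hosts": ["localhost"], "repo": "snapshots"},
    "ci":   {"hosts": ["ci-app01"], "repo": "snapshots"},
    "sit":  {"hosts": ["sit-app01", "sit-app02"], "repo": "releases"},
    "prod": {"hosts": ["prod-app%02d" % i for i in range(1, 21)], "repo": "releases"},
}

def deploy(app, version, env):
    cfg = ENVIRONMENTS[env]
    for host in cfg["hosts"]:
        # In real life: fetch the versioned artifact, run the configuration
        # management tool, restart services, verify health.
        print("deploying %s-%s from %s to %s" % (app, version, cfg["repo"], host))

deploy("myapp", "1.4.2", "dev")    # developers rehearse the production path every day
deploy("myapp", "1.4.2", "prod")   # so the real production run holds few surprises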

With that background in place, we’re now in position to understand why development siloing is bad for continuous delivery. When development teams see themselves as wholly separate from the operations side of the house, two major problems arise.

Problem 1: Production deployments aren’t sufficiently tested

This happens because the siloed development and operations teams use different deployment systems. In one example with which I’m personally familiar, the development team wrote its own deployment automation system, and stored its app configuration and versionless binaries in Amazon S3. The ops team on the other hand used a hybrid Chef/custom deployment automation, sourced app configuration from Subversion and versioned binaries from Artifactory.

Generically, here’s what we have:

The earlier pipeline stages use a completely different configuration management scheme than the later stages. Because of the siloing, only a small region of the pyramid tests the production deployment. So when it’s time to deploy to SIT, there’s a good chance that things won’t work. And that’s true of production too.

The next problem is closely related, and even more serious.

Problem 2: Impedance mismatch between development and operations

Having two disjointed pipeline segments means that there’s an impedance mismatch that absolutely prevents continuous delivery from occurring in anything beyond the most trivial of deployments:

From personal experience I can tell you that this impedance mismatch is a continuous delivery killer. Keep in mind that the whole goal of continuous delivery is to minimize cycle time, which is the time between the decision to implement a change (feature, enhancement, bugfix, whatever) and its delivery to users. So if you have a gap in the pipeline, where people are having to rebuild packages because development’s packaging scheme doesn’t match operations’ packaging scheme, painstakingly copy configuration files from one repository to another, and so forth, cycle time goes out the window. Add to that the fact that we’re not even exercising the deployment system on the deployment in question until late in the development cycle, and cycle time takes another hit as the teams work through the inevitable problems and miscommunications.

Avoiding the impedance mismatch

In Continuous Delivery, Humble and Farley make the point that while they’re generally supportive of a “let-many-flowers-bloom” approach to software development, standardized configuration management and deployment automation is an exception to the rule. Of course, there will be differences between development and production for cost or efficiency reasons (e.g., we might provision VMs from scratch with every production release, but this would be too time-consuming in development), but the standard should be production, and deviations from that in earlier environments should be principled rather than gratuitous.

So to avoid the impedance mismatch, it’s important to educate everybody on the importance of standardizing the pipeline across environments. If there’s a central release team, then that means all the development teams have to use whatever configuration management infrastructure it uses, since otherwise we’ll end up with the disjointed pipeline segments and impedance mismatch. But even if development teams can push their own production changes, it’s worth considering having all the teams use the same configuration management infrastructure anyway, since this approach can create deployment testing economies.

Some teams resist standardization instinctively, largely because they see it as stifling innovation or agility. Sometimes this is true, but for continuous delivery, standardization is required to deliver the desired agility. It can be useful to highlight cases where they accepted standardization for good reasons (e.g., standardized look and feel across teams for enhanced user experience, standardized development practices reflecting lessons learned, etc.), and then explain why continuous delivery is in fact another place where it’s required.

One sort of objection I’ve heard to a standardized pipeline came from the idea that the (internal) development team was essentially a third-party software vendor, and as such, ought not have to know anything about how the software is deployed into production. In particular it ought not have to adopt whatever standards are in place for production deployments.

This objection raises an interesting issue: it’s important to establish the big-picture model for how the development and operations teams will work together. If the development team is really going to be like a third-party vendor, where it’s independent of any given production environment, then it’s correct to decouple its development flow from any given flow into production. But then you’re not going to see continuous delivery any more than you would expect to see continuous delivery of software products from a vendor like Microsoft or Atlassian into your own production environment. Here leadership will have to choose between the external vs. internal development models. If the decision is to pursue the internal development model and continuous delivery, then pipeline standardization across environments is a must.


Large-scale continuous integration requires code modularity

Where large development teams and codebases are involved, code modularity is a key enabler for continuous delivery. At a high level this shouldn’t be too terribly surprising—it’s easier to move a larger number of smaller pieces through the deployment pipeline than it is to push a single bigger thing through.

But it’s instructive to take a closer look. In this post we’ll examine the relationship between continuous integration (which sits at the front end of the continuous delivery deployment pipeline) and code modularity. Code modularity helps at other points in the pipeline—for example, releases—but we’ll save that for perhaps a future post.

The impact of too many developers working against a single codebase

For a given codebase, continuous integration (CI) scales poorly with the number of developers. Fundamentally there are a couple of forces at work here: with increasing developers, (1) the size of the codebase increases, and (2) commit velocity increases. These forces conspire in some nasty ways to create a painful situation around builds. Let’s take a look:

  • Individual builds take longer. As the size of the codebase increases, it obviously takes longer to compile, test, deploy, generate reports and so forth.
  • More broken builds. Even if developers are disciplined about running private builds before committing, any given commit has a nonzero chance of breaking the build. So the more commits, the more broken builds.
  • A broken build has a greater impact. In a “stop the line” shop, more developers means more people blocked when somebody breaks the build. In other shops, people just keep committing on top of broken builds, making it more difficult to actually fix the build. Either way it’s bad.
  • Increased cycle times. After a certain point, the commit velocity and individual build times, taken together, become sufficiently high that the CI builds run constantly throughout the workday. In effect the build infrastructure is unavailable to service commits on a near-real-time basis, which means that developers must wait longer to get feedback on commits. It also means that when builds do occur, they involve multiple stacked-up commits, making it less clear exactly whose change broke the build. This again increases feedback cycle times. (Note that there are some techniques outside of modularization that can help here, such as running concurrent builds on a build grid.) Once the feedback cycle takes more than about ten or fifteen minutes, developers stop paying attention to build feedback.
  • Individual commits become more likely to break the build. Even though the global commit velocity increases, individual developers may commit less often because committing is a painful and risky activity. Changelist sizes increase, which makes any given commit more likely to result in a broken build.
  • Delayed integration. Painful and risky builds create an incentive to develop against branches and merge later, which is exactly the opposite of continuous integration. Integrations involving such branches consume disproportionately more time.
  • General disruption of development activities. Ultimately the problems above become very serious indeed: developers spend a lot of time blocked, and the situation becomes a huge and costly distraction for both developers and management.
  • Difficult to make improvements. When everybody is working on the same codebase, it’s harder to see where the problems are. It could be that a certain foundational bit of the architecture is especially apt to break the build, but there aren’t enough tests in place for it. (Meanwhile some other highly stable part of the system is consuming the “build budget” with its comprehensive test suite.) Or perhaps certain teams have better processes in place (e.g., a policy of running private builds prior to committing) than others. Or it may be that some individual developers are simply careless about their commits. It’s hard to know, and thus difficult to manage and improve.

There are various possible responses to the challenges above. One can, for example, scale the build infrastructure either vertically (e.g., more powerful build servers) or horizontally (e.g., build grids to eliminate build queuing). Another tactic is to manage test suites and tests themselves more carefully: individual tests shouldn’t run too long, test suites shouldn’t run too long, etc. Make sure people are using doubles (stubs, mocks, etc.) where appropriate. Etc. But such responses, while genuinely useful, are more like optimizations than root cause fixes. Vertical scaling eventually hits a wall, and horizontal scaling can become expensive if resources are treated as if they’re free, which often happens with virtualized infrastructure. Limiting test suite run times is of course necessary, but if it’s done over too broad a scope, it results in insufficient coverage.

The root cause is too many cooks in the kitchen.

Enable continuous integration by modularizing the codebase

It would be incorrect to draw the conclusion that continuous integration works only for small teams. CI works just fine even with large teams developing to large codebases. The trick is to break up the codebase so that everybody isn’t committing against the same thing. But what does that mean?

Here’s what it doesn’t mean: it doesn’t mean that each team should branch the codebase and work off of branches until it’s time to merge. This just creates huge per-branch change inventory that has to be integrated at the end (or more likely toward the middle) of the release. Again this is the opposite of continuous integration.

Instead, it’s the codebase itself that needs to be broken up. Instead of one large app or system with a single source repo and everybody committing against the trunk, the app or system should be modularized. If we can carve that up into services, client libraries, utility libraries, components, or whatever, then we should do that. There’s no one-size-fits-all prescription for deciding when a module should get its own source repo (as opposed, say, to having a bunch of Maven modules in a single source repo), but we can apply judgment based on the coherence and size of the code as well as the number of associated developers.

Modularizing the code helps with the various continuous integration problems we highlighted above by reducing the size of the build, reducing the commit velocity, and removing incentives to delay integration. It has other important advantages outside of continuous integration, such as decoupling teams from a release planning perspective, making it possible to be more surgical when doing production rollbacks, and so forth. But the advantages to continuous integration are huge.

Note that code modularization brings its own challenges. Code modules require interfaces, which in turn require coordination between teams. SOA/platform approaches will likely require significant architectural attention to address issues of service design, service evolution, governance and so forth. Moreover there will need to be systems integration testing to ensure that all the modules play nicely together, especially when they are evolving at different rates in a loosely coupled fashion. But the costs here are enabling in nature, with a return on investment: greater architectural integrity and looser coupling between teams. The costs we highlighted earlier in the post are pure technical debt.


Continuous integration with GitHub, Bamboo and Nexus

This is the first in what will be a series of posts on how to establish a continuous delivery pipeline. The eventual goal is to have an app that we can push out into production anytime we like, safely and with little effort. But we’re going to start small and proceed incrementally, which is the best way to undertake this sort of effort.

In this post we’ll start by establishing continuous integration for an arbitrary Java/Maven-based open source library project. It shouldn’t take more than an afternoon to do this more or less from scratch if you’re reasonably familiar with the technologies in question, except for the part where we have to get a Maven repo. We have to wait for Sonatype to approve the repo, but they’re pretty fast about it.

I set this up for an open source library project I’m doing called Kite. The details of the project itself aren’t important for this exercise, but the open source library part is:

  • Libraries (as opposed to apps) are easier to deal with, because deployment is just a matter of getting the binary into a Maven repo, as opposed to getting it running on a live server.
  • Open source means that we can get a bunch of infrastructure freebies.

Here’s what it will look like at the end of this post: sources hosted on GitHub, Bamboo building the project on every commit, and successful builds publishing snapshot artifacts to a Sonatype-hosted Nexus Maven repo.

So if you have an open source library you’ve been wanting to develop, now’s the time to get started!

Find a place to host your project sources

The right host depends on which SCM you prefer. If you like Git, then GitHub and Bitbucket are two obvious (free) choices. Bitbucket supports Mercurial as well, and unlike GitHub it offers free private repos in addition to free public ones. Anyway, I chose GitHub for Kite. Here’s the repo: GitHub Kite repo

Set up a continuous integration server

The next step is to set up a continuous integration server. Hudson and Jenkins are both pretty popular open source offerings, but I chose Atlassian’s Bamboo because the UI is a lot more polished. Bamboo is free for open source projects, and even if you decide to buy it, the starter license is only $10 (proceeds go to charity) and gives you a local build agent. Anyway, you can’t go wrong with any of these, so choose whichever makes sense for you; they’re all easy to set up.

For Bamboo, you’ll also need to install Java, Tomcat (or whichever servlet container you like), Git and Maven 3 on the server. Bamboo is a Java web app, which is why you need Java and Tomcat; it uses Git to pull the code down from GitHub and Maven to build it. Once you’ve installed those packages, you can do all of the configuration through the Bamboo admin console, and it’s very easy.

After manually installing Ubuntu 11.10 Server, I used Chef to set up everything on my Bamboo server except deploying the Bamboo WAR itself, since Chef has community cookbooks for Java, Tomcat, Maven and Git. (Note, however, that the URL in the Maven 3 recipe is currently wrong, so if you go that route you’ll need to update the URL and the checksum.) Jo Liss has a great tutorial on Chef Solo if you’re interested in giving this a shot. But you don’t have to use Chef; your native package manager works fine too.

Once you have the executables all set up, you’ll need to create a build plan that slurps source code from GitHub and builds it with a Maven 3 task.

For the build plan I created three separate stages: a commit stage, an acceptance stage and a deploy stage. The idea, following Continuous Delivery, is to separate the fast-but-not-so-comprehensive feedback from the slow-but-more-comprehensive feedback, and to run them as different stages in the build. That way you find out right away in most cases when you break the build (the commit stage fails), and you still find out soon enough in those cases where the breakage involves some kind of integration or business acceptance criterion. So in my Bamboo plan I run the commit stage first, then run integration tests during the acceptance stage, and finally deploy the snapshot build to my Maven repo using mvn deploy if the previous two stages pass. That makes the snapshot available to others for their own continuous integration.
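On the Maven side, one reasonable way to support that split (a sketch, not necessarily how you’d configure it) is to let Surefire run the fast unit tests as usual and bind the Failsafe plugin to the integration-test phase, so the acceptance stage can invoke the integration tests separately:

<!-- Sketch: Failsafe runs *IT.java classes during mvn verify, leaving the
     regular unit tests to Surefire during mvn test. Version is illustrative. -->
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-failsafe-plugin</artifactId>
            <version>2.12</version>
            <executions>
                <execution>
                    <goals>
                        <goal>integration-test</goal>
                        <goal>verify</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

With something like that in place, the commit stage can run mvn clean test, the acceptance stage mvn verify, and the deploy stage mvn deploy.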

Let’s look at the Maven repo part now.

Set up a Maven repo

Sonatype offers a hosted Nexus-based Maven repo for free to open source projects. Moreover they’ll push your artifacts out to Maven central for you as well. So if that sounds good to you, here are instructions on signing up.

JFrog’s Artifactory is an option as well. As you can see from the link, there’s an open source option. We use this at work and it’s pretty nice.

For Sonatype, you’ll need to create a JIRA ticket to get the repo, as per the instructions above. Here’s mine.

Also, your POM will need to conform to certain requirements; see this example for the POM that I’m using for Kite. Nothing too out of the ordinary, though do take note of the parent POM.
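For reference, here’s a rough sketch of the shape Sonatype expects (the coordinates are hypothetical, and the parent POM version changes over time, so follow their current instructions rather than copying this verbatim):

<project>
    <modelVersion>4.0.0</modelVersion>

    <!-- Sonatype's OSS parent POM wires in the snapshot/staging repos and the
         release profile; the version here is just illustrative -->
    <parent>
        <groupId>org.sonatype.oss</groupId>
        <artifactId>oss-parent</artifactId>
        <version>7</version>
    </parent>

    <groupId>com.example</groupId>
    <artifactId>my-library</artifactId>
    <version>0.1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <!-- Sonatype also wants project metadata: name, description, url,
         licenses, scm and developers -->
    <name>My Library</name>
    <url>https://github.com/example/my-library</url>
</project>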

Anyway, once you have your Maven repo, practice deploying manually from the command line via mvn deploy just to make sure you’re able to push artifacts. If that works, then you’ll want to try it from Bamboo. Be sure to update Bamboo’s copy of Maven’s settings.xml so that Bamboo can authenticate against the Sonatype repo:

<?xml version="1.0" encoding="utf-8"?>
<settings>
    <servers>
        <!-- Credentials for the Sonatype OSS snapshot repo -->
        <server>
            <id>sonatype-nexus-snapshots</id>
            <username>your_username</username>
            <password>your_password</password>
        </server>
        <!-- Credentials for the Sonatype OSS staging (release) repo -->
        <server>
            <id>sonatype-nexus-staging</id>
            <username>your_username</username>
            <password>your_password</password>
        </server>
    </servers>
</settings>
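The server ids above have to match the repository ids declared in your POM’s distributionManagement section. If you inherit from the Sonatype OSS parent POM you get this for free; it amounts to something like the following (the URLs are Sonatype’s standard OSS hosting endpoints as of this writing, so double-check them against their docs):

<distributionManagement>
    <snapshotRepository>
        <id>sonatype-nexus-snapshots</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
    </snapshotRepository>
    <repository>
        <id>sonatype-nexus-staging</id>
        <url>https://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
    </repository>
</distributionManagement>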

Once you have it working, you should be able to push your snapshots over to Sonatype. Here’s the Nexus UI, and here’s the raw directory browser view.

Conclusion

Though it takes a bit of effort to set this up, at the end of the day you have a good foundation for your continuous delivery efforts.

In the next post in this series, we’ll expand our deployment pipeline so it can handle pushing a web application to a live, cloud-based container.


When building a CMDB, separate the UI from the API

One lesson I’ve learned in building CMDBs is to cleanly separate the UI from the web service API. In the Java world, for example, this means that the API should be its own deployable artifact (e.g., a WAR file). The UI should be separate, whether that’s another WAR, a rich client (Flex, JavaFX, etc.), a GWT app, or whatever. Various configuration and release management tools (e.g., uDeploy, DeployIt, Chef) work this way, and with good reason. Here are some advantages of the approach.

Keeps everything automatable. This is by far the most important benefit. Ultimately you’re establishing a CMDB because you want to integrate and automate, so the data needs to be readily accessible to automation. If you treat the UI as just another client of a web service API, then you have an architecture that forces you to implement everything as a service, with the result that you can automate anything that the UI can do. And it will also force you to establish a proper discipline around defining and evolving the API, which is critical to your effort’s overall success.

Makes the API more robust. The CMDB’s API is the heart and soul of your devops platform, and cleaving away the UI makes the API more robust. Somebody won’t accidentally break the API while making a change to the UI.

Allows the UI and API to scale independently. In the early days it may be that most of what you do with your CMDB is user-driven, but the goal over time is once again to automate your environment. Eventually you should expect that your API will get lots of traffic because you have automation making all kinds of requests (reads and writes) against it, and the UI will receive significantly less use. It’s useful to be able to scale those independently since they have different expected load patterns.

Facilitates testing. In some cases it may be easier to test your API if the UI is separate. You don’t have to worry about interactions with UI components, UI configuration and the like.


Tried everything and SSH with PKA still not working?

I recently ran into a situation in which I couldn’t get PKA (public key authentication) to work when SSHing into my Ubuntu server. I checked the key pair (it works fine for SSHing into other servers), directory permissions, /etc/ssh/sshd_config, /var/log/auth.log, all that. I ran ssh -vvv, but nothing obvious turned up beyond the fact that the server wasn’t accepting my public key. I’m not a systems guy, but I’ve set this up often enough that I couldn’t figure out for the life of me why it kept falling back to password authentication.

I finally found the answer: my home directory is encrypted. SSH can’t read ~/.ssh/authorized_keys until I’ve logged in, so it rejects the public key authentication and falls back to password.

The solution is to place the authorized_keys file in an alternative location (e.g., /etc/ssh/<username>/authorized_keys), reconfigure sshd_config to point at that location, set the permissions, and restart the SSH daemon. The details are here, under Troubleshooting.
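Here’s a rough sketch of those steps on Ubuntu (the username alice, the exact paths and the permissions are assumptions; adapt them to your setup and distro):

# Put authorized_keys somewhere sshd can read it before the encrypted home is mounted
sudo mkdir -p /etc/ssh/alice
sudo cp /home/alice/.ssh/authorized_keys /etc/ssh/alice/
sudo chown -R alice:alice /etc/ssh/alice
sudo chmod 755 /etc/ssh/alice
sudo chmod 644 /etc/ssh/alice/authorized_keys

# In /etc/ssh/sshd_config, point sshd at the new location (%u expands to the username):
#   AuthorizedKeysFile /etc/ssh/%u/authorized_keys

# Restart sshd so the change takes effect
sudo service ssh restart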

Here’s another post on the same topic.

Hope that helps somebody out. It was driving me bonkers.
