Top ten pops and drops with Splunk, part 1

In real-time operational contexts, it’s important to monitor system metrics for major swings in value. Such swings, whether upward (“pops”) or downward (“drops”), often indicate a problem that’s either on the horizon or already in progress.

This can be a little tricky in cases where we have a vector of closely-related metrics that compete for attention, because we have to decide how to resolve that competition. Consider for example application events. We may have hundreds or even thousands of event codes. If one of them is having huge day-over-day pops in volume all day long, then that’s easy: the event code makes the top ten list. If another event code is stable, then that’s easy too: it doesn’t make the list. But what happens when we have an event code that has largish pops all day long and another event code that has a single huge spike in volume? Which wins?

This post is the first in a two-part series that explains how to identify the most volatile components in a vector and then visualize their respective time series. In this first post we’ll focus on the identification part, and in the second one we’ll do the time series visualization. Ultimately we want a single chart that shows for each of the top ten its corresponding pop/drop time series (e.g., against the hour of the day), like so:

We also want to be able to zero in on any given event:

(In the charts above I’ve suppressed the labels from the legend.)

Let’s dive in.

Measuring change

Ultimately we want to measure volatility, which is roughly aggregate change, whether upward or downward, over some given time period. But to get there we need to start by measuring change in its unit form.

To keep things concrete we’ll look at day-over-day changes broken down by hour. But the same concepts apply independently of the timescales and timeframes involved.

Also for concreteness, we’ll focus on a vector of event codes. But again the concepts are much more general. Here are some examples:

  • the view counts for a set of web pages
  • the response times for a set of service calls
  • the lead submission counts for a set of lead vendors

We can measure change for just about any numeric quantity we like. To keep things exciting (not that this isn’t already exciting enough), we’ll walk through the various attempts I made, since the reasons for rejecting the earlier attempts motivate the final approach.

Attempt #1: Ratios. Ratios are a good starting point for measuring change: what’s the ratio of events (of some given type) today during the 2am hour as compared to yesterday, for instance? Ratios are better than raw counts here because they have a normalizing effect.

The challenge with ratios is that they’re asymmetric about r = 1.0. Pops start at r > 1.0, and they can grow without bound. Drops, on the other hand, are confined to the interval (0, 1). So when you see r = 23.0 and r = 0.0435, it’s not immediately clear that the changes involved differ in direction but not magnitude.

Attempt #2: Log ratios. One potential solution is to use log ratios, say log base 2 since we’re computer people here. Log ratios fix the asymmetry: log_2(23) = 4.5236, whereas log_2(1/23) = -4.5236.

Log ratios aren’t perfect either though. Most of us, despite being computer people, are probably more comfortable reasoning with ratios than we are with log ratios. If you tell me that the log ratio for some event code is 5, I can probably translate that to a ratio of 32/1 easily enough. But if you tell me the log ratio is 13.2, I’m not personally able to arrive at the actual ratio as quickly as I’d like.

Using base 10 might make it a little easier, but the bottom line is that it’s more intuitive to talk about the actual ratios themselves.

Final solution: Modified ratios. Our approach will be to use signed, unbounded ratios. The numerator is always the larger quantity. For instance, if yesterday there were 100 events of a type and today there were 500, then the change score is 5.0 (a pop). If instead there were 500 events yesterday and only 100 today, then the change score is -5.0 (a drop).

This approach is pretty intuitive. If I tell you that the change score was -238.5, it’s immediately clear what that means: yesterday there were 238.5 times as many events during some given hour as there were today. Pretty big drop, ratio-wise.
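
To make the rule concrete, here’s a minimal sketch of the change score in Python. The function and variable names are mine (nothing here is Splunk-specific), and it assumes both counts are positive, so there’s no division by zero to worry about:

    def change_score(yesterday, today):
        """Signed, unbounded ratio: positive for pops, negative for drops.

        Illustrative helper; assumes both counts are positive, so a zero
        count would need a guard before dividing.
        """
        if today >= yesterday:
            return today / yesterday    # pop (a score of 1.0 means no change)
        return -(yesterday / today)     # drop

    change_score(100, 500)   # ->  5.0, the pop from the example above
    change_score(500, 100)   # -> -5.0, the corresponding drop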

Now we have a change score. In the next section we’ll see how to aggregate hour-by-hour change scores into an overall measure of volatility.

Measuring volatility

Recall that we have a vector of event codes here, each with its own set of hourly change scores, and that vector might be pretty large. We need to know which ten or so of the hundreds or thousands of event codes are most deserving of our attention.

The approach again is to score each event code. We’ll create a volatility score here since we want to see the event codes with “more change” in some to-be-defined sense. In some cases we may care more about pops (or drops) than we do about volatility in general. Response times are a good example. That’s fine; it is easy to adapt the approach below to such cases.

For volatility, we’re going to use a sum of squares. The approach is straightforward: for any given event code, square each of its hourly change scores, and then add them all up. That’s volatility.

The sum of squares is nice for a couple of reasons:

  • Pops and drops contribute equally to overall volatility, since change scores for pops and drops are symmetric, and since the square of a change score is always positive.
  • Squaring the change scores amplifies the effect of large change scores. (If we had change scores in the interval (-1.0, 1.0) then it would also diminish their effect, which is sometimes useful. But in our case there are no such change scores.) So a huge pop that happened only once is still likely to make it into the top ten.
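
As a quick sanity check, here’s the same sum-of-squares idea as a minimal Python sketch. It reuses the change_score helper from the sketch above, and scores_by_event is an illustrative structure of my own (each event code mapped to its list of hourly change scores), not anything Splunk produces directly:

    def volatility(change_scores):
        """Sum of squared hourly change scores for one event code."""
        return sum(c * c for c in change_scores)

    def top_n(scores_by_event, n=10):
        """Rank event codes by volatility and keep the n highest.

        scores_by_event: dict mapping event code -> list of hourly
        change scores (illustrative names, not Splunk fields).
        """
        ranked = sorted(scores_by_event.items(),
                        key=lambda item: volatility(item[1]),
                        reverse=True)
        return ranked[:n]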

That’s the approach. It’s time to see the Splunk query and the result.

Splunk it out

Here’s a Splunk search query that yields today’s top ten most volatile event codes, measuring changes day-over-day:

index=summary reporttype=eventcount reporttime=hourly earliest=-1d@d latest=now
    | eval marker = if(_time < relative_time(now(), "@d"), "yesterday", "today")
    | stats sum(count) as jfreq by date_hour EventCode marker
    | eval tmp_yesterday = if(marker == "yesterday", jfreq, 0)
    | eval tmp_today = if(marker == "today", jfreq, 0)
    | stats max(tmp_yesterday) as yesterday, max(tmp_today) as today by date_hour EventCode
    | eval change = if(today >= yesterday, today / yesterday, -(yesterday / today))
    | stats sumsq(change) as volatility by EventCode
    | sort volatility desc
    | head 10

Note that we’re using a Splunk summary index here. This is important since otherwise the query will take too long.

Technically, here’s what the query does:

  1. Grab the (trivariate) joint frequency distribution over the variables (1) hour of day, (2) event code and (3) today-vs-yesterday.
  2. Extract change scores against hour of day and event code, considered jointly. Recall that this is our modified ratio involving today’s count and yesterday’s.
  3. For each event code, aggregate change scores into a volatility score as we described above.
  4. Show the ten event codes having the highest volatility.

You can see the effect of the individual commands by peeling them off from the end of the query. See the Splunk Search Reference for more information on what’s happening above.
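
If you ever want to sanity-check the logic outside Splunk, here’s a rough Python analogue of the same four steps, reusing the change_score and top_n sketches from earlier. It assumes, purely for illustration, that the raw counts have already been pulled into rows of (hour, event_code, day, count), with day being either "yesterday" or "today":

    from collections import defaultdict

    def top_ten(rows):
        """rows: iterable of (hour, event_code, day, count) tuples,
        with day in {"yesterday", "today"}. Mirrors steps 1-4 above."""
        # Step 1: joint frequency over (hour, event code, day).
        counts = defaultdict(int)
        for hour, code, day, count in rows:
            counts[(hour, code, day)] += count

        # Step 2: change score per (hour, event code) pair.
        scores_by_event = defaultdict(list)
        for hour, code in {(h, c) for (h, c, _) in counts}:
            yesterday = counts.get((hour, code, "yesterday"), 0)
            today = counts.get((hour, code, "today"), 0)
            if yesterday > 0 and today > 0:    # skip hours missing a count
                scores_by_event[code].append(change_score(yesterday, today))

        # Steps 3 and 4: volatility per event code, then the top ten.
        return top_n(scores_by_event, n=10)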

Looking at the chart view yields the following:

Along the x-axis we have a bunch of event codes (I’ve suppressed the labels), and on the y-axis we have volatility. This chart makes it immediately clear which ten of the hundreds of event codes are seeing the most action.

Conclusion

Even though this isn’t the whole story, what we have so far is already pretty useful. It tells us at any given point in time which event codes are most deserving of our attention from a volatility perspective. This is important for real-time operations.

Recall that we can generalize this to other sorts of metric as well, and also to different timescales and timeframes.

Geek aside: There’s a relationship between what we’re doing here and linear regression from statistics. With linear regression, we start with a bunch of data points and look for a linear model that minimizes the sum of squared errors. We’re doing something loosely analogous here, but in the opposite direction. We start with a linear baseline (change score = 1, representing no change; this isn’t precisely correct, but it’s the basic idea we’re after here), and then treat each event code as a set of data points (change score vs. hour of day) against that baseline. The event codes that maximize the sum of squares are the worst-fit event codes, and thus the ones most worthy of our attention.

In the next post we’ll see how to visualize the actual pops and drops time series for the top ten.

Acknowledgements

Thanks to Aman Bhutani for suggesting the pops and drops approach in the first place, and to both Aman Bhutani and Malcolm Jamal Anderson for helpful discussions and code leading to the work above.
