Average latency can be a misleading way to monitor website performance

How do you know if your website is running slowly? To answer this question, we often spot-check a single metric ... and that metric is usually the Average Response Time.
If the average response time is low, things are OK; if it's high or spiking, you have a problem.

Unfortunately, this turns out to be a pretty bad way of tracking website performance.

At LeanSentry, we had to find a better way to monitor website performance for our users. This post is about the common fallacies of monitoring website performance with simple metrics like the avg. response time, and the approaches we chose instead.

(Note: We've gotten a lot of responses saying "duuuh, averages are bad!" and "use the 95th percentile". This post is about more than these simple statements. It's about the common logic errors that result, and a discussion of how to select metrics that help you make better decisions.)



Here are the main reasons why avg. latency causes more harm than good:

Reason #1: It hides important outliers.

Average latency is fine, but we have 20% slow requests!

Imagine your application is serving 100 requests / minute:

A. 60 of those requests have a usual latency of 200ms.
B. 20 are really fast at 10ms (they got served from the cache).
C. 20 are hitting a database lock and are taking 10 seconds.

Your average latency for those requests will be 2.1 seconds. Looks fine, right?

Wrong. 20% of your users are hitting unacceptable load times and are leaving your site.
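
To make the arithmetic concrete, here's a quick Python sketch of that request mix (the 2-second "unacceptable" cutoff is just an assumption for illustration):

    # Hypothetical request mix from the example above (all values in ms).
    latencies_ms = [200] * 60 + [10] * 20 + [10_000] * 20

    avg_ms = sum(latencies_ms) / len(latencies_ms)
    slow = [l for l in latencies_ms if l >= 2_000]  # assumed "unacceptable" cutoff: 2s

    print(f"average latency: {avg_ms:.0f} ms")                      # ~2122 ms, looks OK
    print(f"slow requests:   {len(slow) / len(latencies_ms):.0%}")  # 20% of users suffering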

This is a classic effect of averages, which are fine when you care primarily about the total of the quantity being averaged rather than about the individual samples.

For example, avg. daily revenue works fine for your ecommerce website, because you mostly care about the total amount of money you made in a month and not the exact sales on each day. If the average got you to the total you needed with no particularly bad sales days, it wouldn't matter exactly how you arrived there.

However, here you care more about how many users experienced unacceptable delays, regardless of how fast other users' experience was … so the average site latency is a bad predictor of user “satisfaction” when 20% of your users are unhappy with your site. It's not like a user who received a particularly snappy load time will somehow compensate for another user leaving your site because it's too slow.

And the worst part of it is … imagine that the 20% of your requests that are taking too long to load are actually all to your Signup, Login, or Checkout pages. Your site is 100% unusable, but looking at the average latency tells you everything is fine because requests to your non-critical pages (e.g. fancy footer images) outnumber requests to your key pages 4:1.

Reason #2: What latency are you really measuring?

Different URLs in your application have completely different concepts of what’s “normal” latency and what isn’t.

For example, if an image that is typically kernel-cached usually takes 2ms to serve, and all of a sudden it takes 5 seconds, that is probably a cause for concern. However, if your login page takes 5 seconds to authenticate the user, that may not be so bad!

When you average them together, you get a Frankenstein latency number that doesn't really tell you whether your site is doing well, poorly, better, or worse, because you don't know what that latency is for.

It's even dangerous to compare the latency for a specific site over time. If the proportions of request traffic change (e.g. there are 4 more images on your homepage), the expected avg. latency for the site as a whole changes completely.
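
A quick sketch of the mix effect (the URLs, latencies, and hit counts below are made up for illustration): the per-URL latencies never change, yet the site-wide average "improves" just because more cheap image requests show up.

    # Per-URL latencies stay constant; only the traffic mix changes.
    PAGE_MS, IMAGE_MS = 800, 2  # assumed latencies for a dynamic page and a cached image

    def site_average(page_hits, image_hits):
        total_ms = page_hits * PAGE_MS + image_hits * IMAGE_MS
        return total_ms / (page_hits + image_hits)

    print(site_average(100, 100))  # 401 ms
    print(site_average(100, 500))  # 135 ms -- "faster" site, yet nothing actually changed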

Reason #3: It's really skewed.

Average latency spiking - site-wide problem or outlier?

The average latency on the site just spiked to 30 seconds! Something is terribly wrong, you must restart your database and recycle your web servers immediately to fix it.

(NOTE: I use this as an example of a panicked reaction to spiking response times on a production site. I DO NOT actually recommend recycling your servers as a general solution to latency spikes.)

Wrong again. Turns out, this happened:

A. One request was made to /Products/PurplePantaloons, which triggered a cache miss (because you don’t actually sell Purple Pantaloons) and then got stuck in some almost-never-hit database query to log a missing product id for 5 minutes.

B. Your other 10 requests to other pages on your site completed normally in 200ms on average.

Your average latency therefore was ~30 seconds!

The problem again is that averages characterize the total quantity being measured, not the number of samples affected. So the average latency does not tell you how many users were affected by the problem. The bigger the outlier, the more the avg. latency number appears to reflect a huge global problem, regardless of the fact that it is still only 1 outlier.
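
Running the numbers from this example (again with an assumed 2-second "too slow" cutoff) shows how little the average says about how many users were actually affected:

    latencies_s = [300] + [0.2] * 10  # one 5-minute outlier plus ten normal requests

    avg_s = sum(latencies_s) / len(latencies_s)
    affected = sum(1 for l in latencies_s if l > 2)  # assumed "too slow" threshold: 2s

    print(f"average latency: {avg_s:.1f} s")  # ~27.5 s -- looks like a site-wide disaster
    print(f"affected users:  {affected} of {len(latencies_s)}")  # only 1 of 11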

But we still need a global metric to measure site performance … how can we do better?

That's right. We still wanted something to serve as a Key Performance Indicator (KPI), so we can make go/no-go decisions about site performance ... without having to resort to tables of URLs or searching through request logs looking for outliers (imagine how much later we'd be home then!).

We can do better by recognizing that we don't really care about the latency itself as much as we care about the number of users affected by performance problems on the site. We want to know how well the site is doing overall, and also pick up on any outliers where some subset of the users may be affected even if the rest are fine.

Turns out there is already a metric that does it fairly well - the industry standard Apdex index. Apdex was created by Peter Sevcik of Apdex Alliance (apdex.org), and is used by several existing APM tools today.

The Apdex index is calculated like this:

Apdex_T = (# Satisfied requests + # Tolerating requests / 2) / (# Total requests)

T is the selected latency threshold. A request is considered satisfied if its latency is < T, tolerating if it's between T and 4*T, and frustrated otherwise. So, if T is 2 seconds, a request that completes under 2 seconds leaves the user satisfied, requests that take between 2 and 8 seconds are tolerated by users, and requests that take longer than 8 seconds are considered too slow.

Let's look at our examples:

A. 60% of requests take 200ms, 20% take 10ms, and 20% are getting unacceptable latencies of 10 seconds.

Avg. latency is 2.1s (false sense of confidence), Apdex_2s is 80% (correctly flagging the 20% of frustrated users). Nailed it!

B. 1 request takes 5 minutes, 10 requests take 200ms.

Avg. latency is 30 seconds (ring the alarm!), Apdex_2s is 91% (good).
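
Here is a minimal sketch of the Apdex calculation applied to both examples; the function is my own illustration, but it follows the formula and thresholds defined above.

    def apdex(latencies_s, T):
        """Apdex_T = (# satisfied + # tolerating / 2) / # total."""
        satisfied = sum(1 for l in latencies_s if l < T)
        tolerating = sum(1 for l in latencies_s if T <= l <= 4 * T)
        return (satisfied + tolerating / 2) / len(latencies_s)

    # Example A: 60 requests at 200ms, 20 at 10ms, 20 at 10 seconds
    print(apdex([0.2] * 60 + [0.01] * 20 + [10] * 20, T=2))  # 0.80

    # Example B: one 5-minute outlier, ten requests at 200ms
    print(apdex([300] + [0.2] * 10, T=2))  # ~0.91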

The Apdex index does a good job of providing that top-level sense of how well your site is doing, highlights whenever even a small percentage of your users are experiencing slow performance, and does not allow itself to be skewed by requests with huge latencies.

Making it better

When making LeanSentry, we decided to build on Apdex as a top-level website performance metric. However, we still had the problem of different URLs having different latency requirements. We also wanted to make Apdex easier to understand for our users, and provide a pathway to dig into performance problems after the top-level score highlighted them.

Here are the improvements we made:

  1. Instead of using the Apdex index, we decided to use a slightly modified Satisfaction Score, which is essentially the Apdex index converted into a percentage for easier understanding. Also, our satisfaction score is not qualified with the T subscript, because it represents requests to multiple URLs in the application which have different T thresholds (see the next point).

  2. We classified each request into the standard Apdex categories of satisfied, tolerating, or frustrated, but based on a separately set latency threshold T for each major URL in your application (to solve the problem of different URLs having different acceptable latencies). For the Satisfaction Score of the entire site, we apply the formula over all requests that have already been pre-categorized using their individual thresholds (see the sketch after this list).

    Satisfaction breakdown for an application
    Track slow, failed, and sluggish requests over time

    This way, when our users log into LeanSentry, they can quickly assess the overall impact of website performance on their application. Then, they can also visualize the number of slow requests over time in relation to other events in the application.

  3. Finally, we take advantage of this concept throughout the troubleshooting process, allowing the user to drill through to see which URLs had slow or failed requests, and then diagnose the source of the slowdown on each URL.


    Track down slow or failed requests on each URL
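
For illustration, here is a rough sketch of how per-URL thresholds can feed a single site-wide Satisfaction Score, as described in points 1 and 2 above. The thresholds, URLs, and helper names are assumptions for the example, not LeanSentry's actual implementation.

    from collections import Counter

    # Assumed per-URL latency thresholds (seconds); real values would be tuned per URL.
    THRESHOLDS = {"/": 2.0, "/checkout": 4.0, "/img/footer.png": 0.1}

    def classify(url, latency_s):
        """Bucket a single request using its own URL's threshold T."""
        T = THRESHOLDS.get(url, 2.0)  # assumed default threshold of 2s
        if latency_s < T:
            return "satisfied"
        if latency_s <= 4 * T:
            return "tolerating"
        return "frustrated"

    def satisfaction_score(requests):
        """Apply the Apdex formula over pre-categorized requests, as a percentage."""
        counts = Counter(classify(url, latency) for url, latency in requests)
        total = sum(counts.values())
        return 100 * (counts["satisfied"] + counts["tolerating"] / 2) / total

    requests = [("/", 0.3), ("/checkout", 3.5), ("/img/footer.png", 0.8), ("/", 12.0)]
    print(f"{satisfaction_score(requests):.0f}%")  # 50% -- two of four requests frustrated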

In conclusion

You should strongly consider moving away from average latency, to metrics that help identify the true impact of slow performance on your application. The Apdex index, or our version of it, the Satisfaction Score, is a good way to achieve that in any environment.

The more general takeaway, though, is that it's important to be careful when using averages to draw conclusions about how things work, because averages can hide critical outliers. By the same token, the common approach of drawing conclusions from isolated occurrences (e.g. entries found in the request log) is also flawed, because those occurrences often represent isolated cases with little significance.

For LeanSentry, our answer was to do both - use the Satisfaction Score to characterize overall performance without losing the important outliers, and then track down slow requests across application URLs to diagnose problems in detail.