If the average response time is low, things are OK; if it's high or spiking, you have a problem.
Unfortunately, this turns out to be a pretty bad way of tracking website performance.
At LeanSentry, we had to find a better way to monitor website performance for our users. This post is about the common fallacies of monitoring website performance with simple metrics like the avg. response time, and the approaches we chose instead.
Here are the main reasons why avg. latency causes more harm than good:
Reason #1: It hides important outliers.
Imagine your application is serving 100 requests / minute:
A. 60 of those requests have a usual latency of 200ms.
B. 20 are really fast at 10ms (they got served from the cache).
C. 20 are hitting a database lock and are taking 10 seconds.
Your average latency for those requests will be 2.1 seconds. Looks fine, right?
Wrong. 20% of your users are hitting unacceptable load times and are leaving your site.
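To make the arithmetic concrete, here is a quick Python sketch of that traffic mix (a toy illustration, not production monitoring code):

```python
# 100 requests/minute: 60 normal, 20 cached-and-fast, 20 stuck on a DB lock.
latencies_ms = [200] * 60 + [10] * 20 + [10_000] * 20

avg_ms = sum(latencies_ms) / len(latencies_ms)
slow = sum(1 for t in latencies_ms if t >= 10_000)

print(f"average latency: {avg_ms / 1000:.2f}s")  # ~2.1s, the "fine-looking" number
print(f"requests over 10s: {slow} of {len(latencies_ms)}")  # 20% actually suffering
```

The average lands near 2 seconds even though a fifth of the traffic is taking 10 seconds.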
This is a classic property of averages: they work well when you primarily care about the total of the quantity being averaged.
For example, avg. daily revenue is a fine metric for your ecommerce website, because you mostly care about the total amount of money you made in a month and not the exact sales during each day. If the averages got you to the total you needed with no particularly bad sales days, it wouldn't matter exactly how you arrived there.
However, here you care more about how many users experienced unacceptable delays, regardless of how fast other users' experience was. So the average site latency is a bad predictor of user "satisfaction" when 20% of your users are unhappy with your site. It's not like a user who received a particularly snappy load time will somehow compensate for another user leaving your site because it's too slow.
And the worst part of it is … imagine that the 20% of your requests that are taking too long to load are actually all to your Signup, Login, or Checkout pages. Your site is 100% unusable, but looking at the average latency tells you everything is fine because requests to your non-critical pages (e.g. fancy footer images) outnumber requests to your key pages 4:1.
Reason #2: What latency are you really measuring?
Different URLs in your application have completely different concepts of what’s “normal” latency and what isn’t.
For example, if an image that is typically kernel cached usually takes 2ms to serve, and all of a sudden it takes 5 seconds, that is probably a cause for concern. However, if your login page takes 5 seconds to authenticate the user, that may be not so bad!
When you average them together, you get a Frankenstein latency number that doesn't really tell you whether your site is doing well, poorly, better, or worse, because you don't know what that latency is for.
It's even dangerous to compare the latency of a specific site over time. If the proportions of request traffic change (e.g. there are 4 more images on your homepage), the expected avg. latency for the site as a whole changes completely.
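A toy example of that mix-shift effect: the per-URL latencies never change, but adding fast image requests "improves" the site-wide average (the numbers here are hypothetical):

```python
def site_avg(mix):
    """mix: list of (request_count, latency_ms) per URL class."""
    total_requests = sum(n for n, _ in mix)
    return sum(n * t for n, t in mix) / total_requests

before = [(100, 500), (400, 20)]   # 100 page hits at 500ms, 400 images at 20ms
after  = [(100, 500), (800, 20)]   # homepage now pulls 4 more images per hit

print(site_avg(before))  # 116.0 ms
print(site_avg(after))   # ~73.3 ms, an "improvement" with zero code changes
```

Nothing got faster; the denominator just filled up with cheap requests.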
Reason #3: It’s really skewed.
The average latency on the site just spiked to 30 seconds! Something is terribly wrong, you must restart your database and recycle your web servers immediately to fix it.
(NOTE: I use this as an example of a panicked reaction to spiking response times on a production site. I DO NOT actually recommend recycling your servers as a general solution to latency spikes.)
Wrong again. Turns out, this happened:
A. One request was made to /Products/PurplePantaloons, which triggered a cache miss (because you don’t actually sell Purple Pantaloons) and then got stuck in some almost-never-hit database query to log a missing product id for 5 minutes.
B. Your other 10 requests to other pages on your site completed normally in 200ms on average.
Your average latency therefore was ~30 seconds!
The problem, again, is that an average preserves the total of the quantity being measured, not how many samples were affected. So the average latency does not tell you how many users were hit by the problem. The bigger the outlier, the more the avg. latency number appears to reflect a huge global problem, even though it is still only 1 outlier.
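The skew is easy to reproduce with the numbers from the example above:

```python
# One 5-minute outlier among ten perfectly normal 200ms requests.
latencies_s = [300.0] + [0.2] * 10

avg = sum(latencies_s) / len(latencies_s)
affected = sum(1 for t in latencies_s if t > 2.0)

print(f"average latency: {avg:.1f}s")  # ~27.5s, ring the alarm!
print(f"requests actually slow: {affected} of {len(latencies_s)}")  # just 1
```

A single stuck request moved the site-wide average by two orders of magnitude.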
But we still need a global metric to measure site performance … how can we do better?
That's right. We still wanted something to serve as a Key Performance Indicator (KPI), so we could make go/no-go decisions about site performance, without having to resort to tables of URLs or searching through request logs looking for outliers (imagine how much later we'd be home then!).
We can do better by recognizing that we don't really care about the latency itself as much as we care about the number of users affected by performance problems on the site. We want to know how well the site is doing overall, and also pick up on any outliers where some subset of users is affected even if the rest are fine.
Turns out there is already a metric that does it fairly well – the industry standard Apdex index. Apdex was created by Peter Sevcik of Apdex Alliance (apdex.org), and is used by several existing APM tools today.
The Apdex index is calculated like this:
ApdexT = (# Satisfied requests + (# Tolerating requests / 2)) / # Total requests
T is the selected latency threshold: a request is considered satisfied if its latency is < T, tolerating if it's between T and 4T, and frustrated otherwise. So, if T is 2 seconds, a request that completes under 2 seconds leaves the user satisfied, requests that take between 2 and 8 seconds are tolerated by users, and requests that take longer than 8 seconds are considered too slow.
Let's look at our examples:
A. 60% of requests take 200ms, 20% take 10ms, and 20% are getting unacceptable latencies at 10 seconds.
Avg. latency is 2.1s (false sense of confidence), Apdex2s is 80%. Nailed it!
B. 1 request takes 5 minutes, 10 requests take 200ms.
Avg. latency is 30 seconds (ring the alarm!), Apdex2s is 91% (good).
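Both scores are easy to verify with a few lines of Python (using the standard satisfied ≤ T / tolerating ≤ 4T boundaries):

```python
def apdex(latencies_s, t):
    """ApdexT = (satisfied + tolerating/2) / total, with the standard
    bands: satisfied <= T, tolerating <= 4T, frustrated otherwise."""
    satisfied = sum(1 for x in latencies_s if x <= t)
    tolerating = sum(1 for x in latencies_s if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies_s)

# Example A: 60 requests at 200ms, 20 at 10ms, 20 at 10 seconds.
example_a = [0.2] * 60 + [0.01] * 20 + [10.0] * 20
print(apdex(example_a, t=2.0))  # 0.8

# Example B: one 5-minute outlier, ten 200ms requests.
example_b = [300.0] + [0.2] * 10
print(apdex(example_b, t=2.0))  # ~0.91
```

In both cases the score reflects the fraction of users affected, not the size of the outlier.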
The Apdex index provides that top-level sense of how well your site is doing, highlights whenever even a small percentage of your users are getting slow performance, and does not allow itself to be skewed by requests with huge latencies.
Making it better
When making LeanSentry, we decided to build on Apdex as a top-level website performance metric. However, we still had the problem of different URLs having different latency requirements. We also wanted to make Apdex easier to understand for our users, and provide a pathway to dig into performance problems after the top-level score highlighted them.
Here are the improvements we made:
- Instead of using the Apdex index, we decided to use a slightly modified Satisfaction Score, which is essentially the Apdex index converted into a percentage for easier understanding. Also, our satisfaction score is not qualified with the T subscript, because it represents requests to multiple URLs in the application which have different T thresholds (see the next point).
- We classified each request into the standard Apdex categories of satisfied, tolerating, or frustrated, but based on a separately set latency threshold T for each major URL in your application (to solve the problem of different URLs having different acceptable latencies). For the Satisfaction Score of the entire site, we apply the formula over all requests that have already been pre-categorized using their individual thresholds.
This way, when our users log into LeanSentry, they can quickly assess the overall impact of website performance on their application. Then, they can also visualize the number of slow requests over time in relation to other events in the application.
- Finally, we take advantage of this concept throughout the troubleshooting process, allowing the user to drill through to see which URLs had slow or failed requests, and then diagnose the source of the slowdown on each URL.
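The per-URL-threshold idea can be sketched in a few lines of Python. This is a hypothetical illustration of the concept; the URL names, thresholds, and function names are made up and are not LeanSentry's actual implementation:

```python
# Illustrative per-URL thresholds; in practice these would be configured per app.
THRESHOLDS_S = {"/login": 5.0, "/img/footer.png": 0.1}
DEFAULT_T = 2.0

def classify(url, latency_s):
    """Classify one request using its own URL's threshold T."""
    t = THRESHOLDS_S.get(url, DEFAULT_T)
    if latency_s <= t:
        return "satisfied"
    if latency_s <= 4 * t:
        return "tolerating"
    return "frustrated"

def satisfaction_score(requests):
    """requests: iterable of (url, latency_s). Returns a percentage,
    applying the Apdex formula over pre-categorized requests."""
    counts = {"satisfied": 0, "tolerating": 0, "frustrated": 0}
    total = 0
    for url, latency in requests:
        counts[classify(url, latency)] += 1
        total += 1
    return 100 * (counts["satisfied"] + counts["tolerating"] / 2) / total

# A slow login can still be "satisfied", while a slow image is "frustrated".
reqs = [("/login", 3.0), ("/img/footer.png", 0.05), ("/img/footer.png", 5.0)]
print(round(satisfaction_score(reqs), 1))  # 66.7
```

The key point is that each request is judged against its own URL's expectations before anything is aggregated.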
You should strongly consider moving away from average latency toward metrics that help identify the true impact of slow performance on your application. The Apdex index, or our version, the Satisfaction Score, is a good way to achieve that in any environment.
The more general takeaway, though, is that it is important to be careful when using averages to drive assumptions about how things work, because averages can hide critical outliers. By the same token, the common approach of using isolated occurrences (e.g. entries found in the request log) to make assumptions is also flawed, because those occurrences often represent isolated cases with little significance.
For LeanSentry, our answer was to do both – use the Satisfaction Score to characterize overall performance without losing the important outliers, and then track down slow requests across application URLs to diagnose problems in detail.
I've replied here:
@cr3, I believe you are really confused about the post! I addressed your feedback on your reddit link.
1. We did not invent the Apdex index. It was created by Peter Sevcik of Apdex Alliance, and since adopted by a number of APM products.
2. The Satisfaction Score we use is derived from the Apdex index, but accommodates different latency thresholds for different URLs, and is also expressed as a percentage instead of a decimal for better readability.
3. I am not saying that avg. response time is a bad metric, just that it is often used in ways that lead to incorrect interpretations of reality, especially when it's used as a performance metric for the entire site.
Do you disagree with our logic? Do you use another metric at the top level? Please tell us!
You raise some good points. Implementing this guidance can be tougher though given the lack of tool support.
I agree, the metrics (and therefore the quality of analysis) are heavily determined by the tools that ops have access to. So if all you get is the avg. response time or another high-level metric, it definitely encourages some of these bad assumptions.
This is the case with all analytics, though. For example, we struggled to get actionable data for website analytics despite having access to Google Analytics, Mixpanel, etc. You sort of have to build your own analysis tools if you want to draw the right conclusions for your environment.
You can do this reasonably well by writing some custom code using LogParser. You'd have to parse your request logs and take care of properly setting the latency expectations for each of your URLs, as I described above. One caveat with LogParser (as we found out while building LeanSentry) is that it really doesn't perform well for realtime log parsing, especially over the network. But you could probably make it work by aggregating your log files to a central location first and parsing them after a delay.
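If you go the roll-your-own route, the parsing step might look something like this in Python. The sample log is made up, but the field layout follows the IIS W3C log format, where the actual field order is declared in a `#Fields:` header and `time-taken` is in milliseconds:

```python
import io

SAMPLE_LOG = """#Software: Microsoft Internet Information Services
#Fields: date time cs-uri-stem sc-status time-taken
2013-05-01 12:00:01 /login 200 5200
2013-05-01 12:00:02 /img/footer.png 200 12
"""

def parse_w3c(lines):
    """Yield (url, latency_s) from W3C-format log lines, using the
    '#Fields:' header to map columns to names."""
    fields = []
    for line in lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]
        elif line.strip() and not line.startswith("#") and fields:
            row = dict(zip(fields, line.split()))
            yield row["cs-uri-stem"], int(row["time-taken"]) / 1000.0  # ms -> s

parsed = list(parse_w3c(io.StringIO(SAMPLE_LOG)))
print(parsed)  # [('/login', 5.2), ('/img/footer.png', 0.012)]
```

From there you would feed each (url, latency) pair through your per-URL classification, as described above.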
Some of the recent APM tools also provide better metrics, like the Apdex index. New Relic, and of course LeanSentry, are some of the examples.
If you need more pointers for using LeanSentry or how to implement something of your own, feel free to shoot us an email at support AT leansentry.com.
Mike, a great writeup! Glad to see you blogging again.
One point worth remembering before we go into complex indexes: calculating the "average" of anything is a really bad idea. Edge cases will skew the "average" so far that it will be useless, and won't show the actual picture. What you want to use is the "median", or even the "mode". And learn statistics: https://en.wikipedia.org/wiki/Median
@fo55, definitely echo the statement about averages. However, my point was about more than not using averages: it was also about not using statistical aggregates of the latency itself, vs. focusing on the effect latency has on site visitors.
This is where the 95th percentile metric can also fail to convey the extent to which your users are affected by latency outliers.
In other words, learn statistics, but also learn what you are trying to measure 🙂
Excellent analysis, thank you!
For your scheme to work, you need a predetermined value of T, of course.
I have had good success analyzing problem latencies without an a priori performance target, by keeping track of the average, 95th percentile, best, and worst response times in five-minute chunks. By 95th percentile, I mean the response time of the fifth-worst performing operation out of 100.
I found that plotting the 95th-percentile response over time gave a very accurate prediction of system problems like DBMS congestion and router overload. It tossed out the anomalous outliers like your 5-minute query, but still considered the slow stuff.
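For anyone who wants to try this, here is a rough Python sketch of that windowed approach (the sample data is synthetic, and the percentile indexing is a simple nearest-rank approximation):

```python
from collections import defaultdict

def window_stats(samples, window_s=300):
    """samples: iterable of (timestamp_s, latency_s).
    Returns per-5-minute-window avg / p95 / best / worst latencies."""
    windows = defaultdict(list)
    for ts, latency in samples:
        windows[int(ts // window_s)].append(latency)
    out = {}
    for w, lats in windows.items():
        lats.sort()
        p95 = lats[min(len(lats) - 1, int(0.95 * len(lats)))]  # nearest-rank
        out[w] = {"avg": sum(lats) / len(lats), "p95": p95,
                  "best": lats[0], "worst": lats[-1]}
    return out

# 95 fast requests and 5 slow ones, all within the first 5-minute window.
samples = [(i, 0.2) for i in range(95)] + [(i, 10.0) for i in range(95, 100)]
stats = window_stats(samples)[0]
print(stats["p95"], stats["worst"])  # 10.0 10.0
```

With a 5% slow tail, the p95 line immediately jumps while a lone one-off outlier would only move the "worst" line.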
Ran into a similar scenario looking at Stack Overflow / Stack Exchange's performance (average hides problems): http://blog.serverfault.com/2011/07/25/a-non-foolish-consistency/
Didn't take it as far as this, though. Apdex is a very cool metric.
@Kyle, just read your post, and I completely agree with your take on "consistency". Most performance problems in a mature application are outliers, not site-wide events that show up in aggregate (unless, of course, you get your EC2-small-instance WordPress blog on Hacker News without output caching, ahem).
Good points on why average latency is a bad metric, and while the idea behind Apdex was good, it never ended up being the right measure. The Apdex score still depends on a HiPPO (Highest Paid Person's Opinion) to determine what T should be, and this can change over time.
At SOASTA (and previously at LogNormal), we borrowed the concept of LD50 (the median lethal dose) from biology. The LD50 value has the property of adapting to what your audience thinks is a good experience, rather than what your HiPPO thinks.
We described the method at the Velocity conferences (Santa Clara and London) last year, and wrote it up in a blog post here: http://www.lognormal.com/blog/2012/10/03/the-3.5s-dash-for-attention/#ld50
Hope you find it interesting.
@Philip, I enjoyed reading your post, LD50 is a nice approach to determining a “tolerance” threshold statistically and tying it to what really matters. This definitely raises some interesting ideas about helping users quantitatively arrive at proper latency thresholds based on actual abort or goal conversion rates for the app.
Would love to chat more with you about this anytime. mike at leanserver dot com.
The average is meant to summarize symmetrical data. For any other type you can use the median, or special means for other distributions.
Also, if outliers are a problem, your goal is to narrow the standard deviation.
A bit late to this party Mike.
But what is confusing me here is the apparent lack of mention of the effect that clients' different connection speeds have on these figures.
I presume you are using time-taken to derive the duration in these examples.
Being a fellow IIS MVP like yourself, we understand that in IIS the time-taken field includes network time (apart from a tiny fraction of cases in certain caching scenarios).
This causes IIS to report some misleading metrics for determining whether the app is slow or not.
Now, in the modern web world, many are using mobile connections. In fact, I am using a mobile internet connection to my PC at the moment rather than a fixed broadband connection.
And obviously many phones use these connections too, and traffic from such devices is increasing.
And as you and many others will know, these connections are sometimes not the most reliable and can increase the time it takes to serve a page.
I think it is at least worth pointing out.
Now, if I were doing detailed troubleshooting of long-running pages and wanted to go really hardcore, I would look up the IP address and compare it against mobile carriers to get an indication of whether these are mobile connections. If they are, I would adjust my analysis as required and not take the slow connections as seriously.
You raise an excellent point re: how the "response time" is determined for the purposes of reporting and satisfaction classification. Ideally we probably want to look at TTFB, but we usually only get a sort of "TTLB", measured when IIS receives the network send completion from http.sys/the client.
I agree that clients with slow connections can sometimes drag down this number. We especially see this for our international clients, where slow network speeds are more the norm. In the US it seems to be less of a problem, because most connections are fast enough for small pages (though with larger files it can become an issue).
We do have ways to establish more of the processing time of the response vs. the send time, but in practice we don't do this today, because getting this information with 100% coverage would be WAY more intensive than what we do with logfile monitoring.
Regardless, great point. I wish I had covered this in the post.
Absolutely love this approach to measuring. Whilst I think the average and the 95th percentile are adequate measures when taken together, I think this is altogether much more accessible to an end client (which is key). My only concern is that the graphing may be difficult / not scalable where you have many transaction types with many different SLAs. PS: most response times in standard industry tools are measured by taking the TTFB.
Pingback: Alternative Metrics for Latency Performance Monitoring
[…] Volodarsky, Mike. "Why Average Latency Is a Terrible Way to Track Website Performance … and … […]