18 Comments

  1. Some clarifications:

    1. We did not invent the Apdex index. It was created by Peter Sevcik of Apdex Alliance, and since adopted by a number of APM products.

    2. The Satisfaction Score we use is derived from the Apdex index, but accommodates different latency thresholds for different URLs, and is also expressed as a percentage instead of a decimal for better readability.

    3. I am not saying that avg. response time is a bad metric. Just that it is often used in bad ways that lead to incorrect interpretation of reality, esp. when it's used as a performance metric for the entire site.

    Do you disagree with our logic? Do you use another metric at the top level? Please tell us!
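
As a rough illustration of point 2, the sketch below computes an Apdex-style score with a separate threshold per URL and reports it as a percentage. It uses the standard Apdex buckets (satisfied up to T, tolerating up to 4T, frustrated beyond that). The URLs, threshold values, and function names are made up for the example, and this is not necessarily how LeanSentry computes its Satisfaction Score.

```python
from collections import Counter

# Hypothetical per-URL thresholds T in seconds; unlisted URLs use DEFAULT_T.
THRESHOLDS = {"/": 0.5, "/search": 2.0, "/report": 8.0}
DEFAULT_T = 1.0


def classify(url, seconds):
    """Apdex-style bucket: satisfied <= T, tolerating <= 4T, else frustrated."""
    t = THRESHOLDS.get(url, DEFAULT_T)
    if seconds <= t:
        return "satisfied"
    if seconds <= 4 * t:
        return "tolerating"
    return "frustrated"


def satisfaction_percent(requests):
    """requests: iterable of (url, seconds). Returns the score on a 0-100 scale."""
    counts = Counter(classify(url, s) for url, s in requests)
    total = sum(counts.values())
    if total == 0:
        return None  # no traffic, so no score
    apdex = (counts["satisfied"] + counts["tolerating"] / 2) / total
    return round(100 * apdex, 1)


# Made-up sample: one slow report among otherwise acceptable requests.
sample = [("/", 0.2), ("/", 0.9), ("/search", 1.5), ("/report", 40.0)]
print(satisfaction_percent(sample))
```

With these sample requests, two are satisfied, one is tolerating, and one is frustrated, so the score is (2 + 0.5) / 4 = 62.5%.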

  2. @Sean, thanks.

    I agree, the metrics (and therefore the quality of analysis) are heavily determined by the tools that ops teams have access to. So if all you get is the avg. response time or another high-level metric, it definitely encourages some of these bad assumptions.

    This is the case with all analytics though. For example, we struggled with getting actionable data for website analytics despite having access to Google Analytics, Mixpanel, etc. You sort of have to build your own analysis tools if you want to draw the right conclusions given your environment.

    You can do this reasonably well by writing some custom code using LogParser. You'd have to parse your request logs, and take care of properly setting the latency expectations for each of your URLs as I described above. One caveat with LogParser (as we found out while building LeanSentry) is that it really doesn't perform well for realtime log parsing, especially over the network. But you could probably make it work by aggregating your logfiles to a central location first and parsing them after a delay.

    Some of the recent APM tools also provide better metrics, like the Apdex index. New Relic, and of course LeanSentry, are some of the examples.

    If you need more pointers for using LeanSentry or how to implement something of your own, feel free to shoot us an email at support AT leansentry.com.
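
For anyone rolling their own, here is a minimal sketch of the log-parsing step, written in plain Python rather than LogParser. It assumes W3C extended logs with a `#Fields:` line that includes `cs-uri-stem` and `time-taken` (in milliseconds). The thresholds, the sample log lines, and the function names are invented for illustration.

```python
# Hypothetical latency expectations per URL, in seconds.
SLOW_AFTER = {"/default.aspx": 1.0, "/report.aspx": 10.0}


def parse_w3c_log(lines):
    """Yield (url, seconds) from W3C extended log lines.

    Assumes the #Fields: directive names cs-uri-stem and time-taken,
    and that time-taken is in milliseconds.
    """
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue  # skip other directives and anything before #Fields:
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed rows
        row = dict(zip(fields, values))
        yield row["cs-uri-stem"].lower(), int(row["time-taken"]) / 1000.0


def slow_requests(lines, default=2.0):
    """Return the requests that exceeded the expectation for their own URL."""
    return [(url, s) for url, s in parse_w3c_log(lines)
            if s > SLOW_AFTER.get(url, default)]


# Made-up log excerpt.
log = """#Fields: date time cs-method cs-uri-stem sc-status time-taken
2013-05-01 10:00:01 GET /default.aspx 200 250
2013-05-01 10:00:02 GET /report.aspx 200 7200
2013-05-01 10:00:03 GET /default.aspx 200 4100
""".splitlines()

print(slow_requests(log))
```

Note that the 7.2-second report is not flagged, because it is within the expectation set for that URL, while the 4.1-second home page request is.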

  3. One point worth remembering before we go into complex indexes: calculating the “average” of anything is a really bad idea. Edge cases will skew the “average” so far it will be useless, and won't show the actual picture. What you want to use is the “median”, or even the “mode”. And learn statistics: https://en.wikipedia.org/wiki/Median

    • @fo55, Definitely echo the statement about averages. However, my point was about more than avoiding averages. It was also about not relying on statistical aggregates of the latency itself, and focusing instead on the effect latency has on the site visitors.

      This is where the 95th percentile metric can also fail to convey the extent to which your users are affected by latency outliers.

      In other words, learn statistics, but also learn what you are trying to measure :)
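
A small made-up example illustrates both points. The latencies are invented, the 2-second tolerance is an assumption, and the percentile uses the nearest-rank method.

```python
import statistics

# Made-up latencies in seconds: 90 fast requests, 6 slow, 4 very slow.
latencies = [0.2] * 90 + [3.0] * 6 + [30.0] * 4
TOLERANCE = 2.0  # assumed threshold beyond which a visitor is unhappy

mean = statistics.mean(latencies)
median = statistics.median(latencies)

# Nearest-rank 95th percentile, using integer math to avoid float rounding.
ordered = sorted(latencies)
rank = (95 * len(ordered) + 99) // 100
p95 = ordered[rank - 1]

# Share of requests that actually exceeded the tolerance.
affected_percent = 100 * sum(1 for s in latencies if s > TOLERANCE) / len(latencies)

print(f"mean={mean:.2f}s median={median}s p95={p95}s affected={affected_percent}%")
```

Here the mean (about 1.56s) is dragged up by four requests, the median (0.2s) says nothing is wrong, and the 95th percentile (3.0s) hides the 30-second requests entirely. The share of affected requests (10%) is the number that describes what visitors experienced.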

  4. Ollie Jones

    Excellent analysis, thank you!

    For your scheme to work, you need a predetermined value of T, of course.

    I have had good success analyzing problem latencies without using an a priori target performance by keeping track of average, 95th percentile, best, and worst response times in five-minute chunks. By 95th percentile, I mean the response time of the fifth worst performing operation out of 100.

    I found that plotting the 95th-percentile response over time gave a very accurate prediction of system problems like DBMS congestion and router overload. It tossed out the anomalous outliers like your 5-minute query, but still considered the slow stuff.
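
The bookkeeping described above could look roughly like the sketch below. It takes the "fifth worst out of 100" definition literally and scales it to the number of samples in each chunk; other percentile definitions may differ by one position. The function name, the input shape, and the dictionary keys are hypothetical.

```python
from collections import defaultdict
from statistics import mean

CHUNK_SECONDS = 300  # five-minute chunks


def summarize(samples):
    """samples: iterable of (unix_timestamp, response_seconds).

    Returns {chunk_start: {"avg", "p95", "best", "worst"}}.
    """
    chunks = defaultdict(list)
    for ts, seconds in samples:
        chunks[int(ts) // CHUNK_SECONDS * CHUNK_SECONDS].append(seconds)

    summary = {}
    for start, values in sorted(chunks.items()):
        values.sort()
        # "Fifth worst out of 100", scaled to the chunk size (at least 1).
        k = max(1, 5 * len(values) // 100)
        summary[start] = {
            "avg": mean(values),
            "p95": values[-k],
            "best": values[0],
            "worst": values[-1],
        }
    return summary


# Made-up data: 100 samples in the first chunk, one in the next.
samples = [(i, float(i)) for i in range(1, 101)] + [(310, 5.0)]
for start, stats in summarize(samples).items():
    print(start, stats)
```

Plotting the `p95` value of each chunk over time gives the kind of chart described above.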

    • @Kyle Just read your post, I completely agree with your take on “consistency”. Most performance problems in a mature application are outliers, and not site-wide events that show up in aggregate (unless of course you get your EC2-small-instance wordpress blog on Hacker News without output caching, ahem).

  5. Hi Mike,

    Good points on why average latency is a bad metric, and while the idea behind Apdex was good, it never ended up being the right measure. The Apdex score still depends on a HiPPO (Highest Paid Person's Opinion) to determine what T should be, and this can change over time.

    At SOASTA (and previously at LogNormal), we borrowed the concept of LD50 (the median lethal dose) from biology. The LD50 value has the property of adapting to what your audience thinks is a good experience, rather than what your HiPPO thinks.

    We described the method at the Velocity conferences (Santa Clara and London) last year, and wrote it up in a blog post here: http://www.lognormal.com/blog/2012/10/03/the-3.5s-dash-for-attention/#ld50

    Hope you find it interesting.

    • @Philip, I enjoyed reading your post, LD50 is a nice approach to determining a “tolerance” threshold statistically and tying it to what really matters. This definitely raises some interesting ideas about helping users quantitatively arrive at proper latency thresholds based on actual abort or goal conversion rates for the app.

      Would love to chat more with you about this anytime. mike at leanserver dot com.
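
One possible reading of the LD50 idea, sketched under loose assumptions: bucket sessions by page load time and look for the first bucket in which at least half of the sessions were abandoned. This is not taken from the linked post, whose actual method may differ. The function name, the one-second bucket width, and the sample data are all hypothetical.

```python
from collections import defaultdict


def ld50(sessions, bucket_seconds=1.0, min_sessions=1):
    """sessions: iterable of (load_seconds, bounced).

    Returns the lower edge of the first load-time bucket where at least
    50% of sessions bounced, or None if no bucket reaches that rate.
    """
    totals = defaultdict(int)
    bounces = defaultdict(int)
    for load, bounced in sessions:
        bucket = int(load // bucket_seconds)
        totals[bucket] += 1
        bounces[bucket] += bool(bounced)

    for bucket in sorted(totals):
        enough_data = totals[bucket] >= min_sessions
        if enough_data and bounces[bucket] / totals[bucket] >= 0.5:
            return bucket * bucket_seconds
    return None


# Made-up sessions: bounce rate rises with load time.
sessions = (
    [(0.5, False)] * 3 + [(0.5, True)] +
    [(2.5, False)] * 3 + [(2.5, True)] +
    [(3.5, False)] * 2 + [(3.5, True)] * 2 +
    [(6.0, True)] * 2
)
print(ld50(sessions))
```

In practice `min_sessions` would need to be large enough that sparse buckets do not trigger the threshold by chance.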

  6. dvd

    Avg is meant to be used to summarize symmetrical data. For any other type you can use the median, or special means for other distributions.
    Also, if outliers are a problem, your goal is to narrow the SD.

  7. Rovastar

    A bit late to this party Mike.

    But what confuses me here is the apparent lack of understanding, or at least of any mention, of the effect that the clients' different connection speeds have on these figures.

    I presume you are looking at timetaken to generate the durations in these examples.

    As fellow IIS MVPs, we both know that in IIS the timetaken includes network time (apart from a tiny fraction of cases in certain caching scenarios).

    This causes IIS to report some misleading metrics for determining whether the app is slow or not.

    Now in the modern web world many are using mobile forms of connection. In fact I am using a mobile internet connection on my PC at the moment rather than a fixed broadband connection.
    And obviously many phones use this and the traffic from these devices is increasing.

    And as you and many will know sometimes these are not the most reliable and can increase the time it takes to serve a page.

    I think it is at least worth pointing out.

    Now if I were doing detailed troubleshooting of long-running pages and wanted to go really hardcore, I would look up the IP address and compare it to mobile carriers' ranges to get an indication of whether these are mobile connections or not. If they are, then I would adjust my analysis as required and not take these slow connections as seriously.

    John

    • Hey John,

      You raise an excellent point re: how the “response time” is determined for the purposes of reporting and satisfaction classification. Ideally we probably want to look at TTFB, but we usually only get the sort of “TTLB” for when IIS receives the network send completion from http.sys/client.

      I agree that clients with slow connections can sometimes drag down this number. We especially see this for our international clients, where slow network speeds are more the norm. In the US, it seems to be less of a problem because most network connections are fast enough for small pages (again with larger files it can become an issue).

      We do have ways that we can try to separate the processing time of the response from the send time, but in practice we don't do this today because getting this information on a 100% covering basis would be WAY more intensive than what we do with logfile monitoring.

      Regardless, great point though. I wish I had covered this in the post.
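
The IP lookup John describes could be sketched as follows, using Python's standard `ipaddress` module. The network ranges here are placeholder documentation blocks, not real carrier data; a real list would have to come from the carriers or a third-party database. The function names are hypothetical.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real carrier data.
MOBILE_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_mobile(client_ip):
    """True if the client IP falls inside one of the listed networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in MOBILE_NETWORKS)


def split_by_connection(requests):
    """requests: iterable of (client_ip, seconds). Returns (fixed, mobile)."""
    fixed, mobile = [], []
    for ip, seconds in requests:
        (mobile if is_mobile(ip) else fixed).append(seconds)
    return fixed, mobile


# Made-up requests: one from a "mobile" range, one from elsewhere.
fixed, mobile = split_by_connection([("192.0.2.10", 9.0), ("203.0.113.5", 0.4)])
print(fixed, mobile)
```

Scoring the two groups separately, or with different thresholds, would keep slow client connections from being read as a slow application.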

  8. Absolutely love this approach to measuring. Whilst I think average and 95th percentile are adequate measures when taken together, I think this is altogether much more accessible to an end client (which is key). My only concern is that the graphing may be difficult or not scalable where you have many transaction types with many different SLAs. PS: Most response times in standard industry tools are measured by taking the TTFB.
