Mike Volodarsky's blog

Formerly the core server PM for IIS 7.0 and ASP.NET, now I run LeanSentry.

How I Over-Engineered the ASP.NET Health Monitoring Feature

ASP.NET Health Monitoring was one of the big features I worked on for the ASP.NET 2.0 release.  It was going to be the solution for troubleshooting and supporting ASP.NET apps in production … it also was one of the most misunderstood and arguably most over-engineered features we built.

Fast forward 8 years: after releasing ASP.NET and IIS 7, and building LeanSentry, this is the story of that feature, the lessons learned while building it, and a practical take on when to use (and not use) Health Monitoring for monitoring your ASP.NET applications.

What is Health Monitoring?

Health Monitoring is a set of ASP.NET runtime events for monitoring health, performance, and security, and a flexible provider-based framework for routing these events to email, eventlog, SQL server, and any other destination.

ASP.NET Health Monitoring feature

I don’t cover the basics of the feature here.  See links at the end for more about it.
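Still, to give a flavor of the configuration model, here is roughly what routing error events to email looks like in web.config. Treat this as a sketch: errors already go to the event log through a built-in rule, the addresses are placeholders, and the exact provider attributes may differ slightly from what I remember.

    <system.web>
      <healthMonitoring enabled="true">
        <providers>
          <!-- Sends plain-text event emails; SMTP settings come from <system.net>/<mailSettings> -->
          <add name="ErrorMail"
               type="System.Web.Management.SimpleMailWebEventProvider"
               from="app@example.com" to="ops@example.com"
               bodyHeader="ASP.NET application errors:" />
        </providers>
        <rules>
          <!-- "All Errors" is one of the built-in event mappings -->
          <add name="Errors To Email" eventName="All Errors"
               provider="ErrorMail" profile="Default" minInterval="00:01:00" />
        </rules>
      </healthMonitoring>
    </system.web>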

The origins

Scott Guthrie's vision for ASP.NET 2.0 was very ambitious, building on the success of the ASP.NET 1.1 programming model and looking to fundamentally improve virtually every aspect of the framework.

If you look at the features we shipped in 2.0, you can clearly see that.   Nearly every aspect of the framework received a huge upgrade: adding features like Membership, Role Management, Personalization, Provider model, Master pages, Code Beside, Themes, Skins, and so on.  Supportability was also one of the key pillars for ASP.NET 2.0 - Scott wanted to make ASP.NET apps much easier to monitor and troubleshoot.

Having just joined the ASP.NET team, I was the PM for the feature. At that point, all that developers/operations people had to go on were custom error pages, and the ASP.NET trace viewer (does anyone still remember this?).  If a problem happened without you watching it, you would never know it … and never get the details you needed to troubleshoot it. We had nothing that solved the monitoring, alerting, and forensic troubleshooting needs for production applications. So, in my mind Health Monitoring was to become the all-in-one answer to these problems.

One day I stopped Scott in the hallway to give him an update on Health Monitoring.  I brought up performance - how we were working on reducing the logging overhead in production environments.  Scott smiled and nodded approvingly … until I got to the fact that errors were not going to be logged by default, to improve performance.

Then, his face changed and he sternly said “We MUST log errors by default”.  It was non-negotiable … and I’ve since come to understand that decision every time I’ve had to log into a server and find the ASP.NET request exception details in the EventLog.

How I over-engineered the feature

One morning, I came into Patrick’s office (he was the developer on Health Monitoring) with a 25-page document called Buffer Modes.  It was a painfully detailed exploration of how people reacted to application problems, and the model I was proposing for shaping event flows to accommodate these situations.

I don’t remember the document in detail, but it was basically about breaking down app problems along dimensions like severity (how bad an error or problem is), rates and bursts of events, the information delivery properties of the transport medium (e.g. a text message can say one thing while an email can say twenty), and the modality of the user responding to the event (e.g. respond immediately to fix it, or review all issues in the evening, etc.).

It was a sickeningly thorough analysis of the possible scenarios, and it suggested an event shaping model we would use to express all the possible scenarios for any Health Monitoring event delivery provider.  If that wasn't enough, the night before I had written a WinForms app that graphically simulated the different event patterns, and how they would be shaped/delivered with different buffer modes.  I demoed the app to Patrick and our tester, and they were certainly impressed.

This meeting set the tone for the rest of the feature, sending us down a path of building a super-comprehensive and completely-flexible eventing framework that few people would eventually use.

Health Monitoring 8 years later

If you go on stackoverflow.com and read about people asking whether they should use Health Monitoring, ELMAH, log4net, or the Enterprise Logging Block, you will rarely find people advocating the use of Health Monitoring.  It’s not because they have anything bad to say about it; in fact, most of the time they don’t have anything to say at all, other than that they just ended up going with ELMAH (partially thanks to Hanselman’s endorsement) or log4net.

After 8 years of working with real people running real web applications on the Microsoft web platform, here is my bottom-line take on why this happens:

The biggest problems with Health Monitoring

1. It focuses on delivering the events, not reporting on them

In my opinion, this was our single biggest downfall. We created an eventing platform, not a monitoring platform.  The single hardest challenge of monitoring is how people consume the information, and make decisions based on it.  We could have made a more readable email report, and created a reporting plugin for InetMgr that people could use to aggregate and report on application health.  As we left it, someone had to write code to make something of our events … which made the feature useless for people looking for a ready-to-go monitoring solution.

The reason why people chose ELMAH over Health Monitoring is precisely because ELMAH makes it easy to report the errors the application encounters.  Hats off to you, Atif Aziz.

2. It's too big / flexible!

It's too big as a feature, and it has too much configuration for people who are skimming the web for a good monitoring solution for their ASP.NET apps.  Sure, they could read 18 pages of MSDN documentation covering every aspect of HM, or they could just install ELMAH and be done with it.  In today's age of turnkey solutions, guess which one they end up doing.

3. It got broken in ASP.NET MVC!

This took me by surprise: ASP.NET MVC broke the default behavior of logging ASP.NET errors to Health Monitoring.  You can read about it and how to fix it here.  This was clearly an oversight by the MVC team, and I am even more surprised that they haven't addressed it in MVC 3.0.
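The gist of the fix (the names below are mine, and this is a sketch of the general approach rather than the exact code from the linked post) is that HandleErrorAttribute swallows exceptions before ASP.NET gets a chance to raise its unhandled-error event, so you raise the Health Monitoring event yourself from an exception filter:

    using System;
    using System.Web.Management;
    using System.Web.Mvc;

    // WebRequestErrorEvent's constructors are protected internal, so we subclass it
    // in order to be able to raise it ourselves. (Hypothetical class names.)
    public class MvcUnhandledErrorEvent : WebRequestErrorEvent
    {
        public MvcUnhandledErrorEvent(Exception ex)
            : base("Unhandled exception in MVC request", null,
                   WebEventCodes.WebExtendedBase + 1, ex) { }
    }

    // Drop-in replacement for HandleErrorAttribute that also reports to Health Monitoring.
    public class HandleAndLogErrorAttribute : HandleErrorAttribute
    {
        public override void OnException(ExceptionContext filterContext)
        {
            if (filterContext.Exception != null)
                new MvcUnhandledErrorEvent(filterContext.Exception).Raise();

            base.OnException(filterContext);
        }
    }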

4. And, as always, the lack of education

ASP.NET Health Monitoring events have a lot of useful information … However, they are not logged anywhere by default and are therefore practically non-discoverable.  And even if you did log them, you’d have to learn quite a bit about what they mean and how to leverage them to improve your applications.  It's probably best to just leave them off, said everyone.

The good parts about Health Monitoring

1. It works by default

(Except in ASP.NET MVC, see above.)  When you deploy an ASP.NET application, you can count on going to the Event Log and finding any errors in there, even if you don't control the code and never configured anything. Believe it or not, this is probably the single most useful part of the feature (thanks, Scott!).

2. It has some interesting info you can't get elsewhere

Health Monitoring’s web events give some deep information about ASP.NET runtime behavior.  For example, there are events fired when:

  • The ASP.NET application recycles
  • Forms authentication tickets expire
  • Compilation fails
  • Membership logins fail
  • etc.

You can still use ELMAH for errors or log4net for application logging, and add in the ASP.NET runtime events from Health Monitoring.
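For example, a small custom provider can forward Health Monitoring events into log4net, so the ASP.NET runtime events land in the same log as your application messages. This is a sketch (the class name is mine): it assumes log4net is already configured, and the provider still has to be registered under <healthMonitoring>/<providers> and referenced by a rule.

    using System.Web.Management;
    using log4net;

    // Forwards Health Monitoring events to log4net. (Hypothetical class name.)
    public class Log4NetWebEventProvider : WebEventProvider
    {
        private static readonly ILog Log = LogManager.GetLogger("HealthMonitoring");

        public override void ProcessEvent(WebBaseEvent raisedEvent)
        {
            // WebBaseEvent.ToString() includes the event time, code, message, and details.
            if (raisedEvent is WebBaseErrorEvent)
                Log.Error(raisedEvent.ToString());
            else
                Log.Info(raisedEvent.ToString());
        }

        public override void Flush() { }
        public override void Shutdown() { }
    }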

3. It's extensible if you care enough to use it

For those who do want to leverage the extensive features in Health Monitoring, it offers a wealth of possibilities.  You can create and raise custom events with a rich hierarchy, and create flexible rules that route different events to different providers for delivery.  You can use the built-in providers, customize the event flows with buffer modes, customize the email reporting template, write your own provider, or use connector providers from other logging frameworks. By extending your application with Health Monitoring web events, you get the flexibility to use any of the built-in delivery channels or third-party logging frameworks without taking an explicit dependency on third-party logging instrumentation.
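As a minimal sketch, raising a custom event takes only a few lines; the event then flows through whatever rules and providers you have configured for its type (the event class and code below are made up, and custom event codes must be above WebEventCodes.WebExtendedBase):

    using System.Web.Management;

    // A custom Health Monitoring event for an application-level occurrence.
    public class CacheRebuiltEvent : WebBaseEvent
    {
        public CacheRebuiltEvent(string message, object source)
            : base(message, source, WebEventCodes.WebExtendedBase + 100) { }
    }

    // Somewhere in application code:
    // new CacheRebuiltEvent("Product cache rebuilt in 340ms", this).Raise();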

Should I use Health Monitoring or something else?

This is the question many are asking.  To keep my answer short, I would suggest:

  1. Use HM if you want to log / troubleshoot issues that could benefit from ASP.NET runtime events. You could still route them to your logging framework of choice.
  2. Use HM if you want to set up a very quick email report for errors in your application.  But you probably want to use ELMAH instead for simple error logging. Of course, this breaks down VERY QUICKLY if you have a lot of traffic/errors; at that point you really need a monitoring solution that can properly aggregate thousands of events and let you identify the key problems across them (instead of bombarding you with hundreds of alerts).  In fact, this is one of the key reasons why we built LeanSentry.
  3. Use HM / EventLog on any server where you don't control the code, to find the errors that the application is throwing. This is my go-to when troubleshooting in production.
  4. Use a logging framework for custom application logging.  Log4net is popular.  I would not recommend Health Monitoring here because it's overkill, and it forces too much structure onto custom application logging.

Lessons Learned

8 years and many large software projects later, the lessons learned have often been the same:

  1. Don't over-engineer your monitoring solution.  If it's too complicated, you won't benefit from it.
  2. When it comes to supporting production applications, getting the data is a (small) part of the problem. By far the biggest problem is reporting this data in a way that helps you make decisions and take action on it.

Now that I am older and wiser, we are trying to solve some of these problems again, this time with LeanSentry.

 


Resources:

1. Need to learn about the Health Monitoring feature? Read this and then this.

2. Looking for a monitoring solution for production ASP.NET applications that elevates the important patterns from the noise? Try LeanSentry.

Why average latency is a terrible way to track website performance … and how to fix it.

Average latency can be a misleading way to monitor website performance

How do you know if your website is running slowly? To answer this question, we often spot check a single metric ... and that metric is often the Average Response Time.
If the average response time is low, things are OK; if it's high or spiking, you have a problem.

Unfortunately, this turns out to be a pretty bad way of tracking website performance.

At LeanSentry, we had to find a better way to monitor website performance for our users. This post is about the common fallacies of monitoring website performance with simple metrics like the avg. response time, and the approaches we chose instead.

(Note: We've gotten a lot of responses saying "duuuh, averages are bad!" and "use the 95th percentile". This post is about more than those simple statements. It is about the common logic errors that result, and a discussion of how to select metrics that help you make better decisions.)



Here are the main reasons why avg. latency causes more harm than good:

Reason #1: It hides important outliers.

Average latency is fine, but we have 20% slow requests!

Imagine your application is serving 100 requests / minute:

A. 60 of those requests have a usual latency of 200ms.
B. 20 are really fast at 10ms (they got served from the cache).
C. 20 are hitting a database lock and are taking 10 seconds.

Your average latency for those requests will be about 2.1 seconds ((60 * 200ms + 20 * 10ms + 20 * 10s) / 100 = 2.12s). Looks fine, right?

Wrong. 20% of your users are hitting unacceptable load times and are leaving your site.

This is a classic effect of averages, which is fine when you primarily care about the total of the quantity being averaged.

For example, average daily revenue works as a metric for your ecommerce website because you mostly care about the total amount of money you made in the month, not the exact sales on each day. If the averages got you to the total you needed with no particularly bad sales days, it wouldn’t matter exactly how you arrived there.

However, here you care more about how many users experienced unacceptable delays, regardless of how fast the other users’ experience was … so the average site latency is a bad predictor of user “satisfaction” given that 20% of your users are unhappy with your site. It's not as if a user who received a particularly snappy load time somehow compensates for another user leaving your site because it's too slow.

And the worst part of it is … imagine that the 20% of your requests that are taking too long to load are actually all to your Signup, Login, or Checkout pages. Your site is 100% unusable, but looking at the average latency tells you everything is fine because requests to your non-critical pages (e.g. fancy footer images) outnumber requests to your key pages 4:1.

Reason #2: What latency are you really measuring?

Different URLs in your application have completely different concepts of what’s “normal” latency and what isn’t.

For example, if an image that is typically kernel cached usually takes 2ms to serve, and all of a sudden it takes 5 seconds, that is probably a cause for concern. However, if your login page takes 5 seconds to authenticate the user, that may not be so bad!

When you average them together, you are getting a frankenstein latency number that doesn’t really tell you whether your site is doing well, poorly, better, or worse, because you don’t know what that latency is for.

It’s even dangerous to compare the latency for a specific site over time. If the proportions of request traffic change (e.g. there are 4 more images on your homepage), the expected avg. latency for the site as a whole changes completely.

Reason #3: It's really skewed.

Average latency spiking - site-wide problem or outlier?

The average latency on the site just spiked to 30 seconds! Something is terribly wrong, you must restart your database and recycle your web servers immediately to fix it.

(NOTE: I use this as an example of a panicked reaction to spiking response times on a production site. I DO NOT actually recommend recycling your servers as a general solution to latency spikes.)

Wrong again. Turns out, this happened:

A. One request was made to /Products/PurplePantaloons, which triggered a cache miss (because you don’t actually sell Purple Pantaloons) and then got stuck in some almost-never-hit database query to log a missing product id for 5 minutes.

B. Your other 10 requests to other pages on your site completed normally in 200ms on average.

Your average latency therefore was nearly 30 seconds ((300s + 10 * 0.2s) / 11 ≈ 27.5s)!

The problem again is that averages describe the total of the quantity being measured, not how many samples were affected. So the average latency does not tell you how many users were hit by the problem. The bigger the outlier, the more the avg. latency number appears to reflect a huge global problem, even though it is still only 1 outlier.

But we still need a global metric to measure site performance … how can we do better?

That's right. We still wanted something to serve as a Key Performance Indicator (KPI), so we could make go/no-go decisions about site performance ... without having to resort to tables of URLs or searching through request logs looking for outliers (imagine how much later we’d be home then!).

We can do better by recognizing that we don’t really care about the latency itself as much as we care about the number of users affected by performance problems on the site. We want to know how well the site is doing overall, and also pick up on any outliers where some subset of the users may be affected even if the rest are fine.

Turns out there is already a metric that does this fairly well - the industry-standard Apdex index. Apdex was created by Peter Sevcik of the Apdex Alliance (apdex.org), and is used by several APM tools today.

The Apdex index is calculated like this:

Apdex_T = (# Satisfied requests + # Tolerating requests / 2) / (# Total requests)

T is the selected latency threshold: a request is considered satisfied if its latency is < T, tolerating if it is between T and 4T, and frustrated otherwise. So, if T is 2 seconds, a request that completes in under 2 seconds leaves the user satisfied, requests that take between 2 and 8 seconds are tolerated by users, and requests over 8 seconds are considered too slow.
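As a concrete sketch (in C#, since that is the ecosystem this blog lives in), the calculation is just a classification pass and a weighted count. Plugging in the two examples below reproduces the 80% and ~91% figures:

    using System;
    using System.Collections.Generic;

    static class ApdexCalculator
    {
        // Satisfied: latency < T; Tolerating: T <= latency < 4T; Frustrated: otherwise.
        public static double Apdex(IEnumerable<TimeSpan> latencies, TimeSpan t)
        {
            int satisfied = 0, tolerating = 0, total = 0;
            TimeSpan upper = TimeSpan.FromTicks(t.Ticks * 4);

            foreach (var latency in latencies)
            {
                total++;
                if (latency < t) satisfied++;
                else if (latency < upper) tolerating++;
                // anything slower counts only toward the total (frustrated)
            }

            return total == 0 ? 1.0 : (satisfied + tolerating / 2.0) / total;
        }
    }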

Let's look at our examples:

A. 60% of requests take 200ms, 20% take 10ms, and 20% are getting unacceptable latencies at 10 seconds.

Avg. latency is 2.1s (false sense of confidence), Apdex_2s is 80%. Nailed it!

B. 1 request takes 5 minutes, 10 requests take 200ms.

Avg. latency is ~30 seconds (ring the alarm!), Apdex_2s is 91% (good).

The Apdex index does a good job of providing that top-level sense of how well your site is doing, highlights slow performance even when only a small percentage of your users are affected, and does not let itself be skewed by requests with huge latencies.

Making it better

When making LeanSentry, we decided to build on Apdex as a top-level website performance metric. However, we still had the problem of different URLs having different latency requirements. We also wanted to make Apdex easier to understand for our users, and provide a pathway to dig into performance problems after the top-level score highlighted them.

Here are the improvements we made:

  1. Instead of using the Apdex index, we decided to use a slightly modified Satisfaction Score, which is essentially the Apdex index converted into a percentage for easier understanding. Also, our satisfaction score is not qualified with the T subscript, because it represents requests to multiple URLs in the application which have different T thresholds (see the next point).

  2. We classified each request into the standard Apdex categories of satisfied, tolerating, or frustrated, but based on a separately set latency threshold T for each major URL in your application (to solve the problem of different URLs having different acceptable latencies). For the Satisfaction Score of the entire site, we apply the Apdex formula over all requests that have already been pre-categorized using their individual thresholds (see the sketch after this list).

    Satisfaction breakdown for an application
    Track slow, failed, and sluggish requests over time

    This way, when our users log into LeanSentry, they can quickly assess the overall impact of website performance on their application. Then, they can also visualize the number of slow requests over time in relation to other events in the application.

  3. Finally, we take advantage of this concept throughout the troubleshooting process, allowing the user to drill through to see which URLs had slow or failed requests, and then diagnose the source of the slowdown on each URL.


    Track down slow or failed requests on each URL
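Here is a rough sketch of the per-URL idea mentioned above (illustrative only, not LeanSentry's actual implementation): classify each request against its own URL's threshold, then apply the Apdex formula over the pre-classified requests and express it as a percentage.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    enum Experience { Satisfied, Tolerating, Frustrated }

    static class SatisfactionScore
    {
        static Experience Classify(TimeSpan latency, TimeSpan t) =>
            latency < t ? Experience.Satisfied :
            latency < TimeSpan.FromTicks(t.Ticks * 4) ? Experience.Tolerating :
            Experience.Frustrated;

        // requests: (url, latency) pairs; thresholds: per-URL T values.
        public static double Compute(
            IEnumerable<(string Url, TimeSpan Latency)> requests,
            IReadOnlyDictionary<string, TimeSpan> thresholds,
            TimeSpan defaultT)
        {
            var classified = requests
                .Select(r => Classify(
                    r.Latency,
                    thresholds.TryGetValue(r.Url, out var t) ? t : defaultT))
                .ToList();

            if (classified.Count == 0) return 100.0;

            double satisfied = classified.Count(e => e == Experience.Satisfied);
            double tolerating = classified.Count(e => e == Experience.Tolerating);

            // Apdex applied over pre-classified requests, expressed as a percentage.
            return 100.0 * (satisfied + tolerating / 2.0) / classified.Count;
        }
    }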

In conclusion

You should strongly consider moving away from average latency, toward metrics that help identify the true impact of slow performance on your application. The Apdex index, or our version of it, the Satisfaction Score, is a good way to achieve that in any environment.

The more general takeaway, though, is that it is important to be careful when using averages to motivate assumptions about how things work, because averages can hide critical outliers. By the same token, the common approach of using isolated occurrences (e.g. entries found in the request log) to draw conclusions is also flawed, because those occurrences often represent isolated cases with little significance.

For LeanSentry, our answer was to do both - use the Satisfaction Score to characterize overall performance without losing the important outliers, and then track down slow requests across application URLs to diagnose problems in detail.