How I Over-Engineered the ASP.NET Health Monitoring Feature

ASP.NET Health Monitoring was one of the big features I worked on for the ASP.NET 2.0 release.  It was going to be the solution for troubleshooting and supporting ASP.NET apps in production … it also was one of the most misunderstood and arguably most over-engineered features we built.

Fast forward 8 years, past shipping ASP.NET 2.0 and IIS7 and building LeanSentry.  This is the story of the feature, the lessons learned while building it, and a practical take on when to use (and when not to use) Health Monitoring for monitoring your ASP.NET applications.

What is Health Monitoring?

Health Monitoring is a set of ASP.NET runtime events for monitoring health, performance, and security, plus a flexible provider-based framework for routing these events to email, the Event Log, SQL Server, or any other destination.

I don’t cover the basics of the feature here.  See links at the end for more about it.

The origins

Scott Guthrie’s vision for ASP.NET 2.0 was very ambitious, building on the success of the ASP.NET 1.1 programming model and looking to fundamentally improve virtually every aspect of the framework.

If you look at the features we shipped in 2.0, you can clearly see that.   Nearly every aspect of the framework received a huge upgrade: adding features like Membership, Role Management, Personalization, Provider model, Master pages, Code Beside, Themes, Skins, and so on.  Supportability was also one of the key pillars for ASP.NET 2.0 – Scott wanted to make ASP.NET apps much easier to monitor and troubleshoot.

Having just joined the ASP.NET team, I was the PM for the feature. At that point, all that developers/operations people had to go on were custom error pages, and the ASP.NET trace viewer (does anyone still remember this?).  If a problem happened without you watching it, you would never know it … and never get the details you needed to troubleshoot it. We had nothing that solved the monitoring, alerting, and forensic troubleshooting needs for production applications. So, in my mind Health Monitoring was to become the all-in-one answer to these problems.

One day I stopped Scott in the hallway to give him an update on Health Monitoring.  I brought up performance – how we were working on reducing the logging overhead in production environments.  Scott smiled and nodded approvingly … until I got to the fact that, to improve performance, errors were not going to be logged by default.

Then, his face changed and he sternly said “We MUST log errors by default”.  It was non-negotiable … and I’ve since come to understand that decision each time I had to log into a server and find the ASP.NET request exception details in the EventLog.

How I overengineered the feature

One morning, I came into Patrick’s office (he was the developer on Health Monitoring) with a 25-page document called Buffer Modes.  It was a painfully detailed exploration of how people reacted to application problems, and the model I was proposing for shaping event flows to accommodate these situations.

I don’t remember the document in detail, but it was basically about breaking down app problems along dimensions like severity (how bad an error or problem is), rates and bursts of events, the information delivery properties of the transport medium (e.g. text should say 1 thing vs. email can say 20), and the modality of the user responding to the event (e.g. respond immediately to fix, or review all issues in the evening, etc).

It was a sickeningly thorough analysis of the possible scenarios, and it suggested an event shaping model we would use to express all the possible scenarios for any Health Monitoring event delivery provider.  If that wasn’t enough, the night before I had written a WinForms app that graphically simulated the different event patterns, and how they would be shaped/delivered with different buffer modes.  I demoed the app to Patrick and our tester, and they were certainly impressed.

This meeting set the tone for the rest of the feature, sending us down a path of building a super-comprehensive and completely-flexible eventing framework that few people would eventually use.

Health Monitoring 8 years later

If you go on stackoverflow.com and read questions from people asking whether they should use Health Monitoring, ELMAH, log4net, or the Enterprise Logging Block, you will rarely find anyone advocating Health Monitoring.  It’s not because they have anything bad to say about it; in fact, most of the time they don’t have anything to say at all, other than that they ended up going with ELMAH (partially thanks to Hanselman’s endorsement) or log4net.

After 8 years of working with real people running real web applications on the Microsoft web platform, here is my bottom-line take on why this happens:

The biggest problems with Health Monitoring

1. It focuses on delivering the events, not reporting on them

In my opinion, this was our single biggest downfall. We created an eventing platform, not a monitoring platform.  The single hardest challenge of monitoring is how people consume the information, and make decisions based on it.  We could have made a more readable email report, and created a reporting plugin for InetMgr that people could use to aggregate and report on application health.  As we left it, someone had to write code to make something of our events … which made the feature useless for people looking for a ready-to-go monitoring solution.

The reason why people chose ELMAH over Health Monitoring is precisely because ELMAH makes it easy to report the errors the application encounters.  Hats off to you, Atif Aziz.

2. It’s too big / flexible!

It’s too big as a feature, and has too much configuration for people who are skimming the web for a good monitoring solution for their ASP.NET apps.  Sure, they could read 18 pages of MSDN documentation covering every aspect of HM, or they could just install ELMAH and be done with it.  In today’s age of turnkey solutions, guess what they end up doing.

3. It got broken in ASP.NET MVC!

This took me by surprise: ASP.NET MVC broke the default behavior of logging ASP.NET errors to Health Monitoring.  You can read about it and how to fix it here.  This was clearly an oversight by the MVC team, and I am even more surprised that they haven’t addressed it in MVC 3.0.

4. And, as always, the lack of education

ASP.NET Health Monitoring events have a lot of useful information … However, most of them are not logged anywhere by default and are therefore practically non-discoverable.  And even if you did log them, you’d have to learn quite a bit about what they mean and how to leverage them to improve your applications.  It’s probably best to just leave them off, said everyone.

The good parts about Health Monitoring

1. It works by default

(Except in ASP.NET MVC, see above.)  When you deploy an ASP.NET application, you can count on going to the Event Log and finding any errors in there, even if you don’t control the code and never configured anything. Believe it or not, this is probably its single most useful feature (thanks, Scott!).

2. It has some interesting info you can’t get elsewhere

Health Monitoring’s web events give some deep information about ASP.NET runtime behavior.  For example, there are events fired when:

  • The ASP.NET application recycles
  • Forms authentication tickets expire
  • Compilation fails
  • Membership logins fail
  • etc.

You can still use ELMAH for errors or log4net for application logging, and add in the ASP.NET runtime events from Health Monitoring.
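
As a rough sketch of what "adding in" those runtime events can look like, the web.config fragment below routes application lifetime events (start, shutdown, recycle) to the Event Log. It assumes the stock "Application Lifetime Events" event mapping and the stock "EventLogProvider" from the machine-level web.config are still in place; the rule name is made up for illustration.

```xml
<configuration>
  <system.web>
    <healthMonitoring enabled="true">
      <rules>
        <!-- Rule name is hypothetical. eventName and provider refer to
             entries assumed to exist in the default root web.config. -->
        <add name="App Lifetime To EventLog"
             eventName="Application Lifetime Events"
             provider="EventLogProvider"
             profile="Default" />
      </rules>
    </healthMonitoring>
  </system.web>
</configuration>
```

Errors can keep flowing to ELMAH or log4net as before; this rule only adds the runtime events alongside them.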

3. It’s extensible if you care enough to do it

For those who do want to leverage the extensive features in Health Monitoring, it offers a wealth of possibilities.  You can create and raise custom events with a rich hierarchy, and create flexible rules that route different events to different providers for delivery.  You can use the built-in providers, customize the event flows with buffer modes, customize the email reporting template, write your own provider, or use connector providers from other logging frameworks. By instrumenting your application with Health Monitoring web events, you keep the flexibility to use any of the built-in delivery channels or third-party logging frameworks without taking an explicit dependency on third-party logging instrumentation.

Should I use Health Monitoring or something else?

This is the question many are asking.  To keep my answer short, I would suggest:

  1. Use HM if you want to log / troubleshoot issues that could benefit from ASP.NET runtime events. You could still route them to your logging framework of choice.
  2. Use HM if you want to set up a very quick email report for errors in your application.  But, you probably want to use ELMAH instead for simple error logging. Of course, this breaks down VERY QUICKLY if you have a lot of traffic/errors.  At that point, you really need a monitoring solution that can properly aggregate thousands of events and let you identify the key problems across them (instead of bombarding you with hundreds of alerts).  In fact, this is one of the key reasons why we built LeanSentry.
  3. Use HM / EventLog on any server where you don’t control the code, to find the errors that the application is throwing. This is my go-to when troubleshooting in production.
  4. Use a logging framework for custom application logging.  Log4net is popular.  I would not recommend Health Monitoring here because it’s overkill, and it forces too much structure onto custom application logging.
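
For the quick email report in item 2, a minimal configuration sketch might look like the following. It assumes the built-in SimpleMailWebEventProvider, the stock "All Errors" event mapping, and the stock "Critical Notification" buffer mode; the provider name, rule name, addresses, and subject prefix are placeholders, and SMTP settings are assumed to be configured separately under system.net/mailSettings.

```xml
<configuration>
  <system.web>
    <healthMonitoring enabled="true">
      <providers>
        <!-- Provider name, addresses, and subject prefix are placeholders. -->
        <add name="ErrorMailProvider"
             type="System.Web.Management.SimpleMailWebEventProvider"
             from="app@example.com"
             to="ops@example.com"
             subjectPrefix="[MyApp error]"
             buffer="true"
             bufferMode="Critical Notification" />
      </providers>
      <rules>
        <!-- Rule name is hypothetical; "All Errors" is assumed to be the
             default event mapping from the root web.config. -->
        <add name="Errors By Email"
             eventName="All Errors"
             provider="ErrorMailProvider"
             profile="Default" />
      </rules>
    </healthMonitoring>
  </system.web>
</configuration>
```

Buffering batches events into fewer messages, but as noted above, no buffer mode substitutes for real aggregation once error volume gets high.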

Lessons Learned

8 years and many large software projects later, the lessons learned have often been the same:

  1. Don’t overengineer your monitoring solution.  If it’s too complicated, you won’t benefit from it.
  2. When it comes to supporting production applications, getting the data is a (small) part of the problem. By far the biggest problem is reporting this data in a way that helps you make decisions and take action on it.

Now that I am older and wiser, we are trying to solve some of these problems again, this time with LeanSentry.

 


Resources:

1. Need to learn about the Health Monitoring feature? Read this and then this.

2. Looking for a monitoring solution for production ASP.NET applications that elevates the important patterns from the noise? Try LeanSentry.

25 Comments

  1. The other big miss was how the story failed to tie together with where the platform was going – in this case Event Tracing for Windows. From the outside there seems to be a disconnect (surprise, surprise) between DevDiv and Windows. As a result sys admins have several different ways of getting to event info and even more to parse them as we look at SQL, Exchange, SharePoint, IIS/ASP.NET, and more. The one thing Microsoft needs to do better is stop re-inventing things within product groups and create a better eco-system by leveraging the platform services. During my days as a blue badge in MCS it was amazing how many times I had to tell one product group what the other was doing. Exposing some of the great things about WMI to the Windows Setup team, for example.

  2. @Ricardo, the scenario you mention is one of the specialty scenarios that HM is great for. Listen for a specific infrastructure event, and make decisions based on it. People have built intrusion detection systems based on HM events, etc.

  3. @Colin, I completely agree. Development is very often siloed without much integration between similar technologies, and the users have to sort it all out. Occasionally, teams building different but similar technologies meet at the end, there is an executive meeting, and then one team takes the other one over and essentially throws away their work. But, that’s a conversation for another time.

    There is another problem at play here, which I haven’t talked about. MSFT is notoriously bad at building consumer-facing functionality (mostly because so many people use the platform and it’s hard to create something that fits everyone without expending a HUGE effort). So the best bet is usually to build a platform with few user-facing features, and let the rich ecosystem take care of building on top of it. This was also the intention for HM. However, the tradeoff of course is that the monitoring experience is severely incomplete, lacking the kind of end-user reporting that ELMAH provided.

    So, in a way, this is Microsoft platform acting like a platform (and not a product). Letting the community rise up and fill out the gaps.

  4. Hashname

    Thanks for this fantastic post. I’ve also looked at HM countless times, but the amount of plumbing required to make it work turned me off. For that matter, I felt the same about WCF too; I was too scared to go anywhere near it. But the recent version has improved this considerably.
    I always wondered why HM didn’t receive any improvement all these years.

  5. @hashname, it’s a good question. I think HM was relegated to the “it works, no one is complaining, it must be fine” category of features. The product team was too busy focusing on things people actually asked about.

    Re: WCF, I agree. To be honest, we have used it internally a number of times and ended up regretting it most of the time. If you have the time to spend, we found building a more low-level communication solution (e.g. REST over HTTP with HttpWebRequest) is almost always more reliable, and in the end actually faster because you don’t spend any time working around any WCF weirdness. WCF is a classic example of overengineering that didn’t focus on delivering basic developer value.

    You know what else is a great example of MSFT overengineering? WWF. Don’t get me started.

  6. Thanks for the article on Health Monitoring. I too have chosen ELMAH before when that’s all that was required.

    Going to take a look at Lean Sentry, thanks for sharing that.

  7. Richard

    One big problem with HM: IIS7 (integrated pipeline) + Windows authentication + event buffering = ObjectDisposedException

    In IIS6 (and IIS7 classic pipeline), ASP.NET creates a WindowsIdentity to represent the authenticated user. That identity is never disposed, so when the buffered event is formatted, the identity is still valid.

    In IIS7 integrated pipeline, the WindowsIdentity is created by IIS and passed to ASP.NET, and it’s disposed when the request is completed. When the buffered event is formatted, the identity is no longer valid, and you get a “Safe handle has been closed” error.

    The only solution seems to be to turn off event buffering.
    http://forums.iis.net/t/1172849.aspx

  8. Terry Aney

    Mike – so I was one of the ones that floundered through making our own HM provider…finding workarounds for a few things (i.e. the MVC bug) and logging errors into our own SQL database. We also use HM for custom event logging and also page visit logging. Same as you, we probably worked too hard to accomplish some things that were probably available elsewhere, but our shop was mostly of the opinion of ‘sticking with MS and away from third party’ as much as possible. Anyway, my question is about LeanSentry – I glanced at the demo and saw error logging.

    a) Where is this information gathered from? Is/would it be coming from HM events/error descriptions (I’ve added custom description information to the default ‘unhandled’ error)?

    b) You mentioned above that some information could ‘only’ be retrieved from HM…I think I had read that earlier, which was another reason we worked to get HM plugged into our system instead of a home-baked or third-party logging system…how is that information represented in Lean Sentry?

    c) Finally, based on your answer for a) I guess, are you meant to use HM and Lean Sentry in tandem or is LS supposed to replace HM?

    Thanks in advance.

  9. Hi Terry,

    Thanks for checking in. Interesting to hear about using HM for visit logging. I would also suggest you check out some of the newer web analytics providers (e.g. GA, Chartbeat, Mixpanel, etc.) that add far more analysis value than is reasonable to build yourself anymore.

    To answer your questions:

    (a) Does LeanSentry use HM?
    LeanSentry gets its data from a variety of sources, one being HM but also performance counters, WMI, IIS logfiles, etc. It’s just not possible to get everything in one place if you want to create a holistic picture of your app’s health and performance.

    (b) Do we use HM events?
    We use ASP.NET health events as part of our analysis as needed. Right now, we mostly focus on errors.

    (c) What is the difference between HM and LeanSentry?
    I would say LeanSentry is a completely different tool. HM is a decent eventing framework you can use for your own purposes. LeanSentry is meant as a comprehensive monitoring and diagnostics tool that abstracts the health monitoring data sources, and focuses on delivering the critical insights to help improve application health.

    I don’t want to write a whole lot more here, but definitely check out our How it works page: https://www.leansentry.com/Howitworks.

    Best,
    Mike

  10. Howard Hoffman

    We @ MorphoTrust are big fans of Health Monitoring. We use Health Monitoring to publish “server trouble” events to our tier-3 support staff – and we teach them to refer to our Log4net Logs to get additional data about the actual bad-thing-that-we-just-did. We created a couple of custom ErrorEvents and find it useful. Yes, could use more documentation.

    Going forward, are there tools that integrate Health Monitoring (today, circa .NET 4.5) with things like System Center Operations Manager? It’d be great if we could somehow use the SCOM UI to aggregate ASP.NET Health as defined by Health Monitoring.

    FWIW – I think you did a nice job, and it was very impressive when it came out back in 2005.

  11. Terry Aney

    Hi Mike,

    Thanks for the response. After reading that and the ‘How It Works’ link, I still have a question that I couldn’t answer from that information. We may be in the position of creating a new site framework and deciding on both ‘Error/Performance Monitoring’ and custom site events (including standard page visit logging based on the currently authenticated user). I guess I still have two concerns/questions and am looking for opinions/answers from someone who obviously knows this topic 🙂

    1. I’m wondering if I still need to have HM enabled to make Lean Sentry work. You didn’t specifically say ‘replace HM with Lean Sentry’ and the ‘How It Works’ talks about presenting error diagnostics, but who gathers and logs that information? Do I need HM turned on in some ‘default capacity’ to make Lean Sentry work?

    2. I guess the benefit I enjoyed after rolling out our own HM provider (as painful as that was and is to maintain in terms of knowledge transfer in an extremely small software shop) was the fact that we stored info (including additional custom info – can’t recall how much was custom versus built in) in our own database. So in addition to getting the built-in emails when an error occurred from HM, which was our main ‘trigger’ to go and address something, when I needed to do some specific data mining (i.e. client called and asks when were all the times a specific user logged in, what pages did they hit, and they claimed an error happened on this day, what happened?), I could just hit the DB directly and do very specific queries to get the information I need. I’ve never spent too much time researching third-party error/logging/analytics packages (did try Google a bit, but then just tried to replicate grabbing the same info they gathered that seemed interesting to me) because I was always scared I would lose the flexibility and functionality we enjoyed storing and querying our own database, and in addition it just seemed to add both dependency upon a third-party solution and additional ‘complexity’ of requiring another account/tool comprehension (in Google Analytics’ case anyway) to view and diagnose the information (especially when our clients want to have a simple ‘site statistics’ report page within their site that we’ve traditionally written ourselves). So after that intro, I guess the question/concern is, am I being too closed-minded (and under-educated) on third-party solutions in terms of how comprehensive and flexible they can be? I’m guessing you might suggest a third-party package for error/performance diagnostics (like Lean Sentry :)) and possibly rolling my own (simplified) logging system to get all the custom information I want in my own storage and in a query-able manner?

    Thanks again.

  13. @Terry,

    This response is very late, but for the benefit of others – it raises a good question. Third party monitoring solutions are good for the basics, but if you want the reporting flexibility you often want to store your data somewhere where you can query it.

    At LeanSentry, we focused on reporting on health and performance data the way we believe is right to maximize the troubleshooting value, and minimize noise. Our goal is to get you to the solution for slow requests, or errors, or memory leaks, as fast as possible – using our own experience and troubleshooting techniques.

    In the process, we eliminate complexity – including arbitrary ways to search the data.

    If you are interested in doing very customized analysis on your data, writing your own solution is still the way to go. But, sometimes you have to ask the question – why do that if you can solve the problem faster with an existing tool made specifically for this purpose.

    The answer would probably depend on many factors, and trying it both ways.

    Best,
    Mike

  14. carlos

    Are there good patterns and practices for combining ASP.NET Health Monitoring, Failed Request Tracing, httpErrors, customErrors, Global.asax Application_Error, and BasePage OnError with Enterprise Library Logging and Exception Handling?

    Asp-net-health-monitoring
    mvolo.com/asp-net-health-monitoring-8-years-later/

    Failed Request Tracing
    http://www.iis.net/…/troubleshooting-failed-requests-using-tracing-in-iis

    httpErrors
    http://www.iis.net/…/how-to-use-http-detailed-errors-in-iis

    customErrors

    ELMAH or Enterprise Logging and Exception Handling

    Global.asax Application_Error

    BasePage OnError

  15. Robert Hoffmann

    I love Health Monitoring, and have been using it since around 2004. But as said above, logging is nice, but being able to dig through/filter the logs and fix stuff is better.

    So I ended up building a backend that can filter through all the stuff logged via the SQL provider, and added some custom logging methods, and the possibility to send JavaScript and browser errors to the provider.

    Heck, I even rewrote the backend and wanted to open it up on GitHub, but now with cool stuff like LeanSentry, and New Relic which has an awesome free tier, I don’t see that much use for it anymore. But if anyone wants to check it out and let me know if I should still open it up, have a peek at web-monitor.net (running on Azure)
