How I Over-Engineered the ASP.NET Health Monitoring Feature
ASP.NET Health Monitoring was one of the big features I worked on for the ASP.NET 2.0 release. It was going to be the solution for troubleshooting and supporting ASP.NET apps in production … it also was one of the most misunderstood and arguably most over-engineered features we built.
Fast forward 8 years, after releasing ASP.NET and IIS7 and building LeanSentry. This is the story of the feature, the lessons learned while building it, and a practical take on when to use (and when not to use) Health Monitoring for monitoring your ASP.NET applications.
What is Health Monitoring?
Health Monitoring is a set of ASP.NET runtime events for monitoring health, performance, and security, plus a flexible provider-based framework for routing these events to email, the Event Log, SQL Server, and any other destination.
I don’t cover the basics of the feature here. See links at the end for more about it.
Scott Guthrie’s vision for ASP.NET 2.0 was very ambitious, building on the success of the ASP.NET 1.1 programming model and looking to fundamentally improve virtually every aspect of the framework.
If you look at the features we shipped in 2.0, you can clearly see that. Nearly every aspect of the framework received a huge upgrade: adding features like Membership, Role Management, Personalization, Provider model, Master pages, Code Beside, Themes, Skins, and so on. Supportability was also one of the key pillars for ASP.NET 2.0 - Scott wanted to make ASP.NET apps much easier to monitor and troubleshoot.
Having just joined the ASP.NET team, I was the PM for the feature. At that point, all that developers/operations people had to go on were custom error pages, and the ASP.NET trace viewer (does anyone still remember this?). If a problem happened without you watching it, you would never know it … and never get the details you needed to troubleshoot it. We had nothing that solved the monitoring, alerting, and forensic troubleshooting needs for production applications. So, in my mind Health Monitoring was to become the all-in-one answer to these problems.
One day I stopped Scott in the hallway to give him an update on Health Monitoring. I brought up performance - how we were working on reducing the logging overhead in production environments. Scott smiled and nodded acceptingly … until I got to the fact that errors were not going to be logged by default to improve performance.
Then, his face changed and he sternly said “We MUST log errors by default”. It was non-negotiable … and I’ve since come to understand that decision each time I had to log into a server and find the ASP.NET request exception details in the EventLog.
How I overengineered the feature
One morning, I came into Patrick’s office (he was the developer on Health Monitoring) with a 25-page document called Buffer Modes. It was a painfully detailed exploration of how people reacted to application problems, and the model I was proposing for shaping event flows to accommodate these situations.
I don’t remember the document in detail, but it was basically about breaking down app problems along dimensions like severity (how bad an error or problem is), rates and bursts of events, the information delivery properties of the transport medium (e.g. a text message should say 1 thing vs. an email can say 20), and the modality of the user responding to the event (e.g. respond immediately to fix, or review all issues in the evening, etc.).
It was a sickeningly thorough analysis of the possible scenarios, and it suggested an event shaping model we would use to express all the possible scenarios for any Health Monitoring event delivery provider. If that wasn't enough, the night before I had written a WinForms app that graphically simulated the different event patterns, and how they would be shaped/delivered with different buffer modes. I demoed the app to Patrick and our tester, and they were certainly impressed.
This meeting set the tone for the rest of the feature, sending us down a path of building a super-comprehensive and completely-flexible eventing framework that few people would eventually use.
Health Monitoring 8 years later
If you go on stackoverflow.com and read the questions about whether to use Health Monitoring, ELMAH, log4net, or the Enterprise Logging Block, you will rarely find people advocating the use of Health Monitoring. It’s not because they have anything bad to say about it. In fact, most of the time they don’t have anything to say, other than that they just ended up going with ELMAH (partially thanks to Hanselman’s endorsement) or log4net.
After 8 years of working with real people running real web applications on the Microsoft web platform, here is my bottom-line take on why this happens:
The biggest problems with Health Monitoring
1. It focuses on delivering the events, not reporting on them
In my opinion, this was our single biggest downfall. We created an eventing platform, not a monitoring platform. The single hardest challenge of monitoring is how people consume the information, and make decisions based on it. We could have made a more readable email report, and created a reporting plugin for InetMgr that people could use to aggregate and report on application health. As we left it, someone had to write code to make something of our events … which made the feature useless for people looking for a ready-to-go monitoring solution.
The reason why people chose ELMAH over Health Monitoring is precisely because ELMAH makes it easy to report the errors the application encounters. Hats off to you, Atif Aziz.
2. It’s too big / flexible!
It’s too big as a feature, and has too much configuration for people who are skimming the web for a good monitoring solution for their ASP.NET apps. Sure, they could read 18 pages of MSDN documentation for every aspect of HM, or they can just install ELMAH and be done with it. In today’s age of turnkey solutions, guess what they end up doing.
3. It got broken in ASP.NET MVC!
This took me by surprise: ASP.NET MVC broke the default behavior of logging ASP.NET errors to Health Monitoring. You can read about it and how to fix it here. This was clearly an oversight by the MVC team, and I am even more surprised that they haven’t addressed it in MVC 3.0.
4. And, as always, the lack of education
ASP.NET Health Monitoring events have a lot of useful information … However, they are not logged anywhere by default and are therefore practically non-discoverable. And even if you did log them, you’d have to learn quite a bit about what they mean and how to leverage them to improve your applications. It’s probably best to just leave them off, said everyone.
The good parts about Health Monitoring
1. It works by default
(Except in ASP.NET MVC, see above). When you deploy an ASP.NET application, you can count on going to the Event Log and finding any errors in there, even if you don’t control the code and never configured anything. Believe it or not, this is probably the single most useful feature of it (thanks, Scott!).
2. It has some interesting info you can’t get elsewhere
Health Monitoring’s web events give some deep information about ASP.NET runtime behavior. For example, there are events fired when:
- The ASP.NET application recycles
- A forms authentication ticket expires
- Compilation fails
- A Membership login fails
You can still use ELMAH for errors or log4net for application logging, and add in the ASP.NET runtime events from Health Monitoring.
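As a rough sketch of what turning on one of these runtime events can look like, the web.config fragment below adds a rule that sends application lifetime events (start, shutdown, recycle) to the Event Log. It assumes the stock event mapping name "Application Lifetime Events" and the stock provider name "EventLogProvider" from the default root web.config are still in place on your server; the rule name is made up for illustration.

```xml
<configuration>
  <system.web>
    <healthMonitoring enabled="true">
      <rules>
        <!-- Hypothetical rule name. The eventName and provider values are
             assumed to match the defaults in the root web.config. -->
        <add name="App Lifetime To Event Log"
             eventName="Application Lifetime Events"
             provider="EventLogProvider"
             profile="Default" />
      </rules>
    </healthMonitoring>
  </system.web>
</configuration>
```

With a rule like this in place, application restarts should show up in the Application event log next to the errors that are already logged there by default.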
3. It’s extensible if you care enough to do it
For those who do want to leverage the extensive features in Health Monitoring, it offers a wealth of possibilities. You can create and raise custom events with a rich hierarchy, and create flexible rules that route different events to different providers for delivery. You can use the built-in providers, customize the event flows with buffer modes, customize the email reporting template, write your own provider, or use connector providers from any other logging frameworks. By extending your application with Health Monitoring web events, you have the flexibility to use any of the built-in delivery channels or third party logging frameworks without taking an explicit dependency on third party logging instrumentation.
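To make the rules / providers / buffer modes combination a bit more concrete, here is a minimal sketch of a web.config fragment that mails errors in batches instead of one email per event. The provider name, rule name, buffer mode name, addresses, and all of the numeric values are illustrative assumptions, not recommendations. It also assumes SMTP settings are configured elsewhere (under system.net/mailSettings) and that the default "All Errors" event mapping has not been removed.

```xml
<configuration>
  <system.web>
    <healthMonitoring enabled="true">
      <bufferModes>
        <!-- Hypothetical buffer mode: flush roughly hourly, or sooner
             if many events pile up. Values are illustrative only. -->
        <add name="Hourly Digest"
             maxBufferSize="1000"
             maxFlushSize="100"
             urgentFlushThreshold="500"
             regularFlushInterval="01:00:00"
             urgentFlushInterval="00:05:00"
             maxBufferThreads="1" />
      </bufferModes>
      <providers>
        <!-- Built-in mail provider; name and addresses are placeholders. -->
        <add name="ErrorMailProvider"
             type="System.Web.Management.SimpleMailWebEventProvider"
             from="app@example.com"
             to="ops@example.com"
             subjectPrefix="[MyApp] "
             buffer="true"
             bufferMode="Hourly Digest" />
      </providers>
      <rules>
        <!-- Route every error event to the mail provider above. -->
        <add name="All Errors By Mail"
             eventName="All Errors"
             provider="ErrorMailProvider"
             profile="Default" />
      </rules>
    </healthMonitoring>
  </system.web>
</configuration>
```

Even this small example touches three separate configuration collections, which gives a sense of how much there is to learn before the feature pays off.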
Should I use Health Monitoring or something else?
This is the question many are asking. To keep my answer short, I would suggest:
- Use HM if you want to log / troubleshoot issues that could benefit from ASP.NET runtime events. You could still route them to your logging framework of choice.
- Use HM if you want to set up a very quick email report for errors in your application, although you probably want to use ELMAH instead for simple error logging. Of course, this breaks down VERY QUICKLY if you have a lot of traffic/errors. At that point, you really need a monitoring solution that can properly aggregate thousands of events and let you identify the key problems across them (instead of bombarding you with hundreds of alerts). In fact, this is one of the key reasons why we built LeanSentry.
- Use HM / EventLog on any server where you don’t control the code, to find the errors that the application is throwing. This is my go-to when troubleshooting in production.
- Use a logging framework for custom application logging. log4net is popular. I would not recommend Health Monitoring here because it’s overkill, and forces too much structure for custom application logging.
8 years and many large software projects later, the lessons learned have often been the same:
- Don’t overengineer your monitoring solution. If it’s too complicated, you won’t benefit from it.
- When it comes to supporting production applications, getting the data is a (small) part of the problem. By far the biggest problem is reporting this data in a way that helps you make decisions and take action on it.
Now that I am older and wiser, we are trying to solve some of these problems again, this time with LeanSentry.
Looking for a monitoring solution for production ASP.NET applications that elevates the important patterns from the noise? Try LeanSentry.