Keeping the lights on

Having spent years running 24×7 internet-facing production systems, I find that the monitoring element of an application delivery environment is often the last item to be addressed and is built outside of the application delivery architecture. As we continue to build our application delivery infrastructure in the cloud, having a good monitoring strategy will allow us to arm ourselves with the information we need to make intelligent decisions.

So exactly what should be monitored?


Availability

The first element in a monitoring strategy is to determine whether the application is accessible. The simplest way to determine availability is ping. However, as most applications are obscured behind a load balancer, a ping response doesn’t necessarily mean that the application is responding to requests. Use a monitoring system that can speak application-layer protocols to ensure that the application is indeed healthy and responding to user requests. It’s best to leverage a third-party solution that can assess availability from multiple networks and provide an unbiased view of the application’s availability.
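
To make the difference between a ping and an application-layer check concrete, here is a minimal sketch in Python using only the standard library. The function names, the 2xx-only success rule, and the optional `expect_text` check are my own assumptions for illustration, not a feature of any particular monitoring product.

```python
import urllib.error
import urllib.request


def classify_response(status, body, expect_text=None):
    """Decide whether one HTTP response counts as healthy (assumed rule: 2xx)."""
    if not 200 <= status < 300:
        return False, f"HTTP {status}"
    if expect_text is not None and expect_text not in body:
        # The server answered, but not with the page we expected.
        return False, f"HTTP {status}, expected text missing"
    return True, f"HTTP {status}"


def check_http(url, timeout=5.0, expect_text=None):
    """Probe a URL over HTTP and return (healthy, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The application responded, but with an error status (4xx/5xx).
        return classify_response(exc.code, "", expect_text)
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar.
        return False, f"unreachable: {exc}"
    return classify_response(status, body, expect_text)
```

A call such as `check_http("https://app.example.com/health", expect_text="OK")` (a hypothetical URL) would report the application as unhealthy when the load balancer answers but the backend returns an error page, which is exactly the case a ping cannot see.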

Resource Utilization / Load

The next element in a good monitoring strategy is to determine how healthy a system is. Tracking the load of various system components will enable us to uncover bottlenecks within the application delivery environment. Leverage SNMP to capture and record utilization statistics on CPU, memory, disk IO, network IO, threads, and so on. Graph these stats to establish a baseline and find correlations between the monitored elements.
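
Once utilization samples have been recorded, establishing a baseline and looking for correlations can be done with very little code. The sketch below assumes the samples are already available as plain lists of numbers (how they were polled is out of scope here); the function names and the three-standard-deviation threshold are illustrative assumptions rather than recommended values.

```python
import math
from statistics import mean, pstdev


def baseline(samples):
    """Return (mean, population standard deviation) of a utilization series."""
    return mean(samples), pstdev(samples)


def is_anomalous(value, samples, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from the mean."""
    avg, spread = baseline(samples)
    if spread == 0:
        # A perfectly flat history: any different value is a deviation.
        return value != avg
    return abs(value - avg) / spread > threshold


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient close to 1 between, say, CPU and disk IO series sampled at the same timestamps suggests the two rise and fall together, which is a useful starting point when hunting for a bottleneck. It does not by itself show which one causes the other.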


Performance

Performance monitoring is often the most challenging element of a monitoring strategy. Here we are concerned with how the application is performing for a given user. The most common approach is to create synthetic transactions simulating user behavior and run those transactions from different network locations. While availability and load monitoring focus on individual components within an application delivery environment, performance monitoring delivers a holistic view of how well the individual components are working together.
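
A synthetic transaction is essentially an ordered list of user actions, each of which gets timed. The following is a minimal sketch of that idea in Python; `run_transaction` and the shape of its result are hypothetical names of my own, and each step is simply a callable so that the same runner works for HTTP requests, logins, or anything else.

```python
import time


def run_transaction(steps):
    """Run (name, action) steps in order; return (timings, succeeded)."""
    timings = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            action()
            ok = True
        except Exception:
            # Any exception counts as a failed step in this sketch.
            ok = False
        elapsed = time.perf_counter() - start
        timings.append({"step": name, "seconds": elapsed, "ok": ok})
        if not ok:
            break  # later steps depend on earlier ones, so stop here
    return timings, all(t["ok"] for t in timings)
```

In practice each action might fetch a page, for example `("home", lambda: urllib.request.urlopen("https://app.example.com/").read())` with a hypothetical URL. Running the same list of steps from several network locations and recording the per-step timings gives the end-to-end view that component-level monitoring lacks.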


Security

The final element in our monitoring strategy is focused on security. Unfortunately, system security is often an afterthought, usually dealt with AFTER an intrusion has resulted in significant downtime. I urge everyone to proactively monitor changes in system behavior to minimize the time it takes to discover and rectify an intrusion. At a minimum, track file- and network-level changes. Production systems should not see changes in system binaries. Unused TCP and UDP ports should remain closed. A change in either would indicate an anomaly and should be thoroughly investigated.
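
Both checks can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not a substitute for a dedicated integrity tool: the function names are hypothetical, files are read fully into memory, and the port check covers TCP only (UDP has no connection handshake and needs a different approach).

```python
import hashlib
import socket
from pathlib import Path


def snapshot(paths):
    """Map each existing file path to the SHA-256 digest of its contents."""
    digests = {}
    for path in map(Path, paths):
        if path.is_file():
            digests[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(baseline, current):
    """Return paths that were added, removed, or modified since the baseline."""
    every_path = set(baseline) | set(current)
    return sorted(p for p in every_path if baseline.get(p) != current.get(p))


def unexpected_open_ports(host, ports, allowed, timeout=0.5):
    """Return TCP ports that accept connections but are not in `allowed`."""
    found = []
    for port in ports:
        if port in allowed:
            continue
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising on failure.
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found
```

The idea is to take a `snapshot` of the system binaries when the host is known to be clean, store it somewhere an intruder cannot modify, and compare later snapshots against it with `changed_files`. Any non-empty result from either check is a reason to investigate.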

So far I’ve not touched on tools that can be leveraged to monitor each of these elements. Over the years I’ve worked with both open source and commercial tools. My favorites include Big Brother, Monit, Cacti, Analog (a bit long in the tooth now), Keynote, Tripwire, and lots of Perl scripting. A complete monitoring strategy will incorporate multiple tools as I’ve yet to come across a single tool that does it all.