How I Learned to Stop Worrying and Start Loving – 1 Minute Monitoring

by | Hosting

Years ago I wrote a short article on a previous business website arguing that 99.9% uptime is all hype and that a promise like 99.7% is more realistic and, at the same time, quite easily achievable. That was at a time when you rented a server in a data centre and the people working there had to go to their warehouse or the nearest computer shop to get a replacement for a broken power supply. I estimated that this would take 6 hours including notification and diagnostics, planned for it to happen twice a year, and then doubled the outage time to be safe. That comes to 24 hours of downtime per year, or roughly 99.7% uptime.


(99.7% uptime over a year(!!!))
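
The arithmetic behind that figure can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and a year is assumed to have 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def uptime_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Uptime as a percentage of the period, given total downtime."""
    return 100.0 * (1 - downtime_hours / period_hours)


def allowed_downtime_hours(target_percent, period_hours=HOURS_PER_YEAR):
    """Downtime budget implied by an uptime promise."""
    return (1 - target_percent / 100.0) * period_hours


# Figures from the article: 6 hours per incident, twice a year, doubled to be safe.
budget_hours = 6 * 2 * 2

print(f"Planned downtime: {budget_hours} hours per year")
print(f"Resulting uptime: {uptime_percent(budget_hours):.2f}%")
print(f"Budget at 99.9%: {allowed_downtime_hours(99.9):.2f} hours per year")
print(f"Budget at 99.7%: {allowed_downtime_hours(99.7):.2f} hours per year")
```

Under these assumptions, a 99.9% promise leaves under 9 hours per year, which would not even cover the two 6-hour hardware replacements before doubling.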

At the time when I came up with this, most monitoring companies offered 30-minute monitoring for a reasonable price – a couple of bucks per month per page monitored.

Enter the missing outages

Of course, if an external provider visits your website or web application only every 30 minutes, they will miss some outages. Say they are on a schedule to check your site at 5 and 35 minutes past every hour. They will notice an issue that lasted from 9:00am to 9:10am, because the 9:05am check falls inside it. But they will miss an outage between 9:10am and 9:20am entirely.
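
To make the schedule concrete, here is a minimal Python sketch of that situation. The helper names are hypothetical, times are minutes since midnight, and an outage is treated as covering its start minute but not its end minute.

```python
def probe_minutes(offsets_past_hour=(5, 35), hours=24):
    """Minutes since midnight at which the monitor checks the site."""
    return [h * 60 + m for h in range(hours) for m in offsets_past_hour]


def hm(hour, minute):
    """Convert a clock time to minutes since midnight."""
    return hour * 60 + minute


def is_detected(outage_start, outage_end, probes):
    """An outage is seen only if at least one check falls inside it."""
    return any(outage_start <= p < outage_end for p in probes)


probes = probe_minutes()

# 9:00-9:10 contains the 9:05 check, so it is noticed.
print(is_detected(hm(9, 0), hm(9, 10), probes))   # True
# 9:10-9:20 contains no check (9:05 is before it, 9:35 after), so it is missed.
print(is_detected(hm(9, 10), hm(9, 20), probes))  # False
```

Put differently, a 10-minute outage is caught only if it happens to overlap one of the two check times in the hour.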

Outages are normal: we experience them, we analyse what happened and we do something about it. As a result, they happen less often. (Of course we also make other changes to websites that make them happen more often, but that’s a different story.)

What I mistakenly believed at some point was that cumulative outage time would increase the more I reduced the monitoring interval. So I expected that reducing the interval to 15 minutes would surface further outages that I did not know about yet.

I missed basic maths

What I didn’t realise for quite a while was that reported outage time is also affected by the monitoring interval. Take the above-mentioned outage between 9:00am and 9:10am. Of course I wouldn’t actually know exactly when that outage was; all I would know is that the check at 9:05am failed and the check at 9:35am succeeded. So, really, all I can report is a 30-minute outage. If that happens once in a year, that is 99.994% uptime; over a month it is 99.93% uptime. However, the same outage monitored at a 15-minute interval shows only half the reported downtime, because I can now report the 10-minute outage as 15 minutes.
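
The effect of the interval on reported downtime can be sketched in Python. The code assumes that an outage is reported from the first failed check to the next successful one, that checks run on a fixed grid starting 5 minutes past the hour, and that a month has 30 days; the function names are illustrative only. Times are minutes after 9:00am.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525600
MINUTES_PER_MONTH = 30 * 24 * 60   # 43200, assuming a 30-day month


def next_probe_at_or_after(t, interval, first_probe=5):
    """First check time on the grid first_probe + k * interval that is >= t."""
    k = -((first_probe - t) // interval)  # integer ceiling of (t - first_probe) / interval
    return first_probe + k * interval


def reported_outage_minutes(start, end, interval, first_probe=5):
    """Minutes from the first failed check to the next successful check."""
    first_fail = next_probe_at_or_after(start, interval, first_probe)
    if first_fail >= end:
        return 0  # no check fell inside the outage, so nothing is reported
    first_ok = next_probe_at_or_after(end, interval, first_probe)
    return first_ok - first_fail


def uptime_percent(downtime_minutes, period_minutes):
    return 100.0 * (1 - downtime_minutes / period_minutes)


# The 9:00-9:10 outage, seen at three different monitoring intervals.
for interval in (30, 15, 1):
    reported = reported_outage_minutes(0, 10, interval)
    print(
        f"{interval:>2}-minute checks: reported {reported} min, "
        f"year {uptime_percent(reported, MINUTES_PER_YEAR):.3f}%, "
        f"month {uptime_percent(reported, MINUTES_PER_MONTH):.2f}%"
    )
```

Under these assumptions the same 10-minute outage is reported as 30, 15 and 10 minutes respectively, so the reported uptime rises as the interval shrinks even though nothing about the site has changed.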

As a result of this realisation – and thanks to more cost-effective monitoring providers – we switched all monitoring to 1-minute intervals about 2 years ago. This has had several beneficial outcomes:

  • It fits with our value of transparency. It is simpler and much closer to reality than a 30 or 15 minute reporting interval.
  • Reported uptime actually improves, because outages are generally quite short and are no longer rounded up to a full 15 or 30 minutes.
  • It drives reliability efforts further, because it surfaces issues that are not visible at longer monitoring intervals with a small group of sites. One example is unreliable DNS hosting providers that cause site outages we cannot control.

Summary

There’s no going back to slower monitoring intervals and there’s no need to either.