Unscheduled Downtime on 02/11/12

Summary:

Between around 17:00 and 18:00 UTC on the 2nd of November, it may not have been possible for some of you to access Three Rings. This was because our DNS servers went offline. We have brought backup DNS servers online to resolve the problem and have also made substantial improvements to protect Three Rings against future downtime from this source.

What is DNS?

DNS, or the ‘Domain Name System’, is like a telephone directory for the Internet. Most people access websites by typing the URL, or web address, into the address bar of their web browser, and their web browser then takes them to the website.

What’s actually happening behind the scenes is that the DNS server is translating a nice-for-humans address like ‘www.bbc.co.uk’ into a nice-for-computers address like ‘212.58.241.131’. Just as with a name in a phone book, the name of a website is usually a lot easier to remember than the string of numbers your computer needs in order to “speak” to it!
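
For the curious, that lookup can be sketched in a few lines of Python. This is only an illustration: `PHONE_BOOK`, `look_up` and `look_up_for_real` are made-up names, and the single directory entry is the example address above. The standard-library call `socket.gethostbyname` is the real thing, and asks whichever DNS your computer is set up to use.

```python
import socket

# A toy "telephone directory": human-friendly names mapped to
# computer-friendly addresses. The one entry is the example used above.
PHONE_BOOK = {"www.bbc.co.uk": "212.58.241.131"}


def look_up(name, directory=PHONE_BOOK):
    """Return the address listed for a name, as a DNS server would.

    Raises KeyError if the name isn't in the directory.
    """
    return directory[name]


def look_up_for_real(name):
    """Ask the real DNS instead (needs a working Internet connection)."""
    return socket.gethostbyname(name)


print(look_up("www.bbc.co.uk"))  # prints 212.58.241.131
```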

What happened today?

Anyone hosting a website is advised to have two DNS servers, located in different places, so that if one breaks the other can keep going as a backup. Three Rings actually has four, so we should have been fine.

Unfortunately, our DNS provider’s master systems in New York seem to have been badly affected in the wake of Hurricane Sandy, and they’re not having any luck fixing the issue from any of their other offices around the world.

As a result of that problem, all four of our DNS servers began to fail around 17:00 UTC (because the DNS issue will have affected different Internet Service Providers at different times, it’s hard for us to be precise about when our clients will have been unable to access Three Rings).

That meant that even though Three Rings was still running, nobody’s computer could access the site (because even though the computers were being given the human-friendly address, there was no DNS server to translate that into a computer-friendly address they could understand).

What have we done about it?

As soon as we became aware that Three Rings was unreachable, we started an investigation into what happened, and began work to get the site back online. We also posted live updates on our Twitter feed:

We issued Tweets with updates every few minutes throughout the downtime.

Once we had determined that the problem was with our DNS provider, we fired up new DNS servers with a backup provider. Because of the way DNS works, a change like this can sometimes take a long time to reach everybody, but in this instance it seems to have happened quickly.

Three Rings is accessible again, and the issue should now be resolved.

What did we learn?

We’ve learnt a couple of things today.

Firstly, even though we had twice as many DNS servers as recommended, on four different continents, it turned out that there was a single point of failure in New York. That meant that once our DNS provider was unable to fix things at their New York site, their other three sites became useless too. (Even if it took a one-in-a-hundred-year storm to trigger the failure, a single point of failure is still serious business!)

So, we’re upgrading again. We’ve now got eight DNS servers around the world (quadruple the recommended number!), and they’re hosted with two separate companies, so even if an event occurs to knock out one set of DNS servers, the other four will keep directing users to Three Rings.
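
The reasoning can be shown with a small model. This is a sketch under assumptions, not a description of our real configuration: the provider and server names below are placeholders, and an outage is simplified to "everything hosted with that provider stops working", which is what happened to us.

```python
# Placeholder names: four servers with each of two separate providers.
NEW_SETUP = {
    "provider-a": ["a1", "a2", "a3", "a4"],
    "provider-b": ["b1", "b2", "b3", "b4"],
}

# The old arrangement: four servers, all with one provider.
OLD_SETUP = {"provider-a": ["a1", "a2", "a3", "a4"]}


def servers_still_working(servers_by_provider, providers_down):
    """List the servers hosted with providers that are not down."""
    return [
        server
        for provider, servers in servers_by_provider.items()
        if provider not in providers_down
        for server in servers
    ]


def site_can_be_found(servers_by_provider, providers_down):
    """The site stays reachable while at least one DNS server works."""
    return len(servers_still_working(servers_by_provider, providers_down)) > 0


# With one provider knocked out, the old setup has nothing left,
# while the new setup still has four working servers.
print(site_can_be_found(OLD_SETUP, {"provider-a"}))  # prints False
print(site_can_be_found(NEW_SETUP, {"provider-a"}))  # prints True
```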

We’ve also boosted our Watchdog service. Previously, we had a system which was checking that Three Rings was up and running once every hour, and that was emailing us in the event of a problem. That system checked for a problem based on whether or not the Watchdog could reach Three Rings itself. As a result, there was a window between our DNS servers going down and the Watchdog noticing a problem (because the Watchdog spent a little while remembering the computer-friendly address for Three Rings before it asked the DNS to remind it and discovered the DNS wasn’t responding).
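
That "remembering" is why the problem stayed hidden for a while, and it can be sketched as follows. The class name `RememberingLookup`, the five-minute memory and the address `192.0.2.1` (an address reserved for documentation) are all assumptions made for the example; we are not describing how the Watchdog is actually built.

```python
import time


class RememberingLookup:
    """Wraps a lookup function and remembers its answers for a while."""

    def __init__(self, lookup, remember_for_seconds, clock=time.monotonic):
        self._lookup = lookup
        self._remember_for = remember_for_seconds
        self._clock = clock
        self._remembered = {}  # name -> (address, time it was learned)

    def resolve(self, name):
        now = self._clock()
        if name in self._remembered:
            address, learned_at = self._remembered[name]
            if now - learned_at < self._remember_for:
                # Answered from memory: the DNS is not asked at all,
                # so a DNS failure goes unnoticed here.
                return address
        # Memory is empty or too old: ask the DNS. If the DNS is not
        # responding, this is the first moment the failure shows up.
        address = self._lookup(name)
        self._remembered[name] = (address, now)
        return address
```

A usage sketch with a pretend DNS and a pretend clock:

```python
now = [0.0]
dns_is_up = [True]


def pretend_dns(name):
    if not dns_is_up[0]:
        raise OSError("DNS not responding")
    return "192.0.2.1"


watchdog = RememberingLookup(pretend_dns, 300, clock=lambda: now[0])
watchdog.resolve("3r.org.uk")  # asks the DNS and remembers the answer
dns_is_up[0] = False           # the DNS goes down...
now[0] = 200.0
watchdog.resolve("3r.org.uk")  # ...but the remembered answer hides it
now[0] = 301.0                 # once the memory expires, the next
                               # resolve() call raises OSError
```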

From today we’re paying a little extra to make that check happen once every 15 minutes. We’ve also instructed the Watchdog to monitor all eight of the new DNS servers, so if something happens either to Three Rings or to all eight of our DNS servers, at least three of our volunteers will immediately receive an email warning them of the problem (and, to be extra safe, they’ll receive one email on their Three Rings address and another on their personal or day-job emails, so there’ll be no missing them!).
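
A check of this kind might look something like the sketch below. It is an illustration under assumptions rather than the Watchdog's real code: the function names and `QUERY_ID` are invented, the packet is a bare-bones DNS question in the standard wire format, and the alert rule simply mirrors the description above (alert if the site is down, or if every DNS server has failed).

```python
import socket
import struct

QUERY_ID = 0x3152  # arbitrary number used to match a reply to our question


def build_query(hostname, query_id=QUERY_ID):
    """Build a minimal DNS question asking for the address of hostname."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Name: each dot-separated part prefixed by its length, then a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # Type 1 = address (A) record, class 1 = Internet.
    return header + qname + struct.pack("!HH", 1, 1)


def nameserver_responds(server_ip, hostname, timeout=3.0):
    """Ask one DNS server directly; True if it answers without an error."""
    query = build_query(hostname)
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
    except OSError:  # includes timeouts and unreachable networks
        return False
    if len(reply) < 12:  # too short to contain a DNS header
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    is_response = bool(flags & 0x8000)
    no_error = (flags & 0x000F) == 0
    return reply_id == QUERY_ID and is_response and no_error


def failing_nameservers(server_ips, hostname, probe=nameserver_responds):
    """Return the servers that did not answer."""
    return [ip for ip in server_ips if not probe(ip, hostname)]


def should_alert(site_is_up, server_ips, failing):
    """Alert if the site is down or every DNS server has failed."""
    all_dns_down = len(server_ips) > 0 and len(failing) == len(server_ips)
    return (not site_is_up) or all_dns_down
```

Because each DNS server is asked directly, a check like this doesn't depend on any remembered answer, so a failure shows up at the very next check.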

Key points:

  • We had four DNS Servers, spread around the world. They were used to tell users’ computers how to find the Three Rings website.
  • Our DNS provider had a problem. They were unable to solve this problem because to do so they needed their New York centre online and that is still affected by the fallout from the superstorm Hurricane Sandy.
  • With our DNS provider unable to fix the problem, users’ web browsers became unable to translate the human-friendly web address ‘3r.org.uk’ into a computer-friendly web address. As a result, Three Rings became unavailable around 17:00 UTC (it’s hard to be precise about the time because of the way DNS works).
  • In response we brought up backup DNS servers with a different provider. This fixed the issue and most people will have found Three Rings was accessible again by 18:00 UTC.
  • We have also added an extra four DNS servers. We now have a total of eight DNS servers from two separate providers, and there is no longer a single point of failure in New York (or anywhere else!)
  • We are also paying more for an improved Watchdog which will monitor both the Three Rings site and our DNS providers for any problems. If anything happens, several Three Rings volunteers will be alerted to the issue immediately. So not only will we be able to start fixing things immediately, but we’ll also be able to give much more accurate times if there’s ever another DNS problem!

In the meantime, we hope our clients will please accept our apologies for any inconvenience we’ve caused them, and also feel assured we’ve taken this opportunity to learn, and to improve our operations still further.

Update: 01:00 UTC, 03/11/12:

Our DNS provider are now investigating whether the outage was the result of a Denial of Service attack against all their servers, rather than a problem centred on their New York centre. (This doesn’t change the impact that the DNS servers going down had on Three Rings, nor does it make our response of doubling our DNS servers and providers any less effective.)