Network outage at our colocation provider

Howdy all,

This afternoon/evening, our colocation provider had a network outage that affected two of our three servers. The affected servers happened to be the ones that host the following websites:

This pretty much covers everything that faces end users and provides all of our services that users interact with (our website, mail server, software license manager/verification, software updates, etc.). So, that sucked for a few hours today.

I was able to bring back the software repositories in a moderately functional form within an hour or so of the outage becoming complete. (The site was just slow and unreliable for an hour or so before we traced the problem to the colo network and got them on the case; once they started working on it, we were completely offline, I guess due to hardware being swapped out.)

We, of course, have a variety of notifications and alerts set up to prevent situations like this, or at least to minimize downtime. And we did know about the problem quickly (we ended our weekly meeting early to deal with it), but there was no good way to resolve having our two primary world-facing servers go offline due to a network outage.

We’re back up now, and I’ll be switching back to the fully functional version of the software repositories now that the database is back online.

Frustratingly, one of the goals with the recent server migration was to divide up our services so that any one service outage would be less dramatic. To that end, we moved DNS out to Cloudmin Services managed virtual machines (and DNS did not go offline), we moved and to their own server (but I hadn’t finished the planned addition of license database replication, so it failed when failed, regardless of being on another system, so that’s frustrating).

We’ve never had a network failure of this magnitude or duration at this provider, and we have no reason to believe it will be a common occurrence, but I will continue working on the project of dividing our services out onto their own systems, figuring out ways to ensure one failure is never so far-reaching.

There are plenty of ways to approach “always-on” availability, and we know about them, of course; both Jamie and I have even built quite large systems that do so. But the time and cost of setting such a thing up and maintaining it is a definite factor for us. As an open source project, we have far more demands on our time and monetary resources than we actually have time and resources to distribute!

That said, I’ll be researching options for keeping a backup of most of our services on a server in another data center and on a different network. I never want to lose, even if the website has to go down for a while, as it can negatively affect people installing our software and prevent licenses from being verified. As it stands, we have all three of our servers hosted in one data center to minimize cost and maximize convenience (I lived relatively nearby for a few years), but the cost (particularly to my sanity) of having such a long outage that affects nearly everything is pretty high.

For the folks affected by the outage, apologies for the inconvenience, and thanks for your patience.