Today's outage of Virtualmin.com

Joe · March 20, 2018, 1:57am

Howdy all,

Today was fun, let me tell you about it!

Short story:

There was a hardware failure on the server that hosts Virtualmin.com. It took a while to figure out exactly what hardware had failed. It turned out to be the network interface. Switching that got us back online.

Long story (and why it took several hours to get back online):

Our colo has my Virtualmin.com address as the point of contact for our account. This server hosts that mail…so, after I opened a ticket, I had a conversation with a wall while they cheerfully responded to all of my questions and sent the answers to a server that was off-line. That took a couple of hours and a realtime chat to sort out. I thought I was waiting on them most of this time (they’ve historically been very responsive and quick, so I gave them the benefit of the doubt here) and they thought the issue was solved after the first reboot of the server was completed. We were both wrong. Once they began communicating via an email address that works, we were able to make progress, though slowly.
I requested a KVM, they plugged one in. The server looked completely dead to me at this point. No video. The tech poked at this for a while, trying to make it work.
So, eventually, the tech moved it around to the front VGA port (this server has two). Success! I can see the screen and the server actually looks fine…a few weird errors, but it genuinely looked OK, and the CentOS prompt was there; no errors during boot process. But, keyboard wasn’t responding.
I asked for a reboot into an older kernel, hoping it’d randomly change some behavior (either the non-working keyboard or whatever was keeping it from responding on the network). No dice.
The tech cleverly tried switching the port into which the KVM keyboard was plugged…and I had working keyboard, finally! (We’re like half a day into the story by now.)
Now that I had working keyboard and video, I could start troubleshooting. Everything looked fine. But, network was not working and could not be brought up. Tech said that while there were activity lights on the NIC port on power-on, once it booted there was no light there. He suggested swapping to another port (this server has multiple NICs). So, we did that.
After a reconfigure of the network (this is a bridge interface for a bunch of Cloudmin guests) and a couple of reboots (to make sure we’d be able to come back in the future), we are finally back online.

All in all, it took several hours from the time I was aware of the issue to the time when I’m writing this now. Uptime is a balancing act between expense, time, complexity, etc., and we’ve been running pretty lean (on money and time) these past couple of years.

Last time we had a long outage of our downloads server (software.virtualmin.com) due to hardware problems I set it up with some redundancy in multiple data centers. Database-backed sites, like virtualmin.com, are much more complex to make reliable/redundant, but I’ll be working on it this week. With the Webmin.com outage a week or so ago (that one was SourceForge.net going off-line for several days twice; we now host Webmin.com on one of our servers), I think I’m fed up with outages…so, it’s a priority to solve it once and for all, though I don’t know exactly what that will look like. Expect some maintenance notices later this week as I sort it out.

Cheers,

Joe

methownet · March 20, 2018, 3:13am

Joe,
Welcome back and thanks for the update. Troubleshooting under pressure seems to be the name of the game. You guys have helped me plenty over the years in similar circumstances and I really appreciate it.
Jeff

Diabolico · March 20, 2018, 8:48am

First of all, it wasn’t several hours but more likely 18-20+ hours. Second, for a hardware failure this is just too long. I’m glad that you managed to solve the problem, but it took you 8-12 hours just to post anything on Twitter, let alone that the entire service was down for so many hours. Maybe it would be better to stop with colo and get dedicated servers where the hosting company would be responsible for the hardware, or buy a bunch of spare parts that will sit in the DC so their people can make a quick repair when something fails. But personally I think this isn’t cost effective and it’s much better when the hosting company is in charge of the hardware. It doesn’t need to be super expensive hosting for this; even OVH will replace faulty hardware in 15-30 min, and for sure we can’t call them an expensive host.

Anyway welcome back.

Joe · March 21, 2018, 7:11am

Sh!t happens. I’m not very happy about how it played out, either, but it is what it is. No data was lost, and the server is in good enough shape to where I can take a few days to sort out what we’re going to do next with Virtualmin.com. So, I’m just gonna be grateful for things working out in the end. As I mentioned, there will be changes.

airshock · March 22, 2018, 1:30pm

Hi Joe,

Sorry to hear of the Virtualmin.com outage but I’m glad you were able to get it resolved; hardware issues are definitely a pain to work with.

I’ve got several years’ experience helping to dramatically increase uptime on database-backed sites hosted on a Virtualmin platform through my current work with J&E Media Corp and the 200+ sites we host for our business clients (we’re a Web development firm), and I would be more than happy to assist you in making Virtualmin.com as a database-backed site more redundant and highly-available. My current high availability set up used for J&E customer sites is a set of five front-end Web nodes connected to a public load balancer and backed by 5 replicated MariaDB Galera servers that also replicate site files across each other via a Gluster set up. This has worked very well for us and, after I found the optimal configuration parameters for our systems, we’ve had consistent uptime since November 2017. We’ve had a few short MariaDB outages on one or two of our machines since then, but because we’re running in a clustered configuration nobody has noticed since none of our sites ever went down as other MariaDB machines were online to handle requests. As I mentioned before, all of this is powered by a Virtualmin set up on one of our Web nodes that drives the configuration and automation of the other machines.

Anyways, like I said I’d be happy to help in your efforts to scale Virtualmin.com and increase its uptime if you’d like some assistance the