Howdy all,
Today was fun, let me tell you about it!
Short story:
- There was a hardware failure on the server that hosts Virtualmin.com. It took a while to figure out exactly which hardware had failed. It turned out to be the network interface. Switching to a different NIC port got us back online.
Long story (and why it took several hours to get back online):
- Our colo has my Virtualmin.com address as the point of contact for our account. This server hosts that mail…so, after I opened a ticket, I had a conversation with a wall while they cheerfully responded to all of my questions and sent the answers to a server that was off-line. That took a couple of hours and a real-time chat to sort out. I thought I was waiting on them most of this time (they’ve historically been very responsive and quick, so I gave them the benefit of the doubt here), and they thought the issue was solved once the first reboot of the server had completed. We were both wrong. Once they began communicating via an email address that worked, we were able to make progress, though slowly.
- I requested a KVM, they plugged one in. The server looked completely dead to me at this point. No video. The tech poked at this for a while, trying to make it work.
- So, eventually, the tech moved it around to the front VGA port (this server has two). Success! I could see the screen and the server actually looked fine…a few weird errors, but it genuinely looked OK, and the CentOS prompt was there; no errors during the boot process. But the keyboard wasn’t responding.
- I asked for a reboot into an older kernel, hoping it’d randomly change some behavior (either the non-working keyboard or whatever was keeping it from responding on the network). No dice.
- The tech cleverly tried switching the port into which the KVM keyboard was plugged…and I had a working keyboard, finally! (We’re like half a day into the story by now.)
- Now that I had a working keyboard and video, I could start troubleshooting. Everything looked fine, but the network was not working and could not be brought up. The tech said that while there were activity lights on the NIC port at power-on, once the server booted there was no light there. He suggested swapping to another port (this server has multiple NICs). So, we did that.
- After a reconfigure of the network (this is a bridge interface for a bunch of Cloudmin guests) and a couple of reboots (to make sure we’d be able to come back up in the future), we are finally back online.
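
For anyone chasing a similar "no link light after boot" problem, here is a minimal Python sketch of one way to see which interfaces the kernel thinks have a link. It is not what we ran during the outage. It assumes a Linux box that exposes `/sys/class/net/<interface>/operstate`, and the function names `link_states` and `interfaces_without_link` are made up for this example.

```python
from pathlib import Path


def link_states(sysfs_root="/sys/class/net"):
    """Return {interface_name: operstate} for each entry under sysfs_root.

    Assumes the Linux sysfs layout; returns an empty dict if the
    directory does not exist (for example, on a non-Linux system).
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    states = {}
    for iface in sorted(root.iterdir()):
        try:
            # operstate holds a single word such as "up", "down" or "unknown"
            states[iface.name] = (iface / "operstate").read_text().strip()
        except OSError:
            # Not an interface directory, or the file could not be read
            states[iface.name] = "unreadable"
    return states


def interfaces_without_link(states, ignore=("lo",)):
    """List interfaces whose state is anything other than "up".

    The loopback interface normally reports "unknown", so it is skipped.
    """
    return [name for name, state in states.items()
            if name not in ignore and state != "up"]
```

Calling `interfaces_without_link(link_states())` from a console session returns the names of interfaces that are not reporting `up`, which can help narrow down which physical port or bridge to look at first.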
All in all, it took several hours from the time I became aware of the issue to the time I’m writing this. Uptime is a balancing act between expense, time, complexity, etc., and we’ve been running pretty lean (on money and time) these past couple of years.
Last time we had a long outage of our downloads server (software.virtualmin.com) due to hardware problems, I set it up with some redundancy in multiple data centers. Database-backed sites, like virtualmin.com, are much more complex to make reliable/redundant, but I’ll be working on it this week. With the Webmin.com outage a week or so ago (that one was SourceForge.net going off-line for several days, twice; we now host Webmin.com on one of our servers), I think I’m fed up with outages…so, it’s a priority to solve it once and for all, though I don’t know exactly what that will look like. Expect some maintenance notices later this week as I sort it out.
Cheers,
Joe