Maintenance Tonight

Eric · November 17, 2017, 1:20am

Just a heads up that we’ll be performing some maintenance today (Thursday), at 9:30pm EST / 6:30pm PST.

This could affect some services such as download.webmin.com.

The Cloudmin and Virtualmin servers should not be affected.

Maintenance should be no longer than 2 hours.

Let us know if you have any questions… thanks!

Joe · November 17, 2017, 9:08am

This is actually still ongoing (several hours later). Copying a huge 4 TB disk that is failing is taking a long time. ddrescue is really cool, though.

Fun story (while I wait for one of our servers' drives to finish copying): About two years ago, I installed all new disks in all three of our servers in colo in one night. It was a long night, but I figured it’d be worth it to finally solve our performance problems for good on all of our servers (SSDs on the main server, SSHD hybrid drives on the others, while also doubling the RAM in every server to 32GB for two of them and 16GB for the small one that mostly acts as a name server and backup server). Within months of those new drives going online, we started getting SMART errors from all of the hybrid drives. Realistically, when you start seeing unrecoverable disk errors, and SMART begins to show increasing numbers of errors, it’s time to replace the drives, but I haven’t had the time or the funds to replace them all. We’ve had more and more frequent system freezes on the two affected servers, so I finally bit the bullet and bought all new drives and made the drive to Dallas to get them installed.

But, I’m still flabbergasted that every single drive I installed needs to be replaced. A couple of the drives that were originals from several years before the replacements (I think they’re 6 years old) are actually still in use as backup drives and have no SMART errors.

So, tl;dr: Seagate SSHD hybrid drives are a terrible idea. Don’t use them. We’ve had a 100% failure rate out of four drives. I don’t think they really sell them anymore, though, so I guess someone figured out they were a disaster.

noisemarine · November 17, 2017, 9:54am

I used to manage data centres for a living. When you had thousands of installed drives, it was a daily thing to swap a few out (hot swap bays). First instance would be pull/push and let it rebuild, and note it in the log. Surprising how often this worked. Subsequent failures were replaced and RMAd. The one thing I took away is never trust your storage, always have backups. I even take archive copies of my backups…

There was an old rule that when building arrays, you shouldn’t use all drives from the same batch. I don’t know if it’s an old sysadmin’s tale, but I’ve always tried to practice it. So far, so good.

Joe · November 17, 2017, 1:50pm

We’re all about backups and redundancy. We’ve not yet lost data, in all our years as a company, though we have lost numerous disks and had a variety of hardware problems. Failure is virtually guaranteed if you run enough servers long enough.

I am giving up for the moment, as I ran out of hours of night and the caffeine isn’t going to keep me awake much longer. I mostly finished one of them, but the other one still needs new disks (I didn’t pull them both at once because then we’d be without DNS and nothing would work!). I’ll come back tomorrow to do the other server.

Every time I have to make a trip like this, I think more and more it’s worth paying the large premium for cloud-based servers (like AWS, Google, etc.), just so I don’t have to pull all-nighters to rebuild servers.

And, yeah, I think these drives were a bad batch. They were all purchased at once from the same vendor. But, I’ve also read a few other folks complain about this particular drive, and they don’t seem to make them anymore, so I think there was something fundamentally broken about them. Maybe heat from having an SSD and spinning platters in the same case, I dunno. And, 1U rackmount servers get a little warm…though they’re in a cool environment, and none of our sensors have complaints about the internal temperature of the systems.

Jfro · November 20, 2017, 5:55pm

That old rule, yes, but most of us have to deal with it at least once, because you don’t have everything in hand when a new box with disks… is delivered.

Also, power surge protection (UPS, USV) is very important. If the UPS has worked but its batteries are too old, or it has had some other failure, that could also be a cause of data errors on several disks at almost the same time (though these errors may first show up weeks or even months after the actual power incident). (Sometimes even the battery of the RAID controller could be the cause…)
