RAID Disk Failed... What Now?

Long story short: I successfully replaced the drive in the array.

… but it’s almost inconceivably unlikely that (the disk) hardware was the problem!

Not only has the new drive already failed out of the array, but an identical machine running different domains also now has a failed array.

These are two nearly identical machines, side by side. The only thing that has changed lately is the amount of traffic, which has gone from practically none to thousands of hits a day on both machines. It’s not a lot of traffic, and these machines should easily be able to handle it… but BOTH machines have sent me “degrade event” emails followed by “failed drive” events.

Yes, it’s possible that three drives failed simultaneously on two separate machines. Perhaps the power is bad or the temperature is high. But I think not.

It must be related to the traffic.

I believe my drives are suffering read errors due to the amount of traffic.

How can I verify this, or, to cut to the chase, just assume it’s true and keep my drives from suffering read errors under such puny traffic?

Thanks again,

Tony

Sorry, wanted to reply earlier, but the site was down all day.

First, you need to elaborate on “does not boot”. With that description alone, it is impossible to give any hints. What exactly happens when you try to boot with a new disk?

You could also use a rescue CD: boot from that, use it to resync the array with a new disk, then install the boot manager on the new drive. You need to bind-mount /dev, /proc and /sys into the directory where you mount the root of your to-be-repaired installation, and use chroot. Google should find you tutorials on how to do that.
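As a rough sketch of that chroot dance (assuming, purely as an example, that your root array is /dev/md1, the new disk is /dev/sdb, and /mnt/target is a scratch mount point; adjust the names to your setup):

    # mount the root filesystem of the installation to be repaired
    mkdir -p /mnt/target
    mount /dev/md1 /mnt/target
    # make the running kernel's device, process and sysfs trees visible inside it
    mount --bind /dev  /mnt/target/dev
    mount --bind /proc /mnt/target/proc
    mount --bind /sys  /mnt/target/sys
    # switch into the installation and reinstall the boot loader on the new disk
    chroot /mnt/target
    grub-install /dev/sdb
    exit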

“High traffic” should, under normal circumstances, never be a cause for RAID failure. HDDs are made to transfer data at full speed, even for extended periods of time. E.g. during a resync, the whole disk is read and written at high speed. “1000s of hits per day” on a webpage is very, very low traffic in terms of required disk I/O.

It is indeed unlikely that three drives fail nearly at once. You can use “smartctl” to check the SMART data of your drives and see if there is any indication of actual failure.
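For example (a minimal sketch; /dev/sda is just a placeholder, repeat for each array member):

    # print the overall health verdict plus all SMART attributes for one drive
    smartctl -H -A /dev/sda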

Otherwise, it is possible that the HDDs used are unsuitable for mdadm usage. Some drives can, under certain circumstances, take a very long time to respond to OS commands, which mdadm might interpret as a failure. You might check Google whether that is the case for your drives.
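One thing worth trying (not all drives support it, so treat this as a sketch) is querying the SCT Error Recovery Control setting, which controls how long the drive keeps retrying a bad sector before reporting an error:

    # show the current read/write error-recovery timeouts, if the drive supports SCT ERC
    smartctl -l scterc /dev/sda
    # cap both timeouts at 7 seconds (values are tenths of a second) so mdadm isn't kept waiting
    smartctl -l scterc,70,70 /dev/sda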

Thanks Locutus. I really do appreciate your incredible helpfulness.

As far as it goes, I was able to get the new drive installed and working - it is simply that it too was marked “failed” by mdadm.

The original disks are probably less than three years old. They are Maxtor Fireball 120GB drives, purchased new for the express purpose of building the machine - perhaps in 2010. The new drive was a WD Caviar 500 GB.

I’m quite certain that the problem is due somehow to my improper configuration. Because the most important thing to me is time, I think it would be best to simply ditch the RAID unless that learning curve can be effectively cut to a day or so.

If a hardware RAID can be had that will handle the drives as-is, that would be nice, but something tells me that’s an unlikely and expensive fantasy. Which leads me to wonder, is it possible to somehow create physical disks that contain the same data which is now partially RAID 1 and partially RAID 5’d across three “failing” disks?

Thanks again,

Tony

Out of curiosity, if you run the command “dmesg”, are you seeing any error output at the end of that?

If you were dealing with some sort of hardware error, it would likely be throwing errors that show up in that dmesg output.
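Something along these lines (device names are just examples) would narrow the output down to disk-related messages:

    # show only kernel messages mentioning ATA links, SCSI disks, md arrays or I/O errors
    dmesg | grep -iE 'ata[0-9]|sd[a-z]|md[0-9]|i/o error'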

-Eric

Yes, what Eric said, and additionally you should examine the SMART data like I mentioned before.

See if the disks report any SMART data that is indicative of failure. You can also instruct your drives to perform self-tests. ATTENTION! In your situation, you should perform those self-tests only when the RAID arrays are not assembled/running, i.e. from a rescue CD! That is because the self-tests, especially the long one, can cause the drive to respond very slowly to OS commands, thus making mdadm drop the drive from the array because it thinks it’s defective.
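A rough sketch of the self-test commands (again, run from a rescue environment with the arrays stopped; /dev/sda is only an example):

    # start the long (thorough) self-test; it runs in the background inside the drive
    smartctl -t long /dev/sda
    # later, check whether the test finished and whether it logged any errors
    smartctl -l selftest /dev/sda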

As for your Maxtor drives: I didn’t know they still sold drives as small as 120 GB in 2010? Sounds rather outdated.

I have mdadm experience with the following drives. At home, I use three WD EARS (1.5 TB) in a RAID-5 for my NAS, and at university, we have two WD RE4 (2.0 TB) in RAID-0. These work okay, all under Ubuntu 10.04.

For server purposes, I’d always suggest using RAID-1 rather than RAID-5, unless storage space vs. HDD price is a really big issue (which it shouldn’t be when operating a server).

Your last question I didn’t understand. Can you re-phrase “create physical disks that contain the same data which is now partially RAID 1 and partially RAID 5’d across three “failing” disks” please?

Hi, thanks. When I ran the SMART utility from Virtualmin, it did say something about “old age”. I apologize for not being more specific; I cannot get to the machine at this time, and I ran the command before you guys suggested it.

As for dmesg, I will run that command when I am physically near the machine again.

Since you are saying that RAID level 1 is the only thing I should use, I would rather go with a nightly full-disk dd command and leave it at that. Thus, I want to get rid of the RAID entirely. My last question was in reference to doing just that. Is there a procedure to simply put the contents of these disks back onto a single disk, or is that wishful thinking?

I’ve had good luck restoring Virtualmin sites, so I have a lot more faith that rebuilding the machine will go off without huge hitches, if that is my least labor-intensive path to freedom from the RAID.

For me, because of my lack of knowledge, the RAID has become more of a burden than a tool.

Thanks for your help,

Tony

Migrating an existing installation from RAID to non-RAID is – at least on Ubuntu, I suppose it’s the same for CentOS – a bit tricky, but when you know the right steps, rather simple.

What you need to do is create the proper partitions on the new drive and use “rsync” to copy the disk contents from the RAID partitions to their non-RAID counterparts. That’s the easy bit. Then, to get the boot loader onto the new drive, you need to mount the root partition of the new drive somewhere reachable, bind-mount /dev, /proc and /sys into that root mount, and use chroot to go into it. Then run grub-install and update-grub.
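Roughly, the sequence looks like this. This is a sketch only, following the Ubuntu-flavoured steps above; /dev/sdc, its first partition and /mnt/newroot are hypothetical names, so adjust them to however you partition the new disk:

    # mount the new, non-RAID root partition and copy the live system onto it
    mkdir -p /mnt/newroot
    mount /dev/sdc1 /mnt/newroot
    rsync -aAXH --exclude={/dev/*,/proc/*,/sys/*,/tmp/*,/mnt/*} / /mnt/newroot
    # make kernel interfaces available inside the copy, then chroot into it
    mount --bind /dev  /mnt/newroot/dev
    mount --bind /proc /mnt/newroot/proc
    mount --bind /sys  /mnt/newroot/sys
    chroot /mnt/newroot
    # install the boot loader on the new disk and regenerate its config
    grub-install /dev/sdc
    update-grub
    exit

Also remember to edit /etc/fstab inside the copy so it mounts the new partition (or its UUID) instead of /dev/md1.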

Here’s a website that I use for reference when doing such moves:

http://realtechtalk.com/Ubuntu_1004GRUB2_mdadm_wont_boot-1070-articles

The instructions are for going from non-RAID to RAID, but they work in an analogous way for the other direction. Just skip the mdadm bit and install GRUB to just one drive.

The very last messages in dmesg report that the RAID was successfully rebuilt, although the mdadm report for the RAID 0 reports “[U_]”.

This SMART Report is confusing to me:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 140 139 021 Pre-fail Always - 3966
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 19
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 16
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 19
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 13
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 67
194 Temperature_Celsius 0x0022 110 104 000 Old_age Always - 33
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0

But this is the NEW drive!

Could that 33C be the culprit? Is the server room too hot?

Thanks,

Tony

33 °C is perfectly okay for an HDD, and the SMART data looks a-okay.

Could it be the fstab configuration that is causing the problem?

/dev/md1 / ext3 grpquota,usrquota,rw 0 1
/dev/md2 /tmp ext3 nosuid,noexec,nodev,rw 0 0
/dev/md0 /boot ext3 defaults 1 2
tmpfs /dev/shm tmpfs nosuid,noexec,nodev,rw 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
LABEL=SWAP-sdc3 swap swap defaults 0 0
LABEL=SWAP-sdb3 swap swap defaults 0 0
LABEL=SWAP-sda3 swap swap defaults 0 0

Howdy,

Your fstab shouldn’t affect the workings of a software RAID device. It’d normally be the other way around :)

And your fstab looks pretty normal.

What does /proc/mdstat contain, out of curiosity?
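In case it helps, something like this from a root shell will show it (md0 is just an example array name):

    # overview of all arrays, their members and any rebuild in progress
    cat /proc/mdstat
    # more detail on a single array, including which members are faulty or missing
    mdadm --detail /dev/md0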

-Eric

Just to close this issue. I re-added the new drive after the previous failure and haven’t received any failure notifications from that machine since.

I also noticed that the actual fail notification from the other machine was from MONTHS earlier. I must have forgotten about it and Google simply lumped it into the threaded email because of a similar subject line…

… so go figure.

It has all been up and running for the last few days now.

Thanks, everyone, for your never-ending striving to make life great for the rest of us. Virtualmin REALLY IS the reason I run on Linux platforms. I’m sure that’s true for thousands and thousands of people.

Thanks again,

Tony

Virtualmin REALLY IS the reason I run on Linux platforms. I’m sure that’s true for thousands and thousands of people.

Yes, actually, Virtualmin was also the reason for me to switch from Windows to Linux for my web hosting platform. :)

I recently had a failing disk, too (software RAID). The SMART info in Webmin was actually very helpful here in getting the disk replaced in time, but I do agree that Webmin could provide a helping hand when integrating a replacement disk back into the system, i.e. partitioning the new disk according to the existing disks, detaching the removed partitions, adding the new partitions to each /dev/mdX and installing/updating the boot loader.

Also, /proc/mdstat should probably be evaluated as part of the status display in the “Linux RAID” module, because it showed “Active (green)” despite the RAID missing several partitions… so it would probably be better to show the actual RAID status there, too?

What a mess this thread turned into…

The only three steps you needed to do were to fail the whole drive (sda) for all its partitions and then replace that drive.

The RAID would have taken care of the rest once you copied the partition structure to the new drive and then assigned the new partitions to the array.

It’s a very easy thing to do.

  1. Fail the sda drive, shut down the server, have sda replaced, then boot the server back up
  2. Log in as root and issue this command: sfdisk -d /dev/sdb | sfdisk /dev/sda
  3. Log into Webmin and use the RAID module to add your new sda drive into the array (or do it from the command line, as sketched below).
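In terms of raw commands, that amounts to roughly the following (a sketch only: md0/md1 and the sda1/sda2 partitions are examples, so match them to your own layout):

    # mark every partition of the dying disk as failed and remove it from its array
    mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
    mdadm /dev/md1 --fail /dev/sda2 --remove /dev/sda2
    # (power down, swap the physical disk, boot again)
    # copy the partition table from the surviving disk to the new one
    sfdisk -d /dev/sdb | sfdisk /dev/sda
    # add the fresh partitions back into their arrays and watch the rebuild
    mdadm /dev/md0 --add /dev/sda1
    mdadm /dev/md1 --add /dev/sda2
    cat /proc/mdstat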

That’s it…

The Webmin RAID module is your friend and will do nearly everything you need done.