RAID Disk Failed... What Now?

Long story short: I successfully replaced the drive in the array.

… but it’s almost inconceivably unlikely that (the disk) hardware was the problem!

Not only has the new drive already failed out of the array, but an identical machine running different domains also now has a failed array.

These are two nearly identical machines, side by side. The only thing that has changed lately is the amount of traffic, which has gone from practically none to thousands of hits a day on both machines. It’s not a lot of traffic, and these machines should easily be able to handle it… but BOTH machines have sent me “degrade event” emails followed by “failed drive” events.

Yes, it’s possible that three drives failed simultaneously on two separate machines. Perhaps the power is bad or the temperature is high. But I think not.

It must be related to the traffic.

I believe my drives are suffering read errors due to the amount of traffic.

How can I verify this, or, to cut to the chase, just assume it’s true and keep my drives from suffering read errors under such puny traffic?

Thanks again,

Tony

Sorry, wanted to reply earlier, but the site was down all day.

First, you need to elaborate on “does not boot”. With that description alone, it is impossible to give any hints. What exactly happens when you try to boot with a new disk?

You could also use a rescue CD: boot from that, use it to resync the array with a new disk, then install the boot manager on the new drive. You need to bind-mount /dev, /proc and /sys into the directory where you mount the root of your to-be-repaired installation, and use chroot. Google should find you tutorials on how to do that.
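As a rough sketch of that chroot dance (assuming, purely as an example, that your root array is /dev/md1, the new disk is /dev/sdb, and /mnt/target is a scratch mount point; adjust the names to your setup):

    # mount the root filesystem of the installation to be repaired
    mkdir -p /mnt/target
    mount /dev/md1 /mnt/target
    # make the running kernel's device, process and sysfs trees visible inside it
    mount --bind /dev  /mnt/target/dev
    mount --bind /proc /mnt/target/proc
    mount --bind /sys  /mnt/target/sys
    # switch into the installation and reinstall the boot loader on the new disk
    chroot /mnt/target
    grub-install /dev/sdb
    exit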

“High traffic” should, under normal circumstances, never be a cause for RAID failure. HDDs are made to transfer data at full speed, even for extended periods of time. E.g. during a resync, the whole disk is read and written at high speed. “1000s of hits per day” on a webpage is very, very low traffic in terms of required disk I/O.

It is indeed unlikely that three drives fail nearly at once. You can use “smartctl” to check the SMART data of your drives and see if there is any indication of actual failure.
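For example (a minimal sketch; /dev/sda is just a placeholder, repeat for each array member):

    # print the overall health verdict plus all SMART attributes for one drive
    smartctl -H -A /dev/sda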

Otherwise, it is possible that the HDDs used are unsuitable for mdadm usage. Some drives can, under certain circumstances, take a very long time to respond to OS commands, which mdadm might interpret as a failure. You might check Google whether that is the case for your drives.
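One thing worth trying (not all drives support it, so treat this as a sketch) is querying the SCT Error Recovery Control setting, which controls how long the drive keeps retrying a bad sector before reporting an error:

    # show the current read/write error-recovery timeouts, if the drive supports SCT ERC
    smartctl -l scterc /dev/sda
    # cap both timeouts at 7 seconds (values are tenths of a second) so mdadm isn't kept waiting
    smartctl -l scterc,70,70 /dev/sda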

Thanks Locutus. I really do appreciate your incredible helpfulness.

As far as it goes, I was able to get the new drive installed and working - it is simply that it too was marked “failed” by mdadm.

The original disks are probably less than three years old. They are Maxtor Fireball 120GB drives, purchased new for the express purpose of building the machine - perhaps in 2010. The new drive was a WD Caviar 500 GB.

I’m quite certain that the problem is due somehow to my improper configuration. Because the most important thing to me is time, I think it would be best to simply ditch the RAID unless that learning curve can be effectively cut to a day or so.

If a hardware RAID can be had that will handle the drives as-is, that would be nice, but something tells me that’s an unlikely and expensive fantasy. Which leads me to wonder, is it possible to somehow create physical disks that contain the same data which is now partially RAID 1 and partially RAID 5’d across three “failing” disks?

Thanks again,

Tony

Out of curiosity, if you run the command “dmesg”, are you seeing any error output at the end of that?

If you were dealing with some sort of hardware error, it would likely be throwing errors that show up in that dmesg output.
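Something along these lines (device names are just examples) would narrow the output down to disk-related messages:

    # show only kernel messages mentioning ATA links, SCSI disks, md arrays or I/O errors
    dmesg | grep -iE 'ata[0-9]|sd[a-z]|md[0-9]|i/o error'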

-Eric

Yes, what Eric said, and additionally you should examine the SMART data like I mentioned before.

See if the disks report any SMART data that is indicative of failure. You can also instruct your drives to perform self-tests. ATTENTION! In your situation, you should perform those self-tests only when the RAID arrays are not assembled/running, i.e. from a rescue CD! That is because the self-tests, especially the long one, can cause the drive to respond very slowly to OS commands, thus making mdadm drop the drive from the array because it thinks it’s defective.
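A rough sketch of the self-test commands (again, run from a rescue environment with the arrays stopped; /dev/sda is only an example):

    # start the long (thorough) self-test; it runs in the background inside the drive
    smartctl -t long /dev/sda
    # later, check whether the test finished and whether it logged any errors
    smartctl -l selftest /dev/sda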

As for your Maxtor drives: I didn’t know they still sold drives as small as 120 GB in 2010? Sounds rather outdated.

I have mdadm experience with the following drives. At home, I use three WD EARS (1.5 TB) in a RAID-5 for my NAS, and at university, we have two WD RE4 (2.0 TB) in RAID-0. These work okay, all under Ubuntu 10.04.

For server purposes, I’d always suggest using RAID-1 rather than RAID-5, unless storage space vs. HDD price is a really big issue (which it shouldn’t be when operating a server).

Your last question I didn’t understand. Can you re-phrase “create physical disks that contain the same data which is now partially RAID 1 and partially RAID 5’d across three “failing” disks” please?

Hi, thanks. When I ran the SMART utility from Virtualmin, it did say something about “old age”. I apologize for not being more specific; I cannot get to the machine at this time, and I ran the command before you guys suggested it.

As for dmesg, I will run that command when I am physically near the machine again.

Since you are saying that RAID level 1 is the only thing I should use, I would rather go with a nightly full-disk dd command and leave it at that. Thus, I want to get rid of the RAID entirely. My last question was in reference to doing just that. Is there a procedure to simply put the contents of these disks back onto a single disk, or is that wishful thinking?

I’ve had good luck restoring Virtualmin sites, so I have a lot more faith that rebuilding the machine will go off without huge hitches, if that is my least labor-intensive path to freedom from the RAID.

For me, because of my lack of knowledge, the RAID has become more of a burden than a tool.

Thanks for your help,

Tony

Migrating an existing installation from RAID to non-RAID is – at least on Ubuntu, I suppose it’s the same for CentOS – a bit tricky, but when you know the right steps, rather simple.

What you need to do is create the proper partitions on the new drive and use “rsync” to copy the disk contents from the RAID partitions to their non-RAID counterparts. That’s the easy bit. Then, to get the boot loader onto the new drive, you need to mount the root partition of the new drive somewhere reachable, bind-mount /dev, /proc and /sys into that root mount, and use chroot to go into it. Then run grub-install and update-grub.
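Roughly, the sequence looks like this. This is a sketch only, following the Ubuntu-flavoured steps above; /dev/sdc, its first partition and /mnt/newroot are hypothetical names, so adjust them to however you partition the new disk:

    # mount the new, non-RAID root partition and copy the live system onto it
    mkdir -p /mnt/newroot
    mount /dev/sdc1 /mnt/newroot
    rsync -aAXH --exclude={/dev/*,/proc/*,/sys/*,/tmp/*,/mnt/*} / /mnt/newroot
    # make kernel interfaces available inside the copy, then chroot into it
    mount --bind /dev  /mnt/newroot/dev
    mount --bind /proc /mnt/newroot/proc
    mount --bind /sys  /mnt/newroot/sys
    chroot /mnt/newroot
    # install the boot loader on the new disk and regenerate its config
    grub-install /dev/sdc
    update-grub
    exit

Also remember to edit /etc/fstab inside the copy so it mounts the new partition (or its UUID) instead of /dev/md1.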

Here’s a website that I use for reference when doing such moves:

http://realtechtalk.com/Ubuntu_1004GRUB2_mdadm_wont_boot-1070-articles

The instructions are for going from non-RAID to RAID, but they work in an analogous way for the other direction. Just skip the mdadm bit and install GRUB to just one drive.

The very last messages in dmesg report that the RAID was successfully rebuilt, although the mdadm report for the RAID 0 reports “[U_]”.

This SMART Report is confusing to me:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 140 139 021 Pre-fail Always - 3966
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 19
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 16
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 19
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 13
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 67
194 Temperature_Celsius 0x0022 110 104 000 Old_age Always - 33
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0

But this is the NEW drive!

Could that 33C be the culprit? Is the server room too hot?

Thanks,

Tony

33 °C is perfectly okay for an HDD, and the SMART data looks a-okay.

Could it be the fstab configuration that is causing the problem?

/dev/md1 / ext3 grpquota,usrquota,rw 0 1
/dev/md2 /tmp ext3 nosuid,noexec,nodev,rw 0 0
/dev/md0 /boot ext3 defaults 1 2
tmpfs /dev/shm tmpfs nosuid,noexec,nodev,rw 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
LABEL=SWAP-sdc3 swap swap defaults 0 0
LABEL=SWAP-sdb3 swap swap defaults 0 0
LABEL=SWAP-sda3 swap swap defaults 0 0

Howdy,

Your fstab shouldn’t affect the workings of a software RAID device. It’d normally be the other way around :)

And your fstab looks pretty normal.

What does /proc/mdstat contain, out of curiosity?
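In case it helps, something like this from a root shell will show it (md0 is just an example array name):

    # overview of all arrays, their members and any rebuild in progress
    cat /proc/mdstat
    # more detail on a single array, including which members are faulty or missing
    mdadm --detail /dev/md0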

-Eric

Just to close this issue. I re-added the new drive after the previous failure and haven’t received any failure notifications from that machine since.

I also noticed that the actual fail notification from the other machine was from MONTHS earlier. I must have forgotten about it and Google simply lumped it into the threaded email because of a similar subject line…

… so go figure.

It has all been up and running for the last few days now.

Thanks, everyone, for your never-ending striving to make life great for the rest of us. Virtualmin REALLY IS the reason I run on Linux platforms. I’m sure that’s true for thousands and thousands of people.

Thanks again,

Tony

Virtualmin REALLY IS the reason I run on Linux platforms. I’m sure that’s true for thousands and thousands of people.

Yes, actually, Virtualmin was also the reason for me to switch from Windows to Linux for my web hosting platform. :)

I recently had a failing disk, too (software RAID). The SMART info in Webmin was actually very helpful here in getting the disk replaced in time, but I do agree that Webmin could provide a helping hand when integrating a replacement disk back into the system, i.e. partitioning the new disk according to the existing disks, detaching the removed partitions, adding the new partitions to each /dev/mdX and installing/updating the boot loader.

Also, /proc/mdstat should probably be evaluated as part of the status display in the “Linux RAID” module, because it showed “Active (green)” despite the RAID missing several partitions… so it would probably be better to show the actual RAID status there, too?

What a mess this thread turned into…

The only three steps you needed to do were to fail the whole drive (sda) for all its partitions and then replace that drive.

The RAID would have taken care of the rest once you copied the partition structure to the new drive and then assigned the new partitions to the array.

It’s a very easy thing to do.

  1. Fail the sda drive, shut down the server, have sda replaced, then boot the server back up
  2. Log in as root and issue this command: sfdisk -d /dev/sdb | sfdisk /dev/sda
  3. Log into Webmin and use the RAID module to add your new sda drive into the array (or do it from the command line, as sketched below).
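In terms of raw commands, that amounts to roughly the following (a sketch only: md0/md1 and the sda1/sda2 partitions are examples, so match them to your own layout):

    # mark every partition of the dying disk as failed and remove it from its array
    mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
    mdadm /dev/md1 --fail /dev/sda2 --remove /dev/sda2
    # (power down, swap the physical disk, boot again)
    # copy the partition table from the surviving disk to the new one
    sfdisk -d /dev/sdb | sfdisk /dev/sda
    # add the fresh partitions back into their arrays and watch the rebuild
    mdadm /dev/md0 --add /dev/sda1
    mdadm /dev/md1 --add /dev/sda2
    cat /proc/mdstat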

That’s it…

The Webmin RAID module is your friend and will do nearly everything you need done.