RAID Disk Failed... What Now?

Hi, I have a RAID setup in my machine. One of the drives failed.

There are three drives listed in the array, md0, md1, and md2.

They all have a screen that looks like the attached screenshot, except md0’s says it is mounted on /boot instead of /tmp. In the second screenshot I notice the drives are different sizes, and that I don’t have the faintest clue what I’m doing.

I found a site that probably tells me exactly how to fix it, but I am not bright enough to understand. Can anyone give me “how to recover your array with Webmin and Virtualmin for Dummies” version? Or perhaps tell me where to find it?

Thanks!

You didn’t really provide any useful information for us to help you :) Not even the screenshots you mentioned are there :)

What distro? What kind of RAID?

Well, it’s NOT because I’m nervous. I just like to mix it up like that :^)

I’ve attached the images. This is a CentOS machine. Unfortunately, I’m confused about which kind of RAID it is because I’m not clear on which drive failed. To further demonstrate my confusion, md0 is listed as RAID level 1… but there’s only one drive!

The other two drives are listed as RAID level 5. Their sizes do not match. The machine appears to be operating fine.

I apologize for being so ignorant.

Ok - how about this… I have a VirtualBox version of this machine I could bring up. However, I’m afraid to bring DOWN the real machine in case it won’t come back up.

Since I don’t know how to proceed and feel quite a bit of urgency, I wonder if anyone can tell me the best course of action to take at this time… specifically, should I shut down the real machine and bring up the VirtualBox image?

Thanks,

Tony

OK, sorry for posting again. Here is where I’m at trying to work through this. I have shut down the real server and brought up a VirtualBox image of it running elsewhere.

Before I shut down the real server, I ran “cat /proc/mdstat” and got:

Personalities : [raid6] [raid5] [raid4] [raid1]

md0 : active raid1 sda1[2](F) sdb1[1] sdc1[0]
      256896 blocks [2/2] [UU]

md2 : active raid5 sdc2[2] sdb2[1] sda2[3](F)
      20482560 blocks level 5, 256k chunk, algorithm 2 [3/2] [_UU]

md1 : active raid5 sdc5[2] sdb5[1] sda5[3](F)
      198627328 blocks level 5, 256k chunk, algorithm 2 [3/2] [_UU]

unused devices: <none>

I believe this means sda has failed. sda is an IDE disk; the other two are SATA, if I recall. I am GUESSING this is what I would need to do:

  • bring the physical server back up (will it ever boot again, now that I've shut it down?)
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda5
mdadm --manage /dev/md2 --fail /dev/sda2
  • shutdown (do I really have to shut down??? I'd rather not if it's possible to do this while the server's up)
  • replace the dead disk
  • boot (but will it boot with the new unformatted disk???)
fdisk -l /dev/sda

mdadm --manage /dev/md0 --add /dev/sda1
mdadm --manage /dev/md1 --add /dev/sda5
mdadm --manage /dev/md2 --add /dev/sda2

Does that seem like the right steps, all the right steps, and nothing but the right steps?

Thanks again,

Tony

Some comments from my end… Not a “dummy walkthrough” though, since my mdadm experience is a bit rusty. So please take these as advice only; don’t execute any commands I give without verification!

Your md0 is indeed a RAID-1, with three disks according to mdadm. Did you set it up that way intentionally? It is possible of course, having multiple mirrored disks. It seems that only sdb1 and sdc1 are active in md0, and sda1 is set up as a hot spare (now marked failed).

Where do you see that their “sizes do not match”? Sizes of RAID member partitions MUST match (okay, the smallest one dictates the size for the array actually).
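If you want to compare the member sizes yourself, these two read-only commands list every block device and the partition tables of all disks; check that the sda/sdb/sdc partitions belonging to the same md device are (roughly) the same size:

cat /proc/partitions   # every block device with its size in 1K blocks
fdisk -l               # partition tables of all detected disks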

A note on reading that proc/mdstat output: according to the mdadm documentation, the partitions listed there have their device number appended in square brackets, and “(F)” follows if that member has failed. In your printout, the sda partition carries the “(F)” marker in all three arrays.

To get details about the failed drive, you can use the commands “mdadm -E /dev/sda1” (examine, used on physical partitions) and “mdadm -D /dev/md0” (detail, used on md devices). Make sure you find out from those which drive actually failed.
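For example, using the member names from your mdstat printout (substitute whichever devices you are checking):

mdadm -D /dev/md0    # array overview: which members are active, faulty or spare
mdadm -E /dev/sda1   # that partition's own view of the array; look at its "State" line
cat /proc/mdstat     # quick status; failed members are tagged with (F)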

To remove them from the array, you IMO wouldn’t need the “--fail” command, since the drive has already failed, but rather “--remove”. Check “man mdadm” for details; the -E and -D output should also tell you more. You’ll need “--fail” only if the defective disk is a member of other md devices and has not failed there.
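As a sketch only, the removal could look like this with your device names; verify each one against the -E/-D output before running anything:

mdadm --manage /dev/md2 --remove /dev/sda2   # drop the failed member from md2
mdadm --manage /dev/md1 --remove /dev/sda5   # same for md1
mdadm --manage /dev/md0 --remove /dev/sda1   # and from md0, if it is marked failed there too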

About sda being an IDE disk: Make sure that is really the case. Old IDE disks usually get “hdX” as device nodes and not “sdX”.
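One way to check what sda really is (smartctl exists only if the smartmontools package is installed):

dmesg | grep -i sda    # the kernel log usually names the driver/controller behind the disk
smartctl -i /dev/sda   # prints model, serial number and interface details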

Do NOT replace the drive while the server is running, unless you have a SATA controller and power connectors that are specifically meant for hot-swapping!

Whether the server will boot again, before and after you remove the defective disk, depends on whether the boot loader (GRUB?) is installed on all the RAID members. If you configured the RAID during OS installation, the installer should have done that for you, otherwise you’ll want the commands “grub-install” and “update-grub”. Check their man pages; I hope those apply to your CentOS, I’m using Ubuntu/Debian.
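Note that “update-grub” is Debian/Ubuntu-specific; CentOS ships GRUB legacy, where the GRUB shell is the usual route. A rough sketch only, in which the (hd1,0) mapping is an assumption that you must confirm with the “find” command first:

grub
grub> find /grub/stage1   # lists the (hdX,Y) devices that contain the boot files
grub> root (hd1,0)        # assumption: the /boot partition on the second disk
grub> setup (hd1)         # writes GRUB to that disk's MBR
grub> quit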

Also your BIOS needs to be configured to boot not only from the first HDD but from subsequent ones, in case the failed disk is the first in your system.

It was a SATA disk.

Because I suffer from early onset hyper brain dysfunctional spasmosis, I didn’t wait around for any advice even though the backup server was already running.

Instead, before I shut the server down I ran the --fail commands.

Needless to say, now when I boot it gets as far as:

“Kernel panic - not syncing: Attempted to kill init”

and stops there.

There’s no CD or DVD drive in the machine. If I install one and boot from a CentOS install disk, can I save the array somehow?

Thanks again,

Tony

I’m not sufficiently familiar with CentOS, but in Ubuntu you can boot the install CD to a rescue shell, with mdadm loaded and active, and you can run the test commands I mentioned and should also be able to perform the disk swap and resync from there.

So basically most of what I said in my post is still valid, as long as you can do the required stuff from your install CD. Getting the boot loader back on might be a bit more complicated.

Yeah, there is indeed a rescue mode on the CentOS install CDs – you may be able to figure things out from there.
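Roughly, the sequence from the rescue environment might look like this (details vary by CentOS version, so treat it only as a sketch):

linux rescue              # typed at the boot: prompt of the install media
mdadm --assemble --scan   # if the rescue image didn't already assemble your arrays
cat /proc/mdstat          # confirm md0/md1/md2 are up (degraded is fine)
chroot /mnt/sysimage      # if rescue mode found and mounted your installation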

Many systems can boot from a USB drive, so you may be able to load the CentOS ISO onto a USB drive rather than having to install a CD/DVD drive in your server.

-Eric

Ok, before I take a shot at the “rescue mode”, could I simply take the old bad disk, put it in a Windows-based machine, then use Norton Ghost to bit-copy it over to the new drive?

Ahem, then pop it in and go have a beer?

Thanks,

Tony

That wouldn’t help, since you’ve been using the other two disks of the RAID-5 after the defective disk got dropped from it. Which means the disks are now out-of-sync. Even if not, the RAID information now records the defective disk as failed, and you’ll need to re-add it, no matter what.

IF the defective disk and the other two were still in sync, you could perform your Ghost copy, then force-create a new RAID-5 using the “--assume-clean” option, which skips the initial synchronization. But this only works if ONLY the array composition information got garbled and the array itself is still fully intact and in sync. You should not do this in any other case.
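Purely for illustration, and NOT something to run in your current out-of-sync situation, such a force-create for your md1 might look like the line below. The member order and chunk size must exactly match the original array (check with mdadm -E first); getting anything wrong destroys the data:

mdadm --create /dev/md1 --level=5 --raid-devices=3 --chunk=256 --assume-clean /dev/sda5 /dev/sdb5 /dev/sdc5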

So, to get the RAID-5 up and running again, the best course of action is to perform a resync through mdadm with a new HDD.

And actually, this is an intended and regular process for a RAID array; so to speak, doing this is why you’re using RAID at all: to be able to replace a defective disk and re-integrate the new one into your disk set. If you start fiddling with external copies of defective disks now, you might as well stop using RAID altogether. :)

Good point.

Should I leave the BAD disk in when I boot to the recovery CD, or install the new disk now, first, before I restart?

Thanks.

Tony

I’m sorry folks, I have a very good general idea of what I must do, unfortunately the devil is in the details.

May I use the “Net Install” disk? That’s all I seem to have of CentOS.

Now that I did NOT --remove the disk and instead --fail(ed) it, must I leave it in and boot to the install media, select “repair” somehow, then get into disk utils in some way, then --remove the bad one?

Then, after that, shutdown, install new disk, and use a similar procedure but add the new disk back into the array, then rebuild it… right?

Oh, and then add the grub loader to all the disks in the array… ?

Hi intelligent knowledgeable people,

After I burned disk one of the CentOS image, I went to go install the new disk and a) the cord to sda simply fell out into my hand and b) I thought about how much traffic has been slamming the machine.

I plugged the cord back in and ran the commands to get sda1 synced. It appears to be resyncing the RAID now.

Assuming the RAID does in fact resync, would you go ahead and change the drive anyway? I may very well have knocked the SATA cord loose myself yesterday, and heavy traffic alone probably shouldn’t cause read errors… but to me, the fact that the cord was inexplicably loose while traffic has been up 1000% for two weeks suggests the disk should be given another chance.

What would you do at this point? I already have the new disk, but remember, I don’t know anything about “Creating partitions with the original layout” or anything like that… whatever instructions I got to set it up in the first place, I learned here.

So, what would some of you much more knowledgeable people do at this point? ASSUMING it might resync, would you keep the old disk and assume the cable was loose or traffic levels caused a read error, or just use the new disk to get rid of variables and for safety’s sake?

If you’d keep the old disk, can the new one be used as a “spare”? Where might I learn what a “spare” actually is and what that entails?

Thanks,

Tony

If mdadm resyncs the old disk without errors, you can assume that it is still working and the SATA cable was indeed the culprit. mdadm does thorough tests during resync, and if it can’t write or re-read any block, it will stop and tell you so.

So in that case you wouldn’t need to replace the disk yet.
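If you want to keep an eye on the resync while it runs:

watch cat /proc/mdstat   # shows rebuild progress and the estimated finish time
dmesg | tail             # any read or write errors will show up in the kernel log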

As for duplicating the partition layout – there is a syntax for “sfdisk” that takes the partitioning of one disk and duplicates it onto another. This forum post might help; otherwise Google will surely find it:

http://forum.soft32.com/linux/gentoo-howto-fdisk-input-fdisk-ftopict326825.html

This command should do it. VERIFY BEFORE EXECUTING!! Overwriting the partition table of the wrong disk will thoroughly nuke it.

# sfdisk -d /dev/sda | sfdisk /dev/sdb # Overwrites sdb's partition table with that on sda

A “spare” disk is an up-and-running HDD in the system that is registered in an array but is not an active part of it. It takes over for a defective disk automatically: mdadm will resync onto a spare if an active drive fails.
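For example, once your arrays are healthy again, adding another correctly partitioned disk simply registers it as a spare (the sdd5 device name here is hypothetical):

mdadm --manage /dev/md1 --add /dev/sdd5   # on a non-degraded array this becomes a hot spare
mdadm -D /dev/md1                         # the "Spare Devices" count should now read 1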

Thanks Locutus.

The disk did resync ok - unfortunately, errors began again within a few hours, so I was forced to bring the VirtualBox image back up and do it right.

I’ll try to install the new disk this weekend. Thanks for the tutorial and pointers.

Do spare disks need to be formatted and partitioned to the same type of layout of any particular disk? If so, wouldn’t that limit a spare disk in terms of which disk it can be called into service for?

Thanks,

Tony

What errors exactly are you seeing? If you had a loose SATA cable, the controller might still be the problem and not the HDD.

As for spare disks: Yes, since you assign partitions as spare to an array and not a whole drive, they need to be partitioned like member drives before they can be used.

Well actually, that’s only half-true. In your specific case it’s true, since your existing array uses partitions. mdadm can also operate on whole raw drives, without making partitions, by using e.g. /dev/sda and /dev/sdb when creating the array, as opposed to /dev/sda1. Naturally you can then only have one array per set of drives.
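For illustration only, an array built from raw disks instead of partitions would be created like this (hypothetical devices, not something to do with your existing partition-based setup):

mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdd /dev/sde   # whole disks, no partition tables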

This is why I use hardware RAID: it’s more expensive, but a lot easier to figure out if something stops working. You can get SATA PCI-X cards on eBay pretty cheap now.

I agree that a good HW RAID card is more reliable than software RAID, yet rather expensive to get a good one, and less flexible in terms of array composition. But if you can spend the money and don’t need the flexibility, HW RAID is the way to go. :)

Sigh - I feel so stupid. I just don’t get this. I have (apparently) removed sda from everything I see it listed in. I turn off the machine, replace the faulty drive with a new one, but then it won’t boot.

I reinstall the faulty drive, and it DOES boot, even though I’ve supposedly removed the drive from every array I can see it involved in, i.e.:

mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md0 --remove /dev/sda1

Now why would it boot?

Conversely, if I --fail the drive, --remove it, then install grub on some other disk in the array, it does not boot. Thus, I can’t run the commands to add the new drive to the array.

And if I boot from the recovery console, the drives aren’t mounted at all!

Any hints?

Thanks