CentOS Cloudmin KVM 3TB partition - filesystem errors - the system died - research

Hello,

I have a Cloudmin CentOS 6.5 host with 2 KVM guests that are completely dead because the filesystems of both guests were destroyed.

I think this real-life experience might be of interest to you.

Here is the setup: on a CentOS 6.5 machine (default minimal installation, just updated) with Cloudmin GPL (KVM) (default settings), I installed two KVM guest machines, both CentOS 6.5 (again updated to the latest kernel). Both guests use LVM volumes for their virtual disks and have the following configuration:

KVM1:
root - 20 GB LVM volume (cache:none)
swap - 8GB LVM volume (cache:none)
/home - 50GB LVM volume (cache:none)

RAM allocated: 2GB

KVM2:
root - 200 GB LVM volume (cache:none)
swap - 32GB LVM volume (cache:none)
/home - 3TB LVM volume (cache:none)

RAM allocated: 22GB

Both KVM guests have 8 virtual processors each:
Number of virtual CPUs: 8
Cores per socket: 8
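
For reference, a guest configured like this boils down to a qemu-kvm command line roughly like the one below (a minimal sketch only: the volume group and LV names are placeholders, and Cloudmin builds its own, longer command line, so the exact flags may differ):

    /usr/libexec/qemu-kvm -m 2048 -smp 8,cores=8 \
        -drive file=/dev/VolGroup/kvm1_root,if=virtio,cache=none \
        -drive file=/dev/VolGroup/kvm1_swap,if=virtio,cache=none \
        -drive file=/dev/VolGroup/kvm1_home,if=virtio,cache=none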

The 3TB partition was created manually using gparted, because parted (which Cloudmin uses) can only create partitions up to 1TB, so Cloudmin gives an error.
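
For anyone who hits the same limit: a disk bigger than 2TB needs a GPT label, which can also be done by hand with parted (a sketch, assuming the 3TB virtual disk shows up inside the guest as /dev/vdb):

    parted -s /dev/vdb mklabel gpt                    # GPT is required beyond the 2TB MBR limit
    parted -s /dev/vdb mkpart primary ext4 1MiB 100%
    mkfs.ext4 /dev/vdb1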

KSM (the memory deduplication daemon) was turned off on the host system.
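
For completeness, by "turned off" I mean stopping the two services that ship with the stock qemu-kvm packages on CentOS 6 (roughly like this):

    service ksmtuned stop
    service ksm stop
    chkconfig ksmtuned off
    chkconfig ksm off
    cat /sys/kernel/mm/ksm/run    # 0 means KSM is no longer merging pages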

The whole setup worked pretty smoothly for a few weeks, and then:

  1. First, I noticed huge spikes in the CPU load of the host system. The spikes lasted a few seconds each, starting from 1.5 and going up to 7 or even 8 at the moments when the system died. During a spike there was no CPU usage on the guest systems, so the spike is on the host machine only (see the monitoring sketch after this list).

  2. At some point I started getting errors on the KVM guests' filesystems. The first few were harmless (I thought I simply hadn't shut the guests down cleanly), but then they got worse and worse, with more and more errors everywhere. I get errors like this in dmesg:

kernel: EXT3-fs error (device sda4): ext3_lookup: deleted inode referenced: 1679361

or the ext4 equivalent, since the /home filesystems were ext4 (so it is not ext3- or ext4-specific).
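
For anyone trying to reproduce this, the spikes show up with the usual tools (nothing Cloudmin-specific; iostat comes from the sysstat package):

    vmstat 1          # watch the "wa" (IO wait) and "b" (blocked processes) columns during a spike
    top               # press 1 for per-CPU figures; a high %wa points at the disks
    iostat -x 1       # per-device utilisation and await times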

If I reboot the guest, fsck actually finds many more errors than the kernel reported while it was running.
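
A read-only fsck is also a quick way to see the real extent of the damage without changing anything. With the guest shut down, it can even be run from the host against the guest's LV (a rough sketch; the VG/LV names are placeholders for my setup, and kpartx is needed because the LV holds a whole partitioned virtual disk):

    kpartx -av /dev/VolGroup/kvm2_root        # map the partitions inside the virtual disk
    ls /dev/mapper/                           # find the mapped partition (the name varies)
    fsck.ext3 -fn /dev/mapper/kvm2_rootp4     # -f force a check, -n report only, fix nothing
    kpartx -d /dev/VolGroup/kvm2_root         # remove the mappings afterwards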

The longer the server runs, the more errors and the more CPU spikes appear, and of course the load grows too. The bigger KVM guest was serving around 100 websites, so the more websites we added, the more errors appeared. At some point I decided to bail out and moved everything to another server.

In all this time I got no errors whatsoever on the host system.

So, a few thoughts, as I have spent a huge amount of time trying to find the problem:

  1. My first thought was a hardware problem, but that didn't pan out: I have run memtest86+ with no errors (the RAM is ECC, so there are not even soft errors). I want to run some test against the RAID controller that simulates real file reads and writes across the whole filesystem (please recommend one; I have sketched what I have in mind after this list). In any case, the RAID diagnostic tools (it is a MegaRAID) do not report any errors.

  2. So more likely the problem is in the KVM setup. Could it be the 3TB partition? I have found reports of VirtIO having bugs and problems with large partitions; might that be the issue? But then why is the problem common to both guests, when only one of them has the big partition? Maybe VirtIO breaks at some point (when something near the end of the partition, beyond the 1TB mark, is accessed) and then starts destroying the filesystems on all VirtIO drives. Another interesting fact: the moment I shut down the guest with the 3TB drive, the CPU spikes on the host disappeared.
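
On the test I mentioned in point 1: the closest thing I have found to simulating real reads and writes across the whole array, while also verifying that the data read back matches what was written, is an fio run with verification (a sketch; the target directory, size and job count are just examples, and on CentOS 6 fio comes from EPEL). The MegaRAID error counters can be watched at the same time with MegaCli (the install path may differ). If anyone knows a better way to exercise a MegaRAID array, please share.

    # random read/write with CRC verification, bypassing the page cache
    fio --name=raidcheck --directory=/home --size=50G --numjobs=4 \
        --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=16 \
        --verify=crc32c --group_reporting

    # the controller's own view: media/other error counters per physical drive
    /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -i 'error count'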

I have kept the whole system so I can run tests and debug. Please share your thoughts and ideas on what to test in order to find the problem. And keep in mind, if you want to try the same configuration, that it might break.

Thanks for your time!

OK, here is an update, as I have run a few tests:

  1. As I said in the previous post, the RAM is fine: I have run memtest86+ for 17 hours with no errors.

  2. I restored the guests in the same configuration (virtual disks, caching, RAM, virtual processor cores). While rebuilding the filesystem on the 3TB partition with mkfs I saw the host's CPU load spike to 80! This is not normal. But no errors yet. Then I ran bonnie++ on this partition (see the sketch after this list). Again I got a load spike, and this time I got messages in the host system's dmesg about tasks blocked for more than 120 seconds, plus filesystem errors on the host machine! So this actually proves that whatever the problem is, once it kicks in it can destroy any filesystem, guest or host. I don't know whether the VirtIO driver is also involved in handling file operations on the host system.

  3. I removed everything and reinstalled CentOS 6.5 minimal on the machine, without Cloudmin or any KVM setup. Default LVM groups and partitions (200GB root and 3.7TB home). I did install Virtualmin GPL just to have it there (though I never believed it could have anything to do with this). Then I ran bonnie++ on /home again, with a file size of 128GB (it takes pretty long). No errors. The load stayed at 3-4 during the test, and the results (latency, IOPS, etc.) were pretty decent.

  4. What I am going to do now is rebuild the Cloudmin KVM setup with two guests again, but this time I will run the test first on the host, then on a small guest (a 200GB partition, for example), and then on the big guest (3TB partition). I will let you know the results.
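
For reference, the bonnie++ runs mentioned above look roughly like this (a sketch; the target directory and the user to run as are placeholders, and the exact flags can vary with the bonnie++ version):

    # -s size in MB (131072 MB = 128GB; should be at least twice the RAM),
    # -d target directory (must be writable by the chosen user),
    # -u user to run as (required when started as root),
    # -b disables write buffering: fsync() after every write
    bonnie++ -d /home/benchmark -s 131072 -u nobody -b

    # in another terminal, watch for hung-task and filesystem messages
    tail -f /var/log/messages | grep -iE 'blocked for more|EXT[34]-fs'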

OK, so here is what I have run and discovered so far:

I have rebuilt the KVM guest setup.

I run bonnie++ with the -b flag, which disables write buffering (it does an fsync() after every write).

Running the test on both the smaller (200GB) and the very big (3TB) partition produces a CPU load spike of 30-50 on the host. It is probably IO wait (the virtual disks have caching disabled and bonnie++ runs with write buffering disabled). I don't know whether this is normal.

For the smaller partitions I don't get any errors or warnings in dmesg. For the 3TB partition I get messages about filesystem operations blocked for more than 120 seconds. This doesn't look normal.
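
Those 120-second messages come from the kernel's hung task detector. The threshold is just a sysctl, so it can be checked or raised, but that only hides the warning; the underlying IO request is still taking that long:

    cat /proc/sys/kernel/hung_task_timeout_secs     # 120 by default on CentOS 6
    sysctl -w kernel.hung_task_timeout_secs=300     # silences the warning, does not fix the stall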

I couldn't break the filesystem this time, though; fsck still shows it as clean. Before, it took at least a few days for the filesystem errors to start appearing. I do occasionally get "hrtimer: interrupt took xxxxxx ns" in dmesg. Could this be the root of the problem?

So far there is no proof that the problem was actually the big GPT partition, and all the hardware tests seem fine. Any ideas on what to test next?