Hello,
I have a Cloudmin host running CentOS 6.5 whose 2 KVM guests are completely dead because the filesystems of both guests died.
I think this real-life experience would be interesting for you to know about.
Here is the setup: on a CentOS 6.5 machine (default minimal installation, just updated) with Cloudmin GPL (KVM, default settings) I installed two KVM guests, both CentOS 6.5 (again updated to the latest kernel). Both guests use LVM volumes for their virtual disks. The guests have the following configuration:
KVM1:
root - 20 GB LVM volume (cache:none)
swap - 8GB LVM volume (cache:none)
/home - 50GB LVM volume (cache:none)
RAM allocated: 2GB
KVM2:
root - 200 GB LVM volume (cache:none)
swap - 32GB LVM volume (cache:none)
/home - 3TB LVM volume (cache:none)
RAM allocated: 22GB
Both KVM guests have 8 virtual processors each:
Number of virtual CPUs: 8
Cores per socket: 8
The 3TB partition was created manually using gparted, as parted could only create a partition of up to 1TB (so Cloudmin gives an error).
KSM (the memory deduplication daemon) was turned off on the host system.
The whole setup was working pretty smoothly for a few weeks, then:
- First I noticed huge spikes in the CPU usage of the host system. The spikes lasted a few seconds each, going from 1.5 up to 7 or even 8 at the moments when the system died. At the time of a spike there was no CPU usage on the guest systems, so the spikes were only on the host machine.
- At some point I started getting filesystem errors on the KVM guests. The first few were harmless (I thought I had not shut the guests down correctly), then they got worse and worse, with more and more errors everywhere. I get errors like this in dmesg:
kernel: EXT3-fs error (device sda4): ext3_lookup: deleted inode referenced: 1679361
or the EXT4 equivalent, as the /home filesystems were ext4 (so it is not ext3 or ext4 specific).
If I reboot the system, there are actually many more errors than the kernel reported while it was running.
The longer the server runs, the more errors and the more CPU spikes there are. The same goes for load: the bigger KVM guest was serving around 100 websites, and the more websites we added, the more errors we got. At some point I decided to bail out and moved everything to another server.
During all this time I got no errors whatsoever on the host system.
So here are a few thoughts, as I spent a huge amount of time trying to find the problem:
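To compare how many of these errors each guest logs over time, something like the following could be used. It is only a minimal sketch in Python 3: the function name count_ext_errors is made up for illustration, and the pattern is based on the single dmesg line quoted above, so other message variants may need a different pattern.

```python
import re
from collections import Counter

# Pattern modelled on the line quoted above, e.g.:
#   kernel: EXT3-fs error (device sda4): ext3_lookup: deleted inode referenced: 1679361
# Assumption: ext4 messages follow the same "EXT4-fs error (device X): func:" shape.
EXT_ERROR = re.compile(r"(EXT[234]-fs) error \(device ([^)]+)\): (\w+):")


def count_ext_errors(lines):
    """Count ext filesystem errors per (filesystem, device, kernel function)."""
    counts = Counter()
    for line in lines:
        match = EXT_ERROR.search(line)
        if match:
            counts[match.groups()] += 1
    return counts
```

It takes any iterable of log lines, for example the saved output of dmesg opened as a text file, and the resulting counter can be printed with most_common() to see which device is hit hardest.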
- My first thought was a hardware problem, but this was not confirmed. I ran memtest86+ with no errors (the memory is ECC and there are not even soft errors). I want to run some test on the RAID controller (please recommend one) that will simulate real file reads and writes across the whole filesystem. In any case, the RAID (it is a MegaRAID) diagnostic tools do not report any errors.
- So more likely the problem is in the KVM setup. Might this be the 3TB partition? I have found reports of VirtIO having bugs and problems with large partitions. But why does the problem affect both guests when only one has a big partition? Maybe VirtIO crashes at some point (when you try to access something at the end of the partition, outside the 1TB zone) and then starts destroying the filesystems on all VirtIO drives. Another interesting fact is that as soon as I shut down the guest with the 3TB drive, the CPU spikes on the host disappeared.
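The kind of read/write test I have in mind for the RAID controller could look roughly like this. It is only a sketch in Python 3, not an established tool: the names write_files and verify_files are hypothetical, and the file count and size are arbitrary defaults. It writes files of random data, remembers their SHA-256 digests, and later re-reads them to see whether anything changed on disk.

```python
import hashlib
import os


def write_files(directory, count=100, size=1024 * 1024):
    """Write `count` files of random data; return {path: sha256 hex digest}."""
    digests = {}
    for i in range(count):
        path = os.path.join(directory, f"stress_{i:06d}.bin")
        data = os.urandom(size)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data down to the device
        digests[path] = hashlib.sha256(data).hexdigest()
    return digests


def verify_files(digests):
    """Re-read every file; return the paths whose contents no longer match."""
    bad = []
    for path, expected in digests.items():
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != expected:
            bad.append(path)
    return bad
```

One caveat: a verify pass run right after the write pass may be served from the page cache rather than from the disk, so the total amount written should exceed the machine's RAM, or the verification should be done after a reboot.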
I am keeping the whole system so I can run tests and debug. Please share your thoughts and ideas on what to test in order to find the problem. And keep this in mind if you want to try the same configuration, as it might break.
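One test I am considering for the "outside the 1TB zone" theory is to write a distinct marker block at offsets below and above 1 TiB on a scratch volume and read them all back. If high offsets wrapped around, a marker written at a low offset would come back overwritten. This is a rough Python 3 sketch under my own assumptions: probe_offsets is a hypothetical helper, the boundary is taken to be 1 TiB, and it DESTROYS whatever is stored at the probed offsets, so it must never be pointed at a volume holding data.

```python
import os

TIB = 1024 ** 4  # assumed boundary of the "1TB zone"


def probe_offsets(path, offsets, block_size=4096):
    """Write a unique marker block at each offset, then read them all back.

    Returns the offsets whose contents do not match their marker.
    WARNING: overwrites block_size bytes at every given offset.
    """
    def marker(offset):
        tag = f"offset={offset:020d};".encode()
        return (tag * (block_size // len(tag) + 1))[:block_size]

    fd = os.open(path, os.O_RDWR)
    try:
        # Write in the given order; list low offsets first so that a
        # wrapped high write would clobber an already written marker.
        for off in offsets:
            os.pwrite(fd, marker(off), off)
        os.fsync(fd)
        return [off for off in offsets
                if os.pread(fd, block_size, off) != marker(off)]
    finally:
        os.close(fd)
```

On the 3TB scratch volume it could be called with offsets such as [0, TIB - 4096, TIB, 2 * TIB]; an empty result means every marker was read back intact. As with any read-back test, the reads may be answered from cache, so repeating the read part after restarting the guest would be more convincing.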
Thanks for your time!