Four servers with SSH timeouts and VM crashes

Hi, this is my first post. I have been using Cloudmin for almost a year on 4 systems, with 4 VMs on each one. In the first months I constantly had problems with some VMs crashing daily. The VMs became very stable from approximately January 2013: 45 days without issues.

Suddenly, since last week, the problems have returned. For about 3 days I have been receiving “system down” alerts for almost all VMs on all systems. Most of them were not real outages but SSH timeouts. The problem is that some of those alerts were real, so I had to check the VMs constantly to know which ones were really down. I changed the timeout from 30 to 60 seconds to try to reduce the false alarms.
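To tell a VM that is really down from one that is only answering SSH slowly, a quick independent check can help while the alerts are unreliable. The Python sketch below only tests whether TCP port 22 accepts a connection within a given timeout; it is an assumption that this is a useful proxy, since it does not attempt an SSH login and may not match exactly how Cloudmin performs its own status check. The function name `ssh_port_open` is hypothetical.

```python
import socket


def ssh_port_open(host: str, port: int = 22, timeout: float = 60.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This only checks that something is listening on the port; it does not
    perform an SSH handshake or login (hypothetical helper, not a Cloudmin API).
    """
    try:
        # create_connection resolves the name and applies the timeout to the connect attempt
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, unreachable hosts and
        # socket.timeout, which are all subclasses of OSError
        return False
```

For example, `ssh_port_open("vm1.example.com", timeout=10)` (hypothetical hostname) returns `False` when nothing answers on port 22 within 10 seconds. A VM whose port opens quickly while Cloudmin still reports a timeout points more toward a slow SSH login on the guest than toward a crashed VM.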

The four servers have 4 to 8 cores and 4 to 8 GB of RAM, all with 500 GB hard drives in RAID-1. Each server has 4 VMs, each with 1 GB of RAM and a CPU limit of 50% to 250%. I’m using img files for the virtual disks with virtio, each about 10 to 50 GB. Some of the VMs are mail servers, others are web servers, all using Virtualmin. I’m trying to find a pattern that explains why some VMs are so unstable.

Today one of the VMs, a mail server, suddenly crashed, and its virtual hard drive required an fsck that took 1 hour. I’m worried about starting the week with all these alarms and crashes.

Please, I need some advice to find the root cause of these problems and solve them. Thanks.