server crashed and trying to figure out why

I’ve got Virtualmin running on a CentOS 5 box. This morning, the machine locked up and we had to hard reboot it. I’ve checked /var/log/messages, but it shows the machine working normally up until 12:39 am, and then nothing until we rebooted it.

We were able to see the Virtualmin usage graphs, which show memory and swap each climbing to 4GB in the hour preceding the freeze. It seems like a runaway script is a likely cause.

Any suggestions on how to find out what crashed, or what was eating up all that memory?

Howdy,

Debugging crashes can be a tough one!

If you aren’t seeing anything in the logs, I’m not sure of a way to get the data you’re after. But it might be possible to set some things up for future reference.

I’m a big fan of the tool “monit”:

http://www.tildeslash.com/monit/

You can set it up to monitor various aspects of your box, and optionally have it react to certain circumstances.

For example, you can have it watch to make sure Apache is always running. And if not, start Apache back up.

But you can go a step further and have it restart Apache if Apache ever takes up more than, say, 75% of your available memory.
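Here’s a rough sketch of what that check might look like in monit’s config file. The pidfile and init script paths below are just the usual CentOS defaults, so adjust them for your setup:

    check process apache with pidfile /var/run/httpd.pid
        start program = "/etc/init.d/httpd start"
        stop program  = "/etc/init.d/httpd stop"
        # restart Apache if it (and its children) ever use more
        # than 75% of the system's memory
        if totalmem > 75% for 2 cycles then restart

The “for 2 cycles” bit just keeps monit from restarting Apache over a single momentary spike.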

Similarly, you can have it monitor your memory or CPU as a whole. If your system ever has less than some percentage of memory available (let’s say 10%), you can have it email you an alert containing a process list (you’d have to use its exec option, and tell it to run ‘ps aux’ whenever the low-memory condition is met).
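Something along these lines, assuming a monit version with the system check; the hostname and script path are placeholders:

    check system myserver.example.com
        # alert when less than 10% of memory is free
        if memory usage > 90% then alert
        # also capture a process list for the post-mortem;
        # the script is whatever you like, e.g. a one-liner
        # that runs "ps aux" and mails you the output
        if memory usage > 90% then exec "/usr/local/bin/mem-report.sh"

You’d also need “set mailserver” and “set alert” directives in the config for the alert emails to actually go anywhere.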

While this seems unlikely in your case, another cause of odd lockups is power and heat issues. Those are also hard to diagnose, but regularly running the “sensors” tool from the lm_sensors package can help alert you to problems with your power supply, fans, and such. I have it running hourly from cron, and it notifies me if the output of “sensors” contains the text “ALARM”, which would happen if the fan RPMs were too low, or the power supply wasn’t providing enough juice (or was providing too much!)
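If it’s useful, that cron job is nothing fancy; a sketch like this (the recipient address is a placeholder):

    #!/bin/sh
    # /etc/cron.hourly/check-sensors -- mail the full "sensors"
    # output if any reading has tripped an ALARM
    OUTPUT=`sensors`
    if echo "$OUTPUT" | grep -q ALARM; then
        echo "$OUTPUT" | mail -s "sensors ALARM on `hostname`" root@example.com
    fi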

I hope that helps!
-Eric

Thanks! We’ll definitely take a look at monit.