I have been using Virtualmin GPL for almost two years without problems. Then, about a month ago, my Nagios box started alerting me that HTTP, SMTP and POP were not reachable on my Virtualmin box (it's behind a NAT firewall). On the LAN I could still ping the Virtualmin box, but could not SSH to it. The only thing I could do was a dirty shutdown. This continued to happen almost daily at random times, and in the messages log I could see entries about oom-killer. I had a 4GB swap partition and 1.5GB of RAM, and the server normally uses just over 500MB, so I was puzzled why oom-killer had been invoked.
So I decided to rebuild my Virtualmin box and move the domains over to get rid of the problem. The new server has 2GB of RAM and a 4GB swap partition. For almost a week the new server ran fine and I thought the problem had gone. Yesterday Nagios alerted me that SMTP, HTTP and POP had gone down again. I had to drive to the office and, sure enough, I could ping the server and get a reply but could not SSH in or get into Webmin.
I have attached the messages.log from before the server hung until after it restarted. I would be really grateful if someone could take a look at the file and point me in the right direction.
Well, if you’re seeing oom-killer messages, something is eating up your RAM.
Since you have a pretty decent amount of RAM there, that probably implies something unusual is going on.
One example of what the problem could be is large bursts of web traffic. If Apache keeps launching processes, that could easily use up all your RAM and swap.
I’d be curious to see the full oom-killer messages that are in your logs… they’ll typically contain hints as to what exactly they’re killing, so you know what’s causing the trouble.
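If it helps, here is a rough Python sketch for pulling those hints out of a messages log. It assumes the kernel writes lines containing "Killed process 1234 (httpd)" or "Kill process 1234 (httpd)"; the exact wording differs between kernel versions, so check a few lines of your own log and adjust the pattern. The function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed pattern: matches "Kill process 123 (name)" and "Killed process 123 (name)".
# The wording varies by kernel version, so adjust it to match your own log.
KILL_RE = re.compile(r"[Kk]ill(?:ed)? process (\d+),? \(([^)]+)\)")

def count_oom_victims(lines):
    """Tally how often each process name appears in oom-killer kill messages."""
    victims = Counter()
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            victims[match.group(2)] += 1
    return victims
```

You could feed it the log directly, for example `count_oom_victims(open("/var/log/messages", errors="replace")).most_common()`, assuming that is where your syslog writes kernel messages. A name that dominates the tally is a good first suspect, though the process that gets killed is not always the one that caused the shortage.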
That suggests that PHP or Apache may be the culprit.
Since PHP has a limit as to how much memory it can take up (32MB per process by default), it’s probably not PHP itself, but too many copies of PHP being spawned.
What I’d suggest doing is reviewing your logs, and the bandwidth usage in Virtualmin, to figure out where all the traffic is coming from, and where it’s going to.
But one thing you might want to do to reduce the chances of Apache/PHP causing problems is to turn down the number of Apache instances that can be spawned at once.
To do that, you can edit your Apache config and change “MaxClients” to something lower. It’s typically 150 by default; you might want to make it 50 or 75. Then restart Apache when you’re done.
I have edited the httpd.conf file in /etc/httpd/conf/. “MaxClients” was set to 256, so I have reduced that to 50.
Under Virtualmin/System Information, I just noticed that I do not have bandwidth listed in the right pane. I clicked on “configure this page” and the box to display bandwidth was ticked, but it's not showing. Is there a way I can make it come back? If it's any help, I am running 32-bit CentOS 5.5.
A way to see the full bandwidth listing would be to go into System Settings -> Bandwidth Monitoring, and click “Show Usage Graph”. That would give you a few different options for seeing all your usage.
And yeah, 256 for MaxClients may be a bit high… with the size of each Apache process, as well as the memory PHP takes up, that could certainly cause some trouble should all 256 become used. Lowering that should definitely help.
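To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 10 MB per Apache child and the 512 MB reserve are guesses for illustration only, and the 32 MB PHP figure is the per-process limit mentioned earlier (a ceiling, not typical usage). Measure real process sizes on your own box before trusting the result.

```python
def worst_case_mb(max_clients, apache_mb, php_mb):
    """Memory needed if every allowed Apache child is busy running PHP at its limit."""
    return max_clients * (apache_mb + php_mb)

def max_clients_for(ram_mb, reserved_mb, apache_mb, php_mb):
    """Largest MaxClients whose worst case fits in the RAM left after a reserve."""
    return max(1, (ram_mb - reserved_mb) // (apache_mb + php_mb))

# Assumed figures: 10 MB per Apache child (a guess) plus the 32 MB PHP limit.
print(worst_case_mb(256, 10, 32))          # 256 * 42 = 10752
print(worst_case_mb(50, 10, 32))           # 50 * 42 = 2100
print(max_clients_for(2048, 512, 10, 32))  # 1536 // 42 = 36
```

Under those assumptions, 256 clients could in the worst case ask for far more than 2GB of RAM plus 4GB of swap, while 50 lands close to the RAM you have.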
I thought that reducing the number of Apache instances had fixed my problem, but alas not.
As I was leaving the office last night my mobile phone received an alert from our Nagios server, saying that SMTP had gone down. So I quickly unlocked the office and logged onto my Virtualmin box to see what was going on.
It was still responding, but very slowly. All of the 4GB swap was used and only 48MB of the 2GB of RAM was free.
There were over 1000 processes running, so I took a look at the process list. Most of them belonged to one particular user running the command “perl new.txt”. There were loads of these, and the top 8 of them were using over 3GB of RAM/swap!
The other commands were running as the same user, but were “php -q haugzen.txt http://various.websites”
So I disabled this site, shut down the server, then restarted it. Since then it's been fine. Does anyone have an idea what might be happening?
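For anyone hitting the same symptoms, a tally of processes by user and command line makes this sort of runaway easy to spot. The sketch below parses the text output of `ps -eo user,rss,args` (standard procps options on Linux); the function names are made up for illustration. RSS only counts resident memory, so processes pushed out to swap will look smaller than they really are.

```python
import subprocess
from collections import defaultdict

def summarize_ps(ps_output):
    """Group `ps -eo user,rss,args` output by (user, command line).

    Returns {(user, args): [process_count, total_rss_kb]}.
    """
    totals = defaultdict(lambda: [0, 0])
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[1].isdigit():
            continue  # ignore blank or malformed rows
        user, rss_kb, args = parts
        entry = totals[(user, args.strip())]
        entry[0] += 1
        entry[1] += int(rss_kb)
    return dict(totals)

def top_groups(ps_output, limit=10):
    """Return the groups with the most processes, largest first."""
    groups = summarize_ps(ps_output)
    return sorted(groups.items(), key=lambda kv: kv[1][0], reverse=True)[:limit]

def live_ps():
    """Capture the current process list (needs Python 3.7+ for text=True)."""
    return subprocess.check_output(["ps", "-eo", "user,rss,args"], text=True)
```

Calling `top_groups(live_ps())` on the affected box should put something like the “perl new.txt” group at the top of the list.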
What you’re describing sounds like an exploit of some kind… what often happens is some sort of bot searches for older web app installations that contain some sort of security hole. They upload code, then execute it.
What I’d suggest doing is figuring out where new.txt resides, and using its location to determine which web app was broken into. Then, make sure that app is up to date.
It also wouldn’t be a bad idea to go through all the web apps on your server and make sure every one of them is the most recent version.
The account that was running all those commands was running a Zen shopping cart. I have tried searching for “perl new.txt” and haugzen.txt in their home folder but can't find either.
I am very reluctant to re-enable the site in case it happens again, and I'm not sure what to do.
You would only need to search for new.txt and haugzen.txt. In addition to looking in the homedir, you may also want to look in /dev/shm, /tmp, and /var.
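A minimal Python sketch of that search, in case it is handy. It only matches on exact file names, and the `/home` root in the usage line below is an assumption about where the account's home directory lives; swap in the real path.

```python
import os

def find_named(names, roots):
    """Walk each root and return the paths whose file name is in `names`."""
    wanted = set(names)
    hits = []
    for root in roots:
        # os.walk skips directories it cannot read unless onerror is given
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                if filename in wanted:
                    hits.append(os.path.join(dirpath, filename))
    return sorted(hits)
```

Run as root, something like `find_named(["new.txt", "haugzen.txt"], ["/home", "/dev/shm", "/tmp", "/var"])` would cover the locations mentioned above.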
Well, it’s possible that the file new.txt was deleted immediately after it was run… so although it showed up in the process list, the file might not actually exist.
As for how to clean the public_html dir… that’s no easy question. Unfortunately, the answer is to delete any files that don’t belong.
The tricky part is in identifying what files don’t belong with your web apps, and also to make sure you upgrade the web apps to be at their newest versions.
So I’d suggest reviewing all the files in your public_html and related dirs, and make sure they belong. Sometimes, the timestamps on the files/dirs can help you identify what was uploaded or modified recently.
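As a starting point for the timestamp check, here is a small Python sketch that lists files changed within the last few days, newest first. The function name and the `now` parameter are my own additions for illustration. Treat the output as a hint only, since an attacker can reset modification times.

```python
import os
import time

def recently_modified(root, days, now=None):
    """Return (mtime, path) pairs under `root` changed in the last `days` days, newest first."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    recent = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable
            if mtime >= cutoff:
                recent.append((mtime, path))
    return sorted(recent, reverse=True)
```

Pointing it at the affected account's public_html with a window that covers the date the trouble started should give a short list to review by hand.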
Sorry that it’s not easier. Cleaning up after a break-in is a pain!