My Server keeps hanging

willrendell · June 28, 2010, 8:52am

Hello

I have been using Virtualmin GPL for almost two years without problems. Then, about a month ago, my Nagios box started alerting me that HTTP, SMTP and POP were not reachable on my Virtualmin box (it’s behind a NAT firewall). On the LAN I could still ping the Virtualmin box, but could not SSH to it. The only thing I could do was a dirty shutdown. This continued to happen on an almost daily basis at random times, and in the messages log I could see entries about oom-killer. I had a 4GB swap partition and 1.5GB of RAM, and the server normally uses just over 500MB, so I was puzzled why oom-killer had been invoked.

So I decided to rebuild my Virtualmin box and move the domains over to get rid of the problem. The new server has 2GB of RAM and a 4GB swap partition. For almost a week the new server ran fine and I thought the problem had gone. Yesterday Nagios alerted me that SMTP, HTTP and POP had gone down again. I had to drive to the office, and sure enough I could ping and get a reply but could not SSH in or get into Webmin.

I have attached the messages.log covering the period from before the server hung until after it restarted. I would be really grateful if someone could take a look at the file and point me in the right direction.

Thanks

Will

Eric · June 28, 2010, 2:23pm

Howdy,

Well, if you’re seeing oom-killer messages, something is eating up your RAM.

Since you have a pretty decent amount of RAM there, that probably implies something unusual is going on.

One example of what the problem could be is large bursts of web traffic. If Apache keeps launching processes, that could easily use up all your RAM/swap.

I’d be curious to see the full oom-killer messages that are in your logs… they’ll typically contain hints as to what exactly they’re killing, so you know what’s causing the trouble.

-Eric

willrendell · June 28, 2010, 2:29pm

Thanks for the reply. Here is a section from yesterday’s log, I hope it helps.

Jun 27 15:13:34 ns1 named[6982]: network unreachable resolving ‘www.jvw.nl/A/IN’: 2001:828:100:2001:3::30#53
Jun 27 15:13:43 ns1 kernel: 166 pagecache pages
Jun 27 15:13:45 ns1 named[6982]: network unreachable resolving ‘ns2.netwired.be/A/IN’: 2001:6a8:3c60::be#53
Jun 27 15:13:53 ns1 kernel: Swap cache: add 5276686, delete 5276674, find 50686627/50997837, race 0+1387
Jun 27 15:13:59 ns1 named[6982]: network unreachable resolving ‘ns2.netwired.be/AAAA/IN’: 2001:6a8:3c60::be#53
Jun 27 15:14:03 ns1 kernel: Free swap = 0kB
Jun 27 15:14:07 ns1 kernel: Total swap = 4128760kB
Jun 27 15:14:07 ns1 kernel: Free swap: 0kB
Jun 27 15:14:07 ns1 named[6982]: network unreachable resolving ‘www.stamps-and-coins.nl/A/IN’: 2001:660:3005:1::1:2#53
Jun 27 15:14:44 ns1 kernel: 524272 pages of RAM
Jun 27 15:14:47 ns1 named[6982]: network unreachable resolving ‘www.robhoekstra.nl/A/IN’: 2001:500:2e::1#53
Jun 27 15:15:18 ns1 kernel: 294896 pages of HIGHMEM
Jun 27 15:16:00 ns1 kernel: 5418 reserved pages
Jun 27 15:16:04 ns1 named[6982]: network unreachable resolving ‘ns64.eukdns.com/A/IN’: 2001:503:231d::2:30#53
Jun 27 15:16:17 ns1 kernel: 116829 pages shared
Jun 27 15:16:55 ns1 kernel: 13 pages swap cached
Jun 27 15:16:55 ns1 kernel: 0 pages dirty
Jun 27 15:16:55 ns1 kernel: 0 pages writeback
Jun 27 15:17:11 ns1 kernel: 122 pages mapped
Jun 27 15:17:24 ns1 kernel: 21300 pages slab
Jun 27 15:17:24 ns1 kernel: 5891 pages pagetables
Jun 27 15:17:24 ns1 kernel: Out of memory: Killed process 29397, UID 521, (perl).
Jun 27 15:16:18 ns1 named[6982]: network unreachable resolving ‘ns64.eukdns.com/AAAA/IN’: 2001:503:231d::2:30#53
Jun 27 15:29:37 ns1 kernel: INFO: task automount:1938 blocked for more than 120 seconds.
Jun 27 15:30:01 ns1 kernel: “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message.
Jun 27 15:30:03 ns1 kernel: automount D 0000EC4B 2732 1938 1 1941 1937 (NOTLB)
Jun 27 15:30:03 ns1 kernel: f78b5f6c 00000086 176e2660 0000ec4b f78b5f04 00000001 f78b5f44 0000000a
Jun 27 15:30:06 ns1 kernel: f7c98550 176e6730 0000ec4b 000040d0 00000001 f7c9865c c20134c4 c23c8900
Jun 27 15:30:44 ns1 kernel: 00000000 f7c98550 c2013e64 c200c680 00000020 00000000 c20134c4 00000286
Jun 27 15:30:44 ns1 kernel: Call Trace:
Jun 27 15:30:44 ns1 kernel: [] wake_up_new_task+0x20b/0x213
Jun 27 15:30:44 ns1 kernel: [] rwsem_down_write_failed+0x126/0x141
Jun 27 15:30:44 ns1 kernel: [] .text.lock.rwsem+0x2b/0x3a
Jun 27 15:30:44 ns1 kernel: [] sys_mmap2+0x44/0xa3
Jun 27 15:30:44 ns1 kernel: [] syscall_call+0x7/0xb
Jun 27 15:30:44 ns1 kernel: =======================
Jun 27 16:13:56 ns1 kernel: php invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Jun 27 16:13:56 ns1 kernel: [] out_of_memory+0x72/0x1a3
Jun 27 16:13:56 ns1 kernel: [] __alloc_pages+0x24e/0x2cf
Jun 27 16:13:56 ns1 kernel: [] __do_page_cache_readahead+0xc4/0x183
Jun 27 16:13:56 ns1 kernel: [] filemap_nopage+0x157/0x34a
Jun 27 16:13:56 ns1 kernel: [] __handle_mm_fault+0x178/0xa25
Jun 27 16:14:46 ns1 kernel: [] do_page_fault+0x23a/0x52d
Jun 27 16:26:58 ns1 kernel: [] do_page_fault+0x0/0x52d
Jun 27 16:29:29 ns1 kernel: [] error_code+0x39/0x40
Jun 27 16:31:49 ns1 kernel: =======================
Jun 27 16:35:25 ns1 kernel: Mem-info:
Jun 27 16:35:25 ns1 kernel: DMA per-cpu:
Jun 27 16:35:25 ns1 kernel: cpu 0 hot: high 0, batch 1 used:0
Jun 27 16:35:25 ns1 kernel: cpu 0 cold: high 0, batch 1 used:0
Jun 27 16:34:42 ns1 named[6982]: network unreachable resolving ‘sns-pb.isc.org/A/IN’: 2001:500:48::1#53
Jun 27 16:35:25 ns1 kernel: cpu 1 hot: high 0, batch 1 used:0
Jun 27 16:36:05 ns1 kernel: cpu 1 cold: high 0, batch 1 used:0
Jun 27 16:36:05 ns1 kernel: DMA32 per-cpu: empty
Jun 27 16:36:05 ns1 kernel: Normal per-cpu:
Jun 27 16:36:05 ns1 kernel: cpu 0 hot: high 186, batch 31 used:58
Jun 27 16:36:05 ns1 kernel: cpu 0 cold: high 62, batch 15 used:55
Jun 27 16:36:05 ns1 kernel: cpu 1 hot: high 186, batch 31 used:5
Jun 27 16:36:05 ns1 kernel: cpu 1 cold: high 62, batch 15 used:21
Jun 27 16:36:05 ns1 kernel: HighMem per-cpu:
Jun 27 16:35:27 ns1 named[6982]: network unreachable resolving ‘sns-pb.isc.org/AAAA/IN’: 2001:500:48::1#53
Jun 27 16:36:05 ns1 kernel: cpu 0 hot: high 186, batch 31 used:15
Jun 27 16:37:59 ns1 kernel: cpu 0 cold: high 62, batch 15 used:61
Jun 27 16:37:59 ns1 kernel: cpu 1 hot: high 186, batch 31 used:28
Jun 27 16:37:59 ns1 kernel: cpu 1 cold: high 62, batch 15 used:57
Jun 27 16:38:30 ns1 kernel: Free pages: 49272kB (464kB HighMem)
Jun 27 16:38:30 ns1 kernel: Active:233140 inactive:229305 dirty:0 writeback:0 unstable:0 free:12318 slab:15570 mapped-file:92 mapped-anon:478778 pagetables:9415
Jun 27 16:38:30 ns1 kernel: DMA free:8192kB min:68kB low:84kB high:100kB active:2276kB inactive:1424kB present:16384kB pages_scanned:5653 all_unreclaimable? yes
Jun 27 16:38:30 ns1 kernel: lowmem_reserve[]: 0 0 880 2031
Jun 27 16:38:30 ns1 kernel: DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Jun 27 16:38:30 ns1 kernel: lowmem_reserve[]: 0 0 880 2031
Jun 27 16:38:30 ns1 kernel: Normal free:40616kB min:3756kB low:4692kB high:5632kB active:430904kB inactive:333600kB present:901120kB pages_scanned:4028418 all_unreclaimable? yes
Jun 27 16:41:42 ns1 named[6982]: network unreachable resolving ‘e.gtld-servers.net/AAAA/IN’: 2001:503:a83e::2:31#53
Jun 27 16:41:45 ns1 kernel: lowmem_reserve[]: 0 0 0 9215
Jun 27 16:43:08 ns1 kernel: HighMem free:464kB min:512kB low:1740kB high:2972kB active:499252kB inactive:582324kB present:1179584kB pages_scanned:5505978 all_unreclaimable? yes
Jun 27 16:43:08 ns1 kernel: lowmem_reserve[]: 0 0 0 0
Jun 27 16:43:11 ns1 kernel: DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8192kB
Jun 27 16:43:11 ns1 kernel: DMA32: empty
Jun 27 16:43:11 ns1 kernel: Normal: 636*4kB 335*8kB 1428*16kB 160*32kB 4*64kB 2*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 1*4096kB = 40616kB
Jun 27 16:43:11 ns1 kernel: HighMem: 16*4kB 2*8kB 6*16kB 1*32kB 0*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 464kB
Jun 27 16:43:13 ns1 named[6982]: network unreachable resolving ‘j.gtld-servers.net/AAAA/IN’: 2001:503:a83e::2:31#53
Jun 27 16:43:17 ns1 kernel: 253 pagecache pages
Jun 27 16:45:04 ns1 named[6982]: network unreachable resolving ‘ns2.demon.net/A/IN’: 2001:500:2f::f#53
Jun 27 16:45:23 ns1 kernel: Swap cache: add 6807376, delete 6807285, find 54729767/55185409, race 23+5120
Jun 27 16:49:22 ns1 named[6982]: network unreachable resolving ‘www.suprasport.nl/AAAA/IN’: 2a00:d78:0:102:193:176:144:2#53
Jun 27 16:49:29 ns1 kernel: Free swap = 0kB
Jun 27 16:49:49 ns1 kernel: Total swap = 4128760kB
Jun 27 16:49:56 ns1 kernel: Free swap: 0kB
Jun 27 16:49:52 ns1 named[6982]: network unreachable resolving ‘ns2.demon.net/AAAA/IN’: 2001:500:2f::f#53
Jun 27 16:49:56 ns1 kernel: 524272 pages of RAM
Jun 27 17:24:03 ns1 syslogd 1.4.1: restart.

Eric · June 28, 2010, 2:42pm

Okay, an interesting thing I see there is this:

kernel: php invoked oom-killer

That suggests that PHP or Apache may be the culprit.
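
For a longer log it can help to pull out just the oom-killer lines rather than reading the whole thing. Here is a rough Python sketch that looks for the two message shapes visible in the log above (“X invoked oom-killer” and “Out of memory: Killed process …”). The regular expressions are only written against those two lines, so other kernel versions may word things differently, and the function name is made up for this example.

```python
import re

# Patterns written against the two kernel messages quoted in this thread;
# other kernel versions may format these lines differently.
INVOKED = re.compile(r"kernel: (\S+) invoked oom-killer")
KILLED = re.compile(r"Out of memory: Killed process (\d+), UID (\d+), \(([^)]+)\)")


def summarize_oom(lines):
    """Return (invokers, victims) found in an iterable of syslog lines."""
    invokers, victims = [], []
    for line in lines:
        match = INVOKED.search(line)
        if match:
            invokers.append(match.group(1))
        match = KILLED.search(line)
        if match:
            victims.append({
                "pid": int(match.group(1)),
                "uid": int(match.group(2)),
                "command": match.group(3),
            })
    return invokers, victims


# Two lines copied from the log excerpt above.
sample = [
    "Jun 27 15:17:24 ns1 kernel: Out of memory: Killed process 29397, UID 521, (perl).",
    "Jun 27 16:13:56 ns1 kernel: php invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0",
]
invokers, victims = summarize_oom(sample)
print(invokers)  # ['php']
print(victims)   # [{'pid': 29397, 'uid': 521, 'command': 'perl'}]
```

The UID on the “Killed process” line is worth noting, since it points at the account whose process was using the memory.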

Since PHP has a limit as to how much memory it can take up (32MB per process by default), it’s probably not PHP itself, but too many copies of PHP being spawned.

What I’d suggest doing is reviewing your logs, and the bandwidth usage in Virtualmin, to figure out where all the traffic is coming from, and where it’s going to.

But one thing you might want to do to reduce the chances of Apache/PHP causing problems is to turn down the number of Apache instances that can be spawned at once.

To do that, you can edit your Apache config and change “MaxClients” to something lower. It’s typically 150 by default; you might want to make it 50 or 75. Then restart Apache when you’re done.
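
As a back-of-the-envelope check on those numbers, here is a small Python sketch. It assumes every Apache child can grow to the 32MB PHP limit mentioned above and that roughly 512MB is kept back for everything else on a 2GB box. Both figures are assumptions for illustration; real Apache processes may be smaller or larger, so measure on your own server before trusting the result. The function names are made up for this example.

```python
def worst_case_mb(max_clients, per_process_mb):
    """Memory needed if every allowed Apache child reaches per_process_mb."""
    return max_clients * per_process_mb


def suggested_max_clients(ram_mb, reserved_mb, per_process_mb):
    """Largest MaxClients whose worst case still fits in RAM after
    reserving reserved_mb for the rest of the system."""
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    return max(1, (ram_mb - reserved_mb) // per_process_mb)


# Assumed figures: 2GB RAM, 512MB reserved, 32MB per Apache/PHP process.
print(worst_case_mb(150, 32))                # 4800 (MB), more than twice the RAM
print(suggested_max_clients(2048, 512, 32))  # 48
```

With those assumptions the result lands close to the 50 suggested above, which is why a value in that range is a reasonable starting point.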

-Eric

willrendell · June 28, 2010, 2:44pm

Thanks, I will do that now and let you know.

Will

willrendell · June 28, 2010, 3:27pm

I have edited the httpd.conf file in /etc/httpd/conf/. “MaxClients” was set to 256, so I have reduced that to 50.

Under Virtualmin/System Information, I just noticed that I do not have bandwidth listed in the right pane. I clicked on “configure this page” and the tick was in the box to display bandwidth, but it’s not showing. Is there a way I can make it come back? If it’s any help, I am running 32-bit CentOS 5.5.

Thanks

Will

Eric · June 28, 2010, 4:05pm

A way to see the full bandwidth listing would be to go into System Settings -> Bandwidth Monitoring, and click “Show Usage Graph”. That would give you a few different options for seeing all your usage.

And yeah, 256 for MaxClients may be a bit high… with the size of each Apache process, as well as the memory PHP takes up, that could certainly cause some trouble should all 256 become used. Lowering that should definitely help.

-Eric

willrendell · July 3, 2010, 1:21pm

I thought that reducing the number of Apache instances had fixed my problem, but alas not.

As I was leaving the office last night my mobile phone received an alert from our Nagios server, saying that SMTP had gone down. So I quickly unlocked the office and logged onto my Virtualmin box to see what was going on.

It was still responding but very slowly. All of the 4GB swap was used and only 48MB of the 2GB RAM was free.

There were over 1000 processes running, so I took a look at the process list. Most of them belonged to one particular user running the command “perl new.txt”. There were loads of these, and the top 8 of them were using over 3GB of RAM/swap!
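
For anyone hitting the same thing, a quick way to spot this kind of pile-up is to total memory per user and command rather than scrolling through 1000 processes. Below is a rough Python sketch that takes the text output of `ps -eo user,rss,args` and groups it. The sample rows and the user name “shopuser” are invented for illustration, and RSS only counts resident memory, so anything already pushed out to swap will not show up in these totals.

```python
from collections import defaultdict


def memory_by_user_command(ps_text):
    """Group `ps -eo user,rss,args` output by (user, command line).

    Returns a list of ((user, args), [process_count, total_rss_kb]),
    sorted with the largest total first.
    """
    totals = defaultdict(lambda: [0, 0])
    for line in ps_text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[1].isdigit():
            continue  # ignore rows that do not look like "user rss args"
        user, rss_kb, args = parts[0], int(parts[1]), parts[2].strip()
        entry = totals[(user, args)]
        entry[0] += 1
        entry[1] += rss_kb
    return sorted(totals.items(), key=lambda item: item[1][1], reverse=True)


# Invented sample in the shape ps would print; "shopuser" is hypothetical.
sample_ps = """\
USER        RSS COMMAND
shopuser 400000 perl new.txt
shopuser 380000 perl new.txt
shopuser  12000 php -q haugzen.txt
root       3000 /usr/sbin/sshd
"""

for (user, args), (count, rss_kb) in memory_by_user_command(sample_ps):
    print(user, count, rss_kb, args)
```

On a live box the text could come from `subprocess.run(["ps", "-eo", "user,rss,args"], capture_output=True, text=True).stdout` on a reasonably recent Python.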

The other commands that were running were for the same user but “php -q haugzen.txt http://various.websites”

So I disabled this site, shut down the server and then restarted it. Since then it’s been fine. Does anyone have an idea what might be happening?

Thanks

Will

Eric · July 3, 2010, 1:26pm

Howdy,

What you’re describing sounds like an exploit of some kind… what often happens is some sort of bot searches for older web app installations that contain some sort of security hole. They upload code, then execute it.

What I’d suggest doing is figuring out where new.txt resides, and trying to use its location to determine which web app was broken into. Then, make sure that it’s up to date.

It also wouldn’t be a bad idea to browse all the web apps on your server and make sure all of them are the most recent version.

-Eric

willrendell · July 3, 2010, 2:02pm

Thanks for your quick reply

The account that was running all those commands was running a Zen shopping cart. I have tried searching for “perl new.txt” and “haugzen.txt” in their home folder but can’t find either.

I am very reluctant to re-enable the site in case it happens again, and I’m not sure what to do.

Will

Eric · July 3, 2010, 2:25pm

You would only need to search for new.txt and haugzen.txt. In addition to looking in the homedir, you may also want to look in /dev/shm, /tmp, and /var.
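
If it helps, the search can be scripted so every location gets checked the same way. This is only a sketch in Python: the function name is made up, and it matches on exact file names, so a renamed copy of the script would be missed.

```python
import os


def find_by_name(roots, names):
    """Yield the path of every file under `roots` whose name is in `names`.

    Directories that cannot be read are skipped silently, which is
    os.walk's default behaviour.
    """
    wanted = set(names)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                if filename in wanted:
                    yield os.path.join(dirpath, filename)


# Example call (run as root so unreadable directories are not skipped):
#   for path in find_by_name(["/dev/shm", "/tmp", "/var"],
#                            ["new.txt", "haugzen.txt"]):
#       print(path)
```

The account’s home directory can be added to the list of roots as well.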

-Eric

willrendell · July 3, 2010, 4:17pm

I have found haugzen.txt and 19 other files owned by the problem account in /tmp, but can’t find new.txt in any of the locations you suggested.

Do you think I should delete the files in /tmp and what would be the best way to reset the public_html dir to a clean condition?

Thanks

Will

Eric · July 3, 2010, 8:06pm

Well, it’s possible that the file new.txt was deleted immediately after it was run… so although it showed up in the process list, the file might not actually exist.

As far as how to clean the public_html dir… that’s no easy question… Unfortunately, the answer is to delete any files that don’t belong.

The tricky part is in identifying what files don’t belong with your web apps, and also to make sure you upgrade the web apps to be at their newest versions.

So I’d suggest reviewing all the files in your public_html and related dirs, and make sure they belong. Sometimes, the timestamps on the files/dirs can help you identify what was uploaded or modified recently.
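
To make the timestamp check less tedious, something like the following Python sketch can list whatever changed recently under a directory. The function name and the 7-day window are arbitrary choices for illustration. Keep in mind that an attacker can reset modification times, so an empty result does not prove the directory is clean.

```python
import os
import time


def modified_since(root, days, now=None):
    """Return (mtime, path) pairs for files under `root` modified within
    the last `days` days, newest first."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable
            if mtime >= cutoff:
                hits.append((mtime, path))
    return sorted(hits, reverse=True)


# Example call; replace the path with the real public_html directory:
#   for mtime, path in modified_since("/path/to/public_html", 7):
#       print(time.ctime(mtime), path)
```

Comparing that list against a fresh copy of the web app’s release files is usually the next step.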

Sorry that it’s not easier; cleaning up after a break-in is a pain!

-Eric