Server crashed, out of memory, where to start looking?

pcm2a · August 1, 2009, 12:30am

Hey all. I have been running Virtualmin 100% trouble free for about a month now. This is running on a Ubuntu VMWare guest OS with a 2.2ghz CPU and 1 gig (out of 4gigs) of ram set aside for it.

Yesterday I set up 4 jobs. A weekly full backup, a daily incremental backup, a weekly full ftp backup, and a weekly incremental ftp backup. I ran the two full jobs and set the incremental to run at 0 and 2 am. Both jobs ran fine and emailed me the results.

Around 11:30 my whole server has crashed due to an out of memory problem. I have looked at the /var/log/kern.log and I see that the problem started around 8:30 am. This is what some of the log looks like:


Jul 31 08:27:12 kernel: [127533.320750] apache2 invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
Jul 31 08:29:01 kernel: [127533.320906] Pid: 29284, comm: apache2 Tainted: GF       2.6.24-23-server #1
Jul 31 08:30:34 kernel: [127533.320926]
Jul 31 08:30:36 kernel: [127533.320954] Call Trace:
Jul 31 08:31:17 kernel: [127533.321756]  [oom_kill_process+0xf6/0x110] oom_kill_process+0xf6/0x110
Jul 31 08:34:24 kernel: [127533.321777]  [out_of_memory+0x15e/0x270] out_of_memory+0x15e/0x270
Jul 31 08:34:26 kernel: [127533.321785]  [__alloc_pages+0x399/0x3d0] __alloc_pages+0x399/0x3d0
Jul 31 08:34:26 kernel: [127533.321795]  [__do_page_cache_readahead+0xe0/0x210] __do_page_cache_readahead+0xe0/0x210
Jul 31 08:34:26 kernel: [127533.321810]  [filemap_fault+0x28e/0x390] filemap_fault+0x28e/0x390
Jul 31 08:34:26 kernel: [127533.321817]  [__do_fault+0x6c/0x440] __do_fault+0x6c/0x440
Jul 31 08:34:26 kernel: [127533.321826]  [handle_mm_fault+0x1c1/0x800] handle_mm_fault+0x1c1/0x800
Jul 31 08:34:26 kernel: [127533.321864]  [do_page_fault+0x1ab/0x840] do_page_fault+0x1ab/0x840
Jul 31 08:34:26 kernel: [127533.321875]  [thread_return+0x3a/0x59a] thread_return+0x3a/0x59a
Jul 31 08:34:26 kernel: [127533.321888]  [error_exit+0x0/0x51] error_exit+0x0/0x51
Jul 31 08:34:26 kernel: [127533.321921]
Jul 31 08:34:26 kernel: [127533.321954] Mem-info:
Jul 31 08:34:26 kernel: [127533.321987] Node 0 DMA per-cpu:
Jul 31 08:34:26 kernel: [127533.322038] CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
Jul 31 08:34:26 kernel: [127533.322066] Node 0 DMA32 per-cpu:
Jul 31 08:34:26 kernel: [127533.322069] CPU    0: Hot: hi:  186, btch:  31 usd: 133   Cold: hi:   62, btch:  15 usd:  14
Jul 31 08:34:26 kernel: [127533.322119] Active:233447 inactive:987 dirty:0 writeback:0 unstable:0
Jul 31 08:34:26 kernel: [127533.322120]  free:1474 slab:4439 mapped:8 pagetables:9530 bounce:0
Jul 31 08:34:26 kernel: [127533.322146] Node 0 DMA free:1944kB min:40kB low:48kB high:60kB active:0kB inactive:0kB present:10484kB pages_scanned:0 all_unreclaimable? yes
Jul 31 08:34:26 kernel: [127533.322186] lowmem_reserve[]: 0 994 994 994

What other logs can I look at to try to trace this problem? The only reason I have my eye on the backup process is because it is the only single thing I have changed, and I did it yesterday.

I’m going to leave them enabled for tonight to see if things bomb out tomorrow again.

Thanks for any pointers.

pcm2a · August 1, 2009, 1:48am

Found the answer to this second post, so clearing it out.

Eric · August 1, 2009, 1:59am

Well, it’s unclear as to why this happened, but the above suggests that Apache was killed by the oom-killer process.

Do you have any swap setup? You’d probably want a half decent amount of that available on your box.

You could also setup a cron job to email you a “ps auxw” list throughout the night so you can have a decent idea of what’s going on.

-Eric

pcm2a · August 1, 2009, 2:08am

I’ll set that cron job of ps -auxw for sure, great idea. I know there was some default swap stuff that I set up when I installed Ubuntu server, but I do not remembe the exact size I set up. How can I check that to give you the exact details?

Thanks for the quick response by the way!

Eric · August 1, 2009, 2:16am

Sure… what does the output of the command “free” show?

-Eric

pcm2a · August 1, 2009, 4:01am

Previously free showed 1 gig of actual memory and almost 1 gig of memory in use. There was about 45 megs of available memory. At first I was alarmed by this, it is what my deleted post above was about. However, I think that the memory must not be accurately reported for virtual machines.

This is free ran on my dedicated server that I am trying to move away from:
total used free shared buffers cached
Mem: 1026528 1006692 19836 0 168212 408104

See how it says 1 gig used and 19 megs free? Obviously that is not being reported correctly since that server has been online and running great for a very long time.

With that said, my vmware server access says that 81megs is being used which seems more likely.

I also increased the guest os from 1 gig of ram to 2 gigs of ram:
total used free shared buffers cached
Mem: 2062920 762672 1300248 0 27064 176292

How can I see my current swap filesystem information?

Eric · August 1, 2009, 4:12am

Well, the “free” command normally shows a few lines of output, and includes regular memory as well as swap. This is an example:


Mem:       1474776    1402460      72316          0      26400     455260
-/+ buffers/cache:     920800     553976
Swap:      2047992         68    2047924

In the case of the above, it’s showing 1402460 of 1474776 as being used; except, the second line clarifies this by subtracting the buffers and cache from that, which aren’t actually in use by any processes running on the system.

So actually, there’s 920800 of regular RAM available.

The last line is how much swap is available.

If you don’t have a line mentioning swap – can you paste in the contents of your /etc/fstab file?

-Eric

pcm2a · August 1, 2009, 5:23am

You’re right, I have the other lines I just wasn’t thinking to paste them in here

total used free shared buffers cached Mem: 2062920 894988 1167932 0 44028 244868 -/+ buffers/cache: 606092 1456828 Swap: 0 0 0

At the time I’m pasting this in here the VMWare console shows around 163mb of ram in use.

Also, here is the fstab file:

[code]# /etc/fstab: static file system information.

proc /proc proc defaults 0 0

/dev/sda1

UUID=08c85ab4-91f6-4b4e-8a78-d73bf4f2bb81 / ext3 grpquota,errors=remount-ro,usrquota,relatime,rw 0 1

/dev/sda5

UUID=8175cf43-1ce5-4df9-89c5-67d4a2e4f628 none swap sw 0 0
/dev/scd0 /media/cdrom0 udf,iso9660 user,noauto,exec,utf8 0 0
/dev/fd0 /media/floppy0 auto rw,user,noauto,exec,utf8 0 0

Beginning of the block added by the VMware software

.host:/ /mnt/hgfs vmhgfs defaults,ttl=5 0 0

End of the block added by the VMware software

#Mount backup archive
//192.168.10.25/Backups /backup/archive cifs auto,username=######,password=#####,workgroup=workgroup,rw 0 0
[/code]

pcm2a · August 1, 2009, 4:44pm

Things are still up and running. Maybe we will never know or maybe it has to do with me increasing the VM memory to 2 gigs. Currently VMConsole says 122mb is in use.

The free command is reporting this:

total used free shared buffers cached Mem: 2062920 1698448 364472 0 81020 978792 -/+ buffers/cache: 638636 1424284 Swap: 0 0 0

I’m not exactly sure how to do the calculations on these free numbers to get the real amount.

Nothing abnormal to report in /var/log/syslog.
Nothing abnormal in my ps -auxw.

I have my script running every 15 minutes and spitting the ps -auxw into a folder dated today and pruning files more than 2 days old. I’ll leave it running indefinitely until I am sure this isn’t going to happen again.

Eric · August 1, 2009, 4:47pm

Well, it does appear as if you don’t have any swap enabled; so any sort of spike in usage that used up your available RAM could have caused the problem you had.

If you had created a swap partition, I don’t see anything in your fstab that’s enabling it.

I would recommend having swap, as spikes can happen; and it’s probably just a matter of time before your 2GB there runs out

-Eric

pcm2a · August 1, 2009, 7:11pm

I’ll run gparted and take a screen shot of what the partitions look like. I’m sure I created one…I hope.

pcm2a · August 1, 2009, 8:16pm

Ok, here is a link to a gparted screen shot:
http://img31.imagefra.me/img/img31/2/8/1/pcm2a/f_jr8q6wtm_873f2a2.jpg

/dev/hda1, ext3, 77.91gig, 3,83 used
/dev/hda2 extended 2.08gig
    /dev/hda5, linux-swap, 2.08 gig.

This was set up automatically during the install of Ubuntu server. How do I get Ubuntu to start using this swap space?

Also here is a fdisk -l:

[code]sudo fdisk -l
[sudo] password for pcm2a:

Disk /dev/sda: 85.8 GB, 85899345920 bytes
255 heads, 63 sectors/track, 10443 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x0008af06

Device Boot Start End Blocks Id System
/dev/sda1 * 1 10171 81698526 83 Linux
/dev/sda2 10172 10443 2184840 5 Extended
/dev/sda5 10172 10443 2184808+ 82 Linux swap / Solaris
[/code]

pcm2a · August 1, 2009, 8:25pm

Ok how about this one. I replaced the UUID lines in my /etc/fstab with /dev/sda1 and /dev/sda5 and rebooted.

[code]#UUID=08c85ab4-91f6-4b4e-8a78-d73bf4f2bb81 / ext3 grpquota,errors=remount-ro,usrquota,relatime,rw 0 1
/dev/sda1 / ext3 grpquota,errors=remount-ro,usrquota,relatime,rw 0 1

/dev/sda5

#UUID=8175cf43-1ce5-4df9-89c5-67d4a2e4f628 none swap sw 0 0
/dev/sda5 none swap sw 0 0
/dev/scd0 /media/cdrom0 udf,iso9660 user,noauto,exec,utf8 0 0
[/code]

Now look when I run free:

total used free shared buffers cached Mem: 2062920 454648 1608272 0 4996 149740 -/+ buffers/cache: 299912 1763008 Swap: 2184800 0 2184800

I think that’s turned the swap on right? Should I increase my swap partition to 4 gigs since my current ram is at 2 gigs?

Eric · August 1, 2009, 9:03pm

Ahh, nice work, that did indeed enable about 2GB of swap.

Should you bump it to 4GB? Well, I’m inclined to think you’re okay with your current setup, but I guess you just need to keep an eye out for how much RAM/swap you’re using.

Have a good one,

-Eric

pcm2a · August 2, 2009, 3:24pm

Thanks for all the help. I went ahead and repartitioned the swap to 4 gigs to make it double the ram amount. At some point will I see the swap memory being used with the “free” program?

total used free shared buffers cached Mem: 2062920 1598724 464196 0 97696 880596 -/+ buffers/cache: 620432 1442488 Swap: 4281280 0 4281280

Why does it report 1.59gigs of memory used when there is barely anything actively running on the box? Certainly nothing taking up 1.59 gigs of ram. Is this just 1.59 gigs of ram that has been used in the past, but isn’t active right now? VMware reports around 150mb of ram being actively used.

Maybe I’m supposed to be subtracting something from the “buffers” line?

ronald · August 2, 2009, 5:34pm

Why does it report 1.59gigs of memory used
It isn’t using it. It is holding it in cache, so the programs that used it is available immediately from the RAM. Other programs can use that RAM if needed.

As long as the RAM is sufficient, it won’t use the swap. It is mainly used as burstable RAM. Temporarily used if 2 GB isn’t enough.

When you have so many things going on, that your box needs to use swap for longer periods of time, then it is wise to add more RAM to the box.

tuaris · August 8, 2009, 8:40am

I had a similar issue on a similar configuration around the same time, very weird. My problem was a couple of items. First I only had 512 MB of RAM. Secondly, I found that the hard drives I was running the Virtual Machine off of were creating a serious bottleneck. I would run out of memory, and disk swapping would become so slow that the machine would come to a halt, and sometimes crash.

I had both AWStats and Webalizer on the machine for each domain, It turns out those stats where slowing things down and using up a lot of ram when they run. They were also causing some problems with duplicate emails due to the fact that Postfix would time out and resend the message to Amavisd. By adding more RAM and moving the VM to better performance disk and interface, I was able to solve the issue.

Virtual machines will do best with at least 250 to 300MB/s of read/write throughput. Avoid RAID 5 whenever possible. Use a RAID 1 or 0+1 setup instead.