Linux oom situation (32 bit kernel)

Solution 1:

A 'sledgehammer' approach would be to upgrade to a 64-bit O/S (this one is 32-bit), because the zones are laid out differently there.

OK, so here I will attempt to answer why you have experienced an OOM here. There are a number of factors at play here.

  • The order size of the request and how the kernel treats certain order sizes.
  • The zone being selected.
  • The watermarks this zone uses.
  • Fragmentation in the zone.

If you look at the OOM report itself, there is clearly lots of free memory available, yet OOM-killer was invoked. Why?


The order size of the request and how the kernel treats certain order sizes

The kernel allocates memory by order. An 'order' describes a region of contiguous RAM that must be available in one piece for the request to succeed. Orders are powers of two (thus the name 'order'), with a block size of 2^(ORDER + 12) bytes on a system with 4096-byte pages. So order 0 is 4096 bytes, order 1 is 8192, order 2 is 16384, and so on.
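
As a quick illustration of that formula, here is a minimal sketch. It assumes 4 KiB pages (a page shift of 12, as on x86); the function name is made up for this example.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, as on x86


def order_size_bytes(order: int, page_shift: int = PAGE_SHIFT) -> int:
    """Size in bytes of one contiguous block of the given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return 1 << (order + page_shift)


# Print the block size for the first few orders.
for order in range(5):
    print(f"order {order}: {order_size_bytes(order)} bytes")
```

With these assumptions order 3 comes out at 32768 bytes (32kB) and order 4 at 65536 bytes (64kB).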

The kernel has a hard-coded value for what it considers a 'high order' (> PAGE_ALLOC_COSTLY_ORDER). This is order 4 and above (64kB or above is a high order).

High orders are satisfied differently from low orders. If a high-order allocation fails to grab the memory, a modern kernel will:

  • Try to run the memory compaction routine to defragment the memory.
  • Never call OOM-killer to satisfy the request.

Your order size is listed here

Dec 27 09:19:05 2013 kernel: : [277622.359064] squid invoked oom-killer: gfp_mask=0x42d0, order=3, oom_score_adj=0

Order 3 is the highest of the low-order requests and (as you see) invokes OOM-killer in an attempt to satisfy it.
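
If you want to pull these values out of a log mechanically, something like the following sketch would do. The regular expression is written against the log line quoted above; the function name and the returned keys are my own, and PAGE_ALLOC_COSTLY_ORDER is taken to be 3, as described earlier.

```python
import re

PAGE_ALLOC_COSTLY_ORDER = 3  # orders above this count as 'high order'

OOM_HEADER = re.compile(
    r"(?P<comm>\S+) invoked oom-killer: gfp_mask=(?P<gfp>0x[0-9a-fA-F]+), "
    r"order=(?P<order>\d+)"
)


def parse_oom_header(line: str) -> dict:
    """Extract process name, gfp mask and order from an oom-killer header line."""
    m = OOM_HEADER.search(line)
    if m is None:
        raise ValueError("not an oom-killer header line")
    order = int(m.group("order"))
    return {
        "comm": m.group("comm"),
        "gfp_mask": int(m.group("gfp"), 16),
        "order": order,
        "size_kb": 4 << order,  # assumes 4kB pages
        "high_order": order > PAGE_ALLOC_COSTLY_ORDER,
    }


line = ("Dec 27 09:19:05 2013 kernel: : [277622.359064] squid invoked "
        "oom-killer: gfp_mask=0x42d0, order=3, oom_score_adj=0")
info = parse_oom_header(line)
print(info)
```

For the line above this reports an order 3 (32kB) request that is not classed as high order, which matches the reasoning here.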

Note that most userspace allocations don't use high-order requests. Typically it's the kernel that requires contiguous regions of memory. An exception to this may be when userspace is using hugepages - but that isn't the case here.

In your case the order 3 allocation is made by the kernel wanting to queue a packet into the network stack, which requires a 32kB allocation to do so.

The zone being selected.

The kernel divides your memory regions into zones. This chopping up is done because on x86 certain regions of memory are only addressable by certain hardware. Older hardware may only be able to address memory in the 'DMA' zone for example. When we want to allocate some memory, first a zone is chosen and only the free memory from this zone is accounted for when making an allocation decision.

Whilst I'm not completely up to speed on the zone selection algorithm, the typical use-case is never to allocate from DMA, but to usually select the lowest addressable zone that could satisfy the request.

Lots of zone information is spat out during OOM which can also be gleaned from /proc/zoneinfo.

Dec 27 09:19:05 2013 kernel: : [277622.359382] DMA free:2332kB min:36kB low:44kB high:52kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15968kB managed:6960kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:8kB slab_unreclaimable:288kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Dec 27 09:19:05 2013 kernel: : [277622.359393] Normal free:114488kB min:3044kB low:3804kB high:4564kB active_anon:0kB inactive_anon:0kB active_file:252kB inactive_file:256kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:894968kB managed:587540kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:117712kB slab_unreclaimable:138616kB kernel_stack:11976kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:982 all_unreclaimable? yes
Dec 27 09:19:05 2013 kernel: : [277622.359404] HighMem free:27530668kB min:512kB low:48272kB high:96036kB active_anon:2634060kB inactive_anon:217596kB active_file:4688452kB inactive_file:1294168kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:36828872kB managed:36828872kB mlocked:0kB dirty:0kB writeback:0kB mapped:183132kB shmem:39400kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:430856kB unstable:0kB bounce:367564104kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

The zones you have (DMA, Normal and HighMem) indicate a 32-bit platform, because the HighMem zone is non-existent on 64-bit. Also, on 64-bit systems Normal is mapped to 4GB and beyond, whereas on 32-bit it maps up to 896MB (although in your case the kernel reports managing only a smaller portion than this: managed:587540kB).

It's possible to tell where this allocation came from by looking at the first line again: gfp_mask=0x42d0 tells us what type of allocation was done. The last hex digit (0) tells us that this is an allocation from the normal zone. The gfp meanings are located in include/linux/gfp.h.
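
To make the "last digit" point concrete, here is a rough sketch that looks only at the zone-modifier bits in the low four bits of the mask. The bit values are the ones I understand include/linux/gfp.h to use for kernels of this era (check your own tree), the constant and function names are mine, and the real kernel zone selection is more involved than this (for example, it treats the movable bit specially).

```python
# Zone modifier bits (low four bits of the gfp mask).
# Assumption: values as found in include/linux/gfp.h; verify against your kernel.
GFP_DMA = 0x01
GFP_HIGHMEM = 0x02
GFP_DMA32 = 0x04
GFP_MOVABLE = 0x08  # not used in this simplified lookup


def requested_zone(gfp_mask: int) -> str:
    """Simplified: name the zone implied by the zone-modifier bits."""
    if gfp_mask & GFP_DMA:
        return "DMA"
    if gfp_mask & GFP_DMA32:
        return "DMA32"
    if gfp_mask & GFP_HIGHMEM:
        return "HighMem"
    return "Normal"  # no zone modifier set


print(requested_zone(0x42d0))
```

Since 0x42d0 has none of the low four bits set, the sketch falls through to Normal.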

The watermarks this zone uses.

When memory is low, the actions taken to reclaim it are governed by the watermarks. They show up here: min:3044kB low:3804kB high:4564kB. If free memory drops to 'low', background reclaim (including swapping) runs until we pass the 'high' threshold. If memory reaches 'min', we need to kill stuff in order to free up memory via the OOM-killer.
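
A naive comparison of the zone's free figure against these watermarks can be sketched as below. The input is the start of the Normal zone line from the report (truncated for brevity); the function names and the wording of the returned states are my own.

```python
import re


def parse_watermarks(zone_line: str) -> dict:
    """Extract free/min/low/high (in kB) from one zone line of an OOM report."""
    found = re.findall(r"\b(free|min|low|high):(\d+)kB", zone_line)
    return {name: int(value) for name, value in found}


def pressure(w: dict) -> str:
    """Classify free memory against the watermarks, per the description above."""
    if w["free"] <= w["min"]:
        return "at or below min"
    if w["free"] <= w["low"]:
        return "at or below low"
    if w["free"] < w["high"]:
        return "between low and high"
    return "above high"


# Start of the Normal zone line from the report (rest of the line omitted).
line = "Normal free:114488kB min:3044kB low:3804kB high:4564kB active_anon:0kB"
w = parse_watermarks(line)
print(w, pressure(w))
```

On this naive view the Normal zone looks comfortably above its 'high' watermark, which is exactly why the OOM looks puzzling at first.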

Fragmentation in the zone.

In order to see whether a request for a specific order of memory can be satisfied, the kernel keeps track of how many free blocks are available at each order. This is readable in /proc/buddyinfo. OOM-killer reports additionally spit out the buddyinfo too, as seen here:

Normal: 5360*4kB (UEM) 3667*8kB (UEM) 3964*16kB (UEMR) 13*32kB (MR) 0*64kB 1*128kB (R) 1*256kB (R) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 115000kB

For a memory allocation to be satisfied there must be free memory available in the order size requested or in a higher order. Having lots and lots of free blocks in the low orders and none in the higher orders means your memory is fragmented. If you get a very high order allocation it's possible (even with lots of free memory) for it to not be satisfied due to there being no high-order blocks available. The kernel can defragment memory (this is called memory compaction) by moving lots of low-order pages around so they don't leave gaps in the addressable RAM space.
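
To read the buddy line above programmatically, a small sketch like this is enough. It parses the `count*sizekB` pairs as they appear in the OOM report (not the /proc/buddyinfo format, which lists bare counts); the function name is made up.

```python
import re


def parse_buddy_line(line: str) -> dict:
    """Map block size in kB -> number of free blocks of that size."""
    pairs = re.findall(r"(\d+)\*(\d+)kB", line)
    return {int(size): int(count) for count, size in pairs}


line = ("Normal: 5360*4kB (UEM) 3667*8kB (UEM) 3964*16kB (UEMR) 13*32kB (MR) "
        "0*64kB 1*128kB (R) 1*256kB (R) 0*512kB 0*1024kB 0*2048kB 0*4096kB "
        "= 115000kB")
blocks = parse_buddy_line(line)

total_kb = sum(size * count for size, count in blocks.items())
big_enough = sum(count for size, count in blocks.items() if size >= 32)
print(total_kb, big_enough)
```

The total matches the 115000kB printed at the end of the line, and only 15 of the free blocks are 32kB or larger; everything else sits in orders 0 to 2.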

OOM-killer was invoked? Why?

So, if we take these things into account, we can say the following;

  • A 32kB contiguous allocation was attempted from the normal zone.
  • There was enough free memory in the zone selected.
  • There was order 3, 5 and 6 memory available 13*32kB (MR) 1*128kB (R) 1*256kB (R)

So, there was free memory, and other orders could have satisfied the request. What happened?

Well, there is more to allocating from an order than just checking the amount of free memory available for that order or higher. The kernel effectively subtracts memory from all lower orders from the total free line and then performs the min watermark check on what is left.

In your case, to work out the free memory that counts for this request in that zone, we must do:

115000 - (5360*4) - (3667*8) - (3964*16) = 800

This amount of free memory is checked against the min watermark, which is 3044. Thus, technically speaking -- you have no free memory left to do the allocation you requested. And this is why you invoked OOM-killer.
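
The same arithmetic as a small sketch, following only the simplified model described above (subtract everything in orders below the request, then compare with 'min'). The real kernel check has more detail that is not modelled here, and the function name is invented for this example.

```python
def simplified_watermark_check(free_blocks_kb: dict, request_kb: int, min_kb: int):
    """Return (usable_kb, passes) under the simplified model described above.

    free_blocks_kb maps block size in kB -> number of free blocks of that size.
    """
    total = sum(size * count for size, count in free_blocks_kb.items())
    lower = sum(size * count for size, count in free_blocks_kb.items()
                if size < request_kb)
    usable = total - lower
    return usable, usable > min_kb


# Free block counts for the Normal zone, taken from the OOM report.
normal = {4: 5360, 8: 3667, 16: 3964, 32: 13, 64: 0, 128: 1,
          256: 1, 512: 0, 1024: 0, 2048: 0, 4096: 0}

usable, ok = simplified_watermark_check(normal, request_kb=32, min_kb=3044)
print(usable, ok)  # 800 False
```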


Fixing

There are two fixes. Upgrading to 64-bit changes your zone partitioning such that 'Normal' runs from 4GB up to 36GB, so you won't end up 'defaulting' your memory allocations into a zone which can get so heavily fragmented. It's not that you have more addressable memory that fixes this problem (because you're using PAE already), merely that the zone you select from has more addressable memory.

The second way (which I have never tested) is to try to get the kernel to more aggressively compact your memory.

If you change the value of vm.extfrag_threshold from 500 to 100, it's more likely to compact memory in an attempt to honour a high-order allocation. I have never messed with this value before, though, and it will also depend on what your fragmentation index is, which is available in /sys/kernel/debug/extfrag/extfrag_index. I don't have a box at the moment with a new enough kernel to see what that shows, so I can't offer more than this.

Alternatively you could run some sort of cron job (this is horribly, horribly ugly) to manually compact memory yourself by writing into /proc/sys/vm/compact_memory.
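
For what it's worth, the cron job would only need to do something like the sketch below. It assumes a kernel built with compaction support (otherwise the file does not exist) and root privileges; the function name is made up, and I have not tested this against the problem described here.

```python
from pathlib import Path

COMPACT_MEMORY = Path("/proc/sys/vm/compact_memory")


def trigger_compaction(path: Path = COMPACT_MEMORY) -> bool:
    """Write '1' to the compaction trigger; return False if that is not possible."""
    try:
        path.write_text("1\n")
    except OSError:
        # Missing file (no compaction support) or insufficient privileges.
        return False
    return True


if __name__ == "__main__":
    if trigger_compaction():
        print("compaction requested")
    else:
        print("could not trigger compaction")
```

Taking the path as a parameter is just so the function can be exercised against an ordinary file without touching /proc.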

In all honesty though, I don't think there really is a way to tune the system to avoid this problem -- it's the nature of the memory allocator to work this way. Changing the architecture of the platform you use is probably the only fundamental solution.

Solution 2:

To start with: you should really go for a 64-bit operating system. Do you have a good reason to stay at 32-bit here?

It is hard to diagnose this problem without taking a look at the system more closely, preferably around the time it fails, so my (quick) post is more or less generically aimed at memory issues on 32-bit systems. Did I mention going 64-bit would make this all go away?

Your problem is three-fold.

First of all, even on a PAE kernel, the per process address space is limited to 4GiB[1]. This means that your squid instance will never be able to eat more than 4GiB of RAM per process. I'm not that familiar with squid, but if this is your main proxy server, that might not be enough anyway.

Second, on a 32-bit system with vast amounts of RAM, a lot of memory in what is called 'ZONE_NORMAL' is used to store data structures that are needed to use memory in ZONE_HIGHMEM. These data structures cannot be moved into ZONE_HIGHMEM themselves, because the memory the kernel uses for its own purposes must always be in ZONE_NORMAL (i.e. in the first 1GiB-ish). The more memory you have in ZONE_HIGHMEM (a lot, in your case), the more this becomes a problem, because the kernel then needs more and more memory from ZONE_NORMAL to manage ZONE_HIGHMEM. As the amount of free memory in ZONE_NORMAL dries up, your system may fail at some tasks, because ZONE_NORMAL is where a lot of stuff happens on a 32-bit system. All the kernel related memory operations, for example ;)
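
To get a feel for the scale, here is a back-of-the-envelope sketch for just one of those data structures, the per-page descriptor array. The 'present' figures come from the zone lines in the OOM report; the 32 bytes per page descriptor is an ASSUMPTION (a commonly quoted size for 32-bit kernels, it varies with kernel configuration), and the function name is made up.

```python
PAGE_SIZE_KB = 4          # 4kB pages
STRUCT_PAGE_BYTES = 32    # ASSUMPTION: per-page descriptor size on 32-bit

# 'present' values (kB) per zone, from the OOM report.
present_kb = {"DMA": 15968, "Normal": 894968, "HighMem": 36828872}


def page_descriptor_bytes(total_kb: int,
                          struct_page_bytes: int = STRUCT_PAGE_BYTES) -> int:
    """Bytes needed for one descriptor per physical page."""
    pages = total_kb // PAGE_SIZE_KB
    return pages * struct_page_bytes


total_kb = sum(present_kb.values())
cost = page_descriptor_bytes(total_kb)
print(f"{total_kb // PAGE_SIZE_KB} pages -> {cost / 2**20:.0f} MiB of descriptors")
```

Under that assumption the descriptors alone take roughly 288 MiB, all of which has to live in low memory. That may account for much of the gap between present:894968kB and managed:587540kB reported for the Normal zone, though I have not verified this on the system in question.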

Third, even if there is some memory left in ZONE_NORMAL (I haven't gone through your logs in detail), some memory operations require unfragmented memory. For example, if all your memory is fragmented into really small pieces, operations that need larger contiguous pieces will fail. [3] A brief look at your logs does show a fairly significant amount of fragmentation in ZONE_DMA and ZONE_NORMAL.

Edit: Mlfe's answer above has an excellent explanation of how this works in detail.

Again: on a 64-bit system, effectively all memory above the small low DMA zones is in ZONE_NORMAL. There is no HIGHMEM zone on 64-bit systems. Problem solved.

Edit: You could take a look here [4] to see if you can tell oom-killer to leave your important processes alone. That will not solve everything (if anything at all), but it might be worth a try.

[1] http://en.wikipedia.org/wiki/Physical_address_extension#Design

[2] http://www.redhat.com/archives/rhelv5-list/2008-September/msg00237.html and https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Hardware_Architectures_and_Linux_Kernels-a32_bit_Architecture_and_the_hugemem_Kernel.html

[3] http://bl0rg.krunch.be/oom-frag.html

[4] http://lwn.net/Articles/317814/


Solution 3:

@MIfe already provided an excellent write-up of how memory allocations in the kernel are handled, along with a proper solution (switching to a 64-bit OS) and a nasty hack (manual memory compaction via /proc/sys/vm/compact_memory from cron).

My 2 cents would be another workaround that may help you:
I've noticed that you have tcp_tso_segment in your kernel backtrace, so doing:

# ethtool -K ethX tso off gso off lro off

can decrease pressure on the memory manager by making the network stack use lower-order allocations.

PS. A list of all offloads can be obtained via # ethtool -k ethX


Solution 4:

The panic is because the sysctl "vm.panic_on_oom = 1" is set -- the idea is that rebooting the system returns it to a sane state. You can change this in sysctl.conf.

Right at the top we read squid invoked oom killer. You might check your squid configuration and its maximum memory usage (or just move to a 64-bit OS).

/proc/meminfo shows the high memory zone in use, so you are running a 32-bit kernel with 36GB of memory. You can also see that in the normal zone, in order to meet squid's demand for memory, the kernel scanned 982 pages without success:

pages_scanned:982 all_unreclaimable? yes