Out-Of-Memory in LAMP server - CentOS 5.9 x86 - Why now?

aks
Posts: 3073
Joined: 2014/09/20 11:22:14

Re: Out-Of-Memory in LAMP server - CentOS 5.9 x86 - Why now?

Post by aks » 2015/08/18 19:45:43

I thought top's -m showed virtual memory (as in RSS + swap). Surely you are most interested in RSS?

InitOrNot
Posts: 122
Joined: 2015/06/10 18:26:51

Re: Out-Of-Memory in LAMP server - CentOS 5.9 x86 - Why now?

Post by InitOrNot » 2015/08/18 20:39:41

aks wrote:I thought top's -m showed virtual memory (as in RSS + swap). Surely you are most interested in RSS?
It is my understanding that the -m parameter in CentOS's version of top (which differs from other tops, for example Debian's) has the effect of sorting by the RES value, i.e., by physical (resident) memory usage.

I guess that whatever process spills into swap and consumes all of it should also be the process with the highest physical RAM consumption. Yeah, I know it does not necessarily have to be that way, but I think the odds of it being so are high.
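
Just as a cross-check, something like this should rank processes by resident memory without relying on top's sort flag at all (a rough sketch; the ps fields are the standard procps ones):

Code: Select all

# Rank processes by resident set size (RSS, in KiB); the biggest RSS
# consumers are the most likely suspects if swap is filling up too.
ps -eo pid,user,rss,vsz,comm --sort=-rss | head -15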

What do you think?

aks
Posts: 3073
Joined: 2014/09/20 11:22:14

Re: Out-Of-Memory in LAMP server - CentOS 5.9 x86 - Why now?

Post by aks » 2015/08/19 16:29:10

From a CentOS 5 VM:

man top:
-m : VIRT/USED toggle
Reports USED (sum of process rss and swap total count) instead of VIRT

InitOrNot
Posts: 122
Joined: 2015/06/10 18:26:51

Re: Out-Of-Memory in LAMP server - CentOS 5.9 x86 - Why now?

Post by InitOrNot » 2015/08/21 08:03:42

aks wrote:From a CentOS 5 VM:

man top:
-m : VIRT/USED toggle
Reports USED (sum of process rss and swap total count) instead of VIRT
From my CentOS 5.9 x86 server:

Code: Select all

(man top)
       -m : Sort by memory usage
            This switch makes top to sort the processes by allocated memory
In my tests, the -m parameter does the same as pressing 'M' interactively on top's screen (see the cross-check sketch after the excerpt below):

Code: Select all

(man top)
       SORTING of task window
         For compatibility, this top supports most of the former top sort keys.  Since this 
         is primarily a service to former top users, these commands do not appear on any help screen.
            command   sorted field                  supported
              A         start time (non-display)      No
              M         %MEM                          Yes
              N         PID                           Yes
              P         %CPU                          Yes
              T         TIME+                         Yes

Also, I've moved these two commands to a plain user's crontab, since root permissions are not needed to run them (a sketch for reviewing the snapshots follows after the crontab listing):

Code: Select all

$ crontab -l
*/5 * * * * top -b -n1 -m | head -30 > /tmp/top-sample_`date +\%Y-\%m-\%d_\%H-\%M-\%S`_.txt
56 20 * * * find /tmp/top-sample_* -type f -mtime +2 -print0 | xargs -r -0 rm
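
If the problem reappears, the idea is to pull the trend back out of those snapshots with something along these lines (hypothetical one-liners; the file names just follow the pattern above):

Code: Select all

# How swap usage evolved across the captured snapshots
grep -H '^Swap:' /tmp/top-sample_*.txt

# Top three memory consumers in the most recent snapshot
# (the snapshots are already sorted by memory thanks to -m)
grep -A 3 'PID USER' "$(ls -t /tmp/top-sample_*.txt | head -1)"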

So far the server is mostly idle; this is the latest report, from just a few minutes ago:

Code: Select all

$ cat /tmp/top-sample_2015-08-21_10-00-01_.txt
top - 10:00:01 up 3 days, 10:30,  6 users,  load average: 0.02, 0.07, 0.08
Tasks:  91 total,   1 running,  90 sleeping,   0 stopped,   0 zombie
Cpu(s):  7.1%us,  1.2%sy,  0.0%ni, 88.5%id,  3.2%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   2075016k total,  2015992k used,    59024k free,    63016k buffers
Swap:  2097144k total,       72k used,  2097072k free,  1081352k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13948 apache    18   0  137m  61m 8464 S  0.0  3.0   1:35.08 httpd
 2400 root      18   0 94216  60m  50m S  0.0  3.0   0:11.67 httpd
12772 apache    15   0  133m  56m 8536 S  0.0  2.8   1:57.91 httpd
15021 apache    15   0  133m  56m 8380 S  0.0  2.8   1:11.29 httpd
15032 apache    15   0  132m  56m 8516 S  0.0  2.8   1:14.92 httpd
12773 apache    18   0  132m  56m 8456 S  0.0  2.8   1:50.49 httpd
13971 apache    15   0  132m  56m 8452 S  0.0  2.8   1:37.29 httpd
17695 apache    15   0  132m  56m 8220 S  0.0  2.8   0:18.12 httpd
15009 apache    16   0  132m  55m 8372 S  0.0  2.8   1:09.46 httpd
13967 apache    15   0  132m  55m 8404 S  0.0  2.7   1:44.37 httpd
15027 apache    15   0  131m  55m 8408 S  0.0  2.7   1:09.44 httpd
13969 apache    17   0  131m  55m 8456 S  0.0  2.7   1:42.23 httpd
11846 apache    15   0  130m  53m 8572 S  0.0  2.7   2:12.39 httpd
15004 apache    15   0  128m  52m 8448 S  0.0  2.6   1:08.06 httpd
15008 apache    17   0  128m  51m 8464 S  0.0  2.6   1:11.67 httpd
15020 apache    15   0  127m  51m 8416 S  0.0  2.5   1:00.51 httpd
17414 apache    18   0  124m  49m 8196 S  0.0  2.5   0:20.22 httpd
17415 apache    15   0  124m  49m 8196 S  0.0  2.5   0:21.10 httpd
17413 apache    18   0  126m  49m 8228 S  0.0  2.5   0:23.77 httpd
17416 apache    15   0  124m  49m 8120 S  0.0  2.4   0:29.15 httpd
17427 apache    20   0  126m  49m 8372 S  0.0  2.4   0:29.11 httpd
 5176 mysql     15   0  151m  33m 4908 S  0.0  1.7  30:47.08 mysqld
 2169 ntp       15   0  4532 4528 3516 S  0.0  0.2   0:00.12 ntpd

InitOrNot
Posts: 122
Joined: 2015/06/10 18:26:51

Re: Out-Of-Memory in LAMP server - CentOS 5.9 x86 - Why now?

Post by InitOrNot » 2015/09/04 23:47:07

InitOrNot wrote:Here is a pastebin of the relevant logs; can anyone spot anything out of the ordinary in them?

http://pastebin.com/raw.php?i=V3Ps2vNC
I've been digging around, and it may be that I have hit a bug in the Linux kernel: the system stalls in an infinite loop when it reaches an OOM condition because the OOM killer cannot finish killing its candidate processes. It appears there is a design fault in the kernel's memory-management subsystem that can trigger this problem in certain corner cases, and that kernel problem is still unresolved.

More info here: https://lwn.net/Articles/627419/

If you review my pasted logs above, you will see that the OOM-killer is triggered and starts killing processes, apparently with success (the Postfix master, several Apache httpd children), until it tries to kill mysqld (at 01:53:12) but cannot finish that kill. From that point onwards the system stalls at 100% consumption (from 03:23:49 onwards, with the message "INFO: task mysqld:9402 blocked for more than 120 seconds" repeating several times), until I had to power-cycle the system at 14:52:14.

So that explains why the OOM-killer did NOT return the system to a usable condition (albeit with its main application services killed).

I still have to find out which process, MySQL or Apache, was the one that consumed all the RAM... The problem has not reoccurred since I upped the RAM to 2 GB.
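
For anyone wanting to check their own logs for the same pattern, these are the kinds of messages I was grepping for (a sketch; the exact wording can differ between kernel versions, and /var/log/messages is simply where my syslog goes):

Code: Select all

# OOM-killer activity followed by hung-task warnings in syslog
grep -E 'invoked oom-killer|Out of memory|blocked for more than 120 seconds' /var/log/messages*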


Additional info: https://lwn.net/Articles/627436/
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 22 Dec 2014 17:16:43 -0500
Subject: [patch] mm: page_alloc: avoid page allocation vs. OOM killing deadlock

The page allocator per default does not ever give up on allocations up
to order 3, and instead it keeps the allocating task in a loop running
direct reclaim and invoking the OOM killer. The assumed reason behind
this decade-old behavior is that the system is unusable once orders of
such small size start to fail, and the allocator might as well keep
killing processes one-by-one until the situation is rectified.

However, the allocating task itself might be holding locks that the
OOM victim might need to exit, and, to preserve the emergency memory
reserves, the OOM killer doesn't move on to the next victim until the
first choice has exited. The result is a deadlock between the task
that is trying to allocate and the OOM victim that can't be resolved
without a third party exiting or volunteering unreclaimable memory.
(...snip...)
