Analysis of the Linux Kernel OOM Mechanism

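1. Scenario

A production mongos instance was killed by the OOM killer, so I took some time to study how the Linux kernel's OOM mechanism works, to make similar incidents easier to analyze in the future.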

$ sudo grep mongos /var/log/messages 
Apr 10 15:35:38 localhost sz[32066]: [xxxx] check_mongos.sh/ZMODEM: 211 Bytes, 229 BPS
Apr 23 14:50:18 localhost sz[5794]: [xxxxx] mongos/ZMODEM: 297 Bytes, 151 BPS
Apr 23 15:01:55 localhost kernel: [20387]   497 20387   694326   427932   0       0             0 mongos
Apr 23 15:01:55 localhost kernel: Out of memory: Kill process 20387 (mongos) score 890 or sacrifice child
Apr 23 15:01:55 localhost kernel: Killed process 20387, UID 497, (mongos) total-vm:2777304kB, anon-rss:1711700kB, file-rss:28kB

The machine running mongos ran out of memory, which triggered the kernel's OOM mechanism, and the kernel killed the mongos process.
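The numbers in the log can be cross-checked. As a minimal sketch, assuming the fourth and fifth columns of the per-task line are total_vm and rss counted in pages (as the dump_tasks comment later in this article suggests) and assuming a 4 kB page size, which the log itself does not state, the page counts convert to the kB figures in the "Killed process" line. The helper name pages_to_kb is made up for this illustration.

```c
#include <assert.h>
#include <limits.h>

/* Assumption: 4 kB pages (typical on x86); the log does not state the page size. */
#define ASSUMED_PAGE_KB 4UL

/* Hypothetical helper: convert a page count from the per-task dump into kB. */
unsigned long pages_to_kb(unsigned long pages)
{
    /* Guard against overflow before multiplying. */
    assert(pages <= ULONG_MAX / ASSUMED_PAGE_KB);
    return pages * ASSUMED_PAGE_KB;
}
```

With the values from the log, pages_to_kb(694326) gives 2777304, matching total-vm:2777304kB, and pages_to_kb(427932) gives 1711728, which equals anon-rss plus file-rss (1711700 + 28).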

Download the Linux kernel source and look at the OOM-related code.

The relevant code is in oom_kill.c:

linux-2.6.32.65/mm/oom_kill.c

/*
 *  linux/mm/oom_kill.c
 * 
 *  Copyright (C)  1998,2000  Rik van Riel
 *      Thanks go out to Claus Fischer for some serious inspiration and
 *      for goading me into coding this file...
 *
 *  The routines in this file are used to kill a process when
 *  we're seriously out of memory. This gets called from __alloc_pages()
 *  in mm/page_alloc.c when we really run out of memory.
 *
 *  Since we won't call these routines often (on a well-configured
 *  machine) this file will double as a 'coding guide' and a signpost
 *  for newbie kernel hackers. It features several pointers to major
 *  kernel subsystems and hints as to where to find out what things do.
 */

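The routines in this file select and kill a process when the system is seriously out of memory. On a well-configured machine they are rarely called.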

/**
 * badness - calculate a numeric value for how bad this task has been
 * @p: task struct of which task we should calculate
 * @uptime: current uptime in seconds
 *
 * The formula used is relatively simple and documented inline in the
 * function. The main rationale is that we want to select a good task
 * to kill when we run out of memory.
 *
 * Good in this context means that:
 * 1) we lose the minimum amount of work done
 * 2) we recover a large amount of memory
 * 3) we don't kill anything innocent of eating tons of memory
 * 4) we want to kill the minimum amount of processes (one)
 * 5) we try to kill the process the user expects us to kill, this
 *    algorithm has been meticulously tuned to meet the principle
 *    of least surprise ... (be careful when you change it)
 */

unsigned long badness(struct task_struct *p, unsigned long uptime)

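The badness function computes a score for each process that describes how "bad" the task is.

A good task to kill is one for which:

1) the least amount of completed work is lost

2) a large amount of memory is recovered

3) no process that is innocent of eating large amounts of memory gets killed

4) as few processes as possible are killed (ideally one)

5) the process killed is the one the user would expect to be killed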

        unsigned long points, cpu_time, run_time;
        struct mm_struct *mm;
        struct task_struct *child;
        int oom_adj = p->signal->oom_adj;
        struct task_cputime task_time;
        unsigned long utime;
        unsigned long stime;

        /*
         * The memory size of the process is the basis for the badness.
         */
        points = mm->total_vm;

The memory size of the process (mm->total_vm, its total virtual memory in pages) is the basis of the badness score.

        /*
         * swapoff can easily use up all memory, so kill those first.
         */
        if (p->flags & PF_OOM_ORIGIN)
                return ULONG_MAX;

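swapoff can easily use up all memory, so a task with PF_OOM_ORIGIN set gets the maximum score (ULONG_MAX) and is killed first.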

        /*
         * Processes which fork a lot of child processes are likely
         * a good choice. We add half the vmsize of the children if they
         * have an own mm. This prevents forking servers to flood the
         * machine with an endless amount of children. In case a single
         * child is eating the vast majority of memory, adding only half
         * to the parents will make the child our kill candidate of choice.
         */
        list_for_each_entry(child, &p->children, sibling) {
                task_lock(child);
                if (child->mm != mm && child->mm)
                        points += child->mm->total_vm/2 + 1;
                task_unlock(child);
        }

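Processes that fork many children are good candidates. For each child that has its own mm, half of the child's total_vm (plus one) is added to the parent's score. This keeps a forking server from flooding the machine with children, while a single child that eats most of the memory still scores higher than its parent and becomes the preferred victim.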
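The child accounting above can be sketched in user space as follows. This is an illustration under assumptions, not the kernel code: struct child_info and add_children_vm are hypothetical names, and an integer mm_id stands in for the kernel's mm pointer (0 meaning the child has no mm).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of a child task that matter here. */
struct child_info {
    int mm_id;              /* stand-in for child->mm; 0 means "no mm" */
    unsigned long total_vm; /* child's virtual memory size in pages */
};

/*
 * Add half of each child's total_vm (plus one) to the parent's score,
 * but only for children that have an mm of their own.
 */
unsigned long add_children_vm(unsigned long points, int parent_mm_id,
                              const struct child_info *children, size_t n)
{
    size_t i;

    assert(children != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (children[i].mm_id != 0 && children[i].mm_id != parent_mm_id)
            points += children[i].total_vm / 2 + 1;
    }
    return points;
}
```

For a parent with 100 points, a child with its own mm and a total_vm of 1000 adds 501, while a child that shares the parent's mm or has no mm at all adds nothing.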

        /*
         * CPU time is in tens of seconds and run time is in thousands
         * of seconds. There is no particular reason for this other than
         * that it turned out to work very well in practice.
         */
        thread_group_cputime(p, &task_time);
        utime = cputime_to_jiffies(task_time.utime);
        stime = cputime_to_jiffies(task_time.stime);
        cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
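The excerpt above stops after computing cpu_time. As far as I recall from the 2.6.32 source, badness() then also derives run_time (seconds since the task started, shifted right by 10) and divides the score by the integer square root of cpu_time and by the fourth root of run_time, so that long-running tasks that have used a lot of CPU score lower. The sketch below models only that scaling step in user space. isqrt_ul and scale_by_time are hypothetical names, and the division rule is stated from memory as an assumption that should be checked against the source.

```c
#include <assert.h>

/* Integer square root (floor), standing in for the kernel's int_sqrt(). */
static unsigned long isqrt_ul(unsigned long x)
{
    unsigned long result = 0;
    /* Highest power of four representable in an unsigned long. */
    unsigned long bit = 1UL << (sizeof(unsigned long) * 8 - 2);

    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= result + bit) {
            x -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return result;
}

/*
 * Assumed scaling step: cpu_time is in units of roughly 8 seconds
 * (jiffies >> (SHIFT_HZ + 3)), run_time in units of 1024 seconds.
 */
unsigned long scale_by_time(unsigned long points,
                            unsigned long cpu_time,
                            unsigned long run_time)
{
    unsigned long scaled = points;

    if (cpu_time)
        scaled /= isqrt_ul(cpu_time);
    if (run_time)
        scaled /= isqrt_ul(isqrt_ul(run_time));

    /* Dividing can only shrink the score. */
    assert(scaled <= points);
    return scaled;
}
```

Under these assumptions, a score of 1000 with cpu_time = 16 and run_time = 16 is divided by 4 and then by 2, leaving 125.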

        /*
         * Niced processes are most likely less important, so double
         * their badness points.
         */
        if (task_nice(p) > 0)
                points *= 2;

Processes with a positive nice value are most likely less important, so their badness score is doubled.

        /*
         * Superuser processes are usually more important, so we make it
         * less likely that we kill those.
         */
        if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
            has_capability_noaudit(p, CAP_SYS_RESOURCE))
                points /= 4;

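Processes run by the superuser (those with CAP_SYS_ADMIN or CAP_SYS_RESOURCE) are usually more important, so their score is divided by 4 to make them less likely to be killed.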

        /*
         * We don't want to kill a process with direct hardware access.
         * Not only could that mess up the hardware, but usually users
         * tend to only have this flag set on applications they think
         * of as important.
         */
        if (has_capability_noaudit(p, CAP_SYS_RAWIO))
                points /= 4;

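Processes with direct hardware access (CAP_SYS_RAWIO) also have their score divided by 4, which makes them less likely to be killed.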

        /*
         * If p's nodes don't overlap ours, it may still help to kill p
         * because p may have allocated or otherwise mapped memory on
         * this node before. However it will be less likely.
         */
        if (!has_intersects_mems_allowed(p))
                points /= 8;
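Taken together, the nice, capability, and memory-node checks above form a chain of multiplications and divisions. A minimal user-space sketch follows, assuming a hypothetical struct task_traits whose boolean fields stand in for the results of task_nice(), has_capability_noaudit(), and has_intersects_mems_allowed(). The function name apply_trait_adjustments is also invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the per-task checks made in badness(). */
struct task_traits {
    bool niced;        /* task_nice(p) > 0 */
    bool superuser;    /* has CAP_SYS_ADMIN or CAP_SYS_RESOURCE */
    bool raw_io;       /* has CAP_SYS_RAWIO */
    bool shares_nodes; /* memory nodes overlap with the allocating task */
};

/* Apply the adjustments in the same order as the excerpts above. */
unsigned long apply_trait_adjustments(unsigned long points,
                                      const struct task_traits *t)
{
    assert(t != NULL);

    if (t->niced)
        points *= 2;   /* niced tasks are assumed less important */
    if (t->superuser)
        points /= 4;   /* superuser tasks are usually more important */
    if (t->raw_io)
        points /= 4;   /* avoid killing tasks with hardware access */
    if (!t->shares_nodes)
        points /= 8;   /* killing it is less likely to free useful memory */
    return points;
}
```

Starting from 1024 points, a niced task ends up at 2048, a superuser task with raw I/O access at 64, and a task on non-overlapping memory nodes at 128.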

        /*
         * Adjust the score by oom_adj.
         */
        if (oom_adj) {
                if (oom_adj > 0) {
                        if (!points)
                                points = 1;
                        points <<= oom_adj;
                } else
                        points >>= -(oom_adj);
        }

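The score is then adjusted by oom_adj: a positive value shifts it left (doubling it per step), and a negative value shifts it right (halving it per step).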
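The shift can be tried out in isolation. The sketch below mirrors the excerpt in user space, assuming (as I believe is the case in 2.6.32) that oom_adj ranges from -16 to 15 at this point, with -17 (OOM_DISABLE) filtered out before scoring. The function name apply_oom_adj is hypothetical.

```c
#include <assert.h>

/*
 * Shift the score by oom_adj: left for positive values, right for
 * negative ones. A zero score is bumped to 1 first so that a positive
 * oom_adj still has an effect.
 */
unsigned long apply_oom_adj(unsigned long points, int oom_adj)
{
    /* Assumed valid range; -17 (OOM_DISABLE) is expected to be handled earlier. */
    assert(oom_adj >= -16 && oom_adj <= 15);

    if (oom_adj > 0) {
        if (!points)
            points = 1;
        points <<= oom_adj;
    } else if (oom_adj < 0) {
        points >>= -oom_adj;
    }
    return points;
}
```

For example, a score of 1000 becomes 8000 with oom_adj = 3 and 250 with oom_adj = -2, so each step of oom_adj doubles or halves the score.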

#ifdef DEBUG
        printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
        p->pid, p->comm, points);
#endif
        return points;
}

When DEBUG is defined the score is logged; the function then returns it.

/*
 * Simple selection loop. We chose the process with the highest
 * number of 'points'. We expect the caller will lock the tasklist.
 *
 * (not docbooked, we don't want this one cluttering up the manual)
 */

This is a simple selection loop that picks the process with the highest score.

                /*
                 * skip kernel threads and tasks which have already released
                 * their mm.
                 */
                if (!p->mm)
                        continue;
                /* skip the init task */
                if (is_global_init(p))
                        continue;
                if (mem && !task_in_mem_cgroup(p, mem))
                        continue;

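Kernel threads and tasks that have already released their mm are skipped, as is the init task. When a memory cgroup is given, tasks outside it are skipped too.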
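The selection loop and its skip rules can be sketched in user space as follows. This is a simplified illustration, not the kernel's function: struct task_info and select_bad_index are hypothetical names, the scores are assumed to be precomputed, and the memory cgroup check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task record with a precomputed badness score. */
struct task_info {
    int pid;
    bool has_mm;          /* false for kernel threads and exited tasks */
    bool is_init;         /* true for the init task */
    unsigned long points; /* badness score */
};

/*
 * Return the index of the task with the highest score, skipping tasks
 * without an mm and the init task. Returns -1 if nothing is eligible.
 */
int select_bad_index(const struct task_info *tasks, size_t n)
{
    int chosen = -1;
    unsigned long best = 0;
    size_t i;

    assert(tasks != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!tasks[i].has_mm)
            continue; /* kernel thread or mm already released */
        if (tasks[i].is_init)
            continue; /* never pick init */
        if (chosen < 0 || tasks[i].points > best) {
            chosen = (int)i;
            best = tasks[i].points;
        }
    }
    return chosen;
}
```

Given an init task with 9000 points, a kernel thread with 5000, and two ordinary tasks with 100 and 700, the function returns the index of the task with 700 points, because the first two are never considered.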

/**
 * dump_tasks - dump current memory state of all system tasks
 * @mem: target memory controller
 *
 * Dumps the current memory state of all system tasks, excluding kernel threads.
 * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
 * score, and name.
 *
 * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are
 * shown.
 *
 * Call with tasklist_lock read-locked.
 */

static void dump_tasks(const struct mem_cgroup *mem)

/*
 * Send SIGKILL to the selected  process irrespective of  CAP_SYS_RAW_IO
 * flag though it's unlikely that  we select a process with CAP_SYS_RAW_IO
 * set.
 */
static void __oom_kill_task(struct task_struct *p, int verbose)

        /*
         * If the task is already exiting, don't alarm the sysadmin or kill
         * its children or threads, just set TIF_MEMDIE so it can die quickly
         */

 /* Try to kill a child first */

/**
 * out_of_memory - kill the "best" process when we run out of memory
 * @zonelist: zonelist pointer
 * @gfp_mask: memory allocation flags
 * @order: amount of memory being requested as a power of 2
 *
 * If we run out of memory, we have the choice between either
 * killing a random task (bad), letting the system crash (worse)
 * OR try to be smart about which process to kill. Note that we
 * don't have to be perfect here, we just have to be good.
 */

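out_of_memory is called when the system runs out of memory and kills the "best" candidate process.

When memory runs out, the kernel can kill a random task (bad), let the system crash (worse), or try to be smart about which process to kill. The choice does not have to be perfect, only good.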

References:

http://blog.chinaunix.net/uid-20788636-id-4308527.html

http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html

Date: 2024-08-04 11:18:13
