Linux: the schedule algorithm in Linux kernel

The Linux kernel uses a scheduling algorithm called CFS (the Completely Fair Scheduler). Most descriptions I found online are unintuitive and hard to read, but this one is plain and easy to follow (simplicity at its best...):

http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

In case the link goes dead, the full text is pasted below:

This is the CFS scheduler.

80% of CFS's design can be summed up in a single sentence: CFS basically
models an "ideal, precise multi-tasking CPU" on real hardware.

"Ideal multi-tasking CPU" is a (non-existent  :-))  CPU that has 100%
physical power and which can run each task at precise equal speed, in
parallel, each at 1/nr_running speed. For example: if there are 2 tasks
running then it runs each at 50% physical power - totally in parallel.

On real hardware, we can run only a single task at once, so while that
one task runs, the other tasks that are waiting for the CPU are at a
disadvantage - the current task gets an unfair amount of CPU time. In
CFS this fairness imbalance is expressed and tracked via the per-task
p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of
time the task should now run on the CPU for it to become completely fair
and balanced.

( small detail: on 'ideal' hardware, the p->wait_runtime value would
  always be zero - no task would ever get 'out of balance' from the
  'ideal' share of CPU time. )

CFS's task picking logic is based on this p->wait_runtime value and it
is thus very simple: it always tries to run the task with the largest
p->wait_runtime value. In other words, CFS tries to run the task with
the 'gravest need' for more CPU time. So CFS always tries to split up
CPU time between runnable tasks as close to 'ideal multitasking
hardware' as possible.
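The accounting and picking logic described above can be sketched in a few lines of Python. This is a toy model with illustrative names (`account`, `pick_next`, a `wait_runtime` dict field); the real kernel code is C and differs in many details:

```python
# Toy model of the CFS fairness bookkeeping (names are illustrative).
# The running task "pays" for CPU time beyond its fair share, while
# waiting tasks accumulate credit; wait_runtime values always sum to 0.

def account(tasks, running, delta_ns):
    """Charge delta_ns of CPU use to `running`, credit the waiters."""
    fair_share = delta_ns / len(tasks)
    for t in tasks:
        if t is running:
            t["wait_runtime"] -= delta_ns - fair_share  # used more than its share
        else:
            t["wait_runtime"] += fair_share             # was owed this much

def pick_next(tasks):
    """CFS picks the task with the 'gravest need': largest wait_runtime."""
    return max(tasks, key=lambda t: t["wait_runtime"])

tasks = [{"name": "A", "wait_runtime": 0}, {"name": "B", "wait_runtime": 0}]
account(tasks, tasks[0], 1_000_000)   # task A runs for 1 ms
```

After A runs for 1 ms, B holds the fairness credit, so `pick_next` selects B, and the per-task imbalances cancel out exactly as on the 'ideal' hardware above.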

Most of the rest of CFS's design just falls out of this really simple
concept, with a few add-on embellishments like nice levels,
multiprocessing and various algorithm variants to recognize sleepers.

In practice it works like this: the system runs a task a bit, and when
the task schedules (or a scheduler tick happens) the task's CPU usage is
'accounted for': the (small) time it just spent using the physical CPU
is deducted from p->wait_runtime [minus the 'fair share' it would have
gotten anyway]. Once p->wait_runtime gets low enough so that another
task becomes the 'leftmost task' of the time-ordered rbtree it maintains
(plus a small amount of 'granularity' distance relative to the leftmost
task so that we do not over-schedule tasks and thrash the cache) then the
new leftmost task is picked and the current task is preempted.

The rq->fair_clock value tracks the 'CPU time a runnable task would have
fairly gotten, had it been runnable during that time'. So by using
rq->fair_clock values we can accurately timestamp and measure the
'expected CPU time' a task should have gotten. All runnable tasks are
sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and
CFS picks the 'leftmost' task and sticks to it. As the system progresses
forwards, newly woken tasks are put into the tree more and more to the
right - slowly but surely giving a chance for every task to become the
'leftmost task' and thus get on the CPU within a deterministic amount of
time.
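The timeline ordering can be modeled with a sorted list standing in for the rbtree (illustrative only; the kernel uses a true red-black tree for O(log n) insert and leftmost lookup, and the class and method names here are made up):

```python
import bisect

# Toy model of the CFS timeline. Tasks are keyed by
# fair_clock - wait_runtime, so the task owed the most CPU time has the
# smallest key and sits 'leftmost'; tasks that wake later get a larger
# fair_clock stamp and land further to the right.

class Timeline:
    def __init__(self):
        self._entries = []                     # sorted (key, name) pairs

    def enqueue(self, name, fair_clock, wait_runtime):
        bisect.insort(self._entries, (fair_clock - wait_runtime, name))

    def pick_leftmost(self):
        return self._entries[0][1] if self._entries else None

tl = Timeline()
tl.enqueue("A", fair_clock=100, wait_runtime=10)  # key 90
tl.enqueue("B", fair_clock=100, wait_runtime=40)  # key 60: gravest need
tl.enqueue("C", fair_clock=120, wait_runtime=5)   # key 115: woke later
```

B, the task owed the most CPU time at the same fair_clock stamp, ends up leftmost and is picked first, while the late waker C is placed to the right.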

Some implementation details:

 - the introduction of Scheduling Classes: an extensible hierarchy of
   scheduler modules. These modules encapsulate scheduling policy
   details and are handled by the scheduler core without the core
   code assuming too much about them.

 - sched_fair.c implements the 'CFS desktop scheduler': it is a
   replacement for the vanilla scheduler's SCHED_OTHER interactivity
   code.

   I'd like to give credit to Con Kolivas for the general approach here:
   he has proven via RSDL/SD that 'fair scheduling' is possible and that
   it results in better desktop scheduling. Kudos Con!

   The CFS patch uses a completely different approach and implementation
   from RSDL/SD. My goal was to make CFS's interactivity quality exceed
   that of RSDL/SD, which is a high standard to meet :-) Testing
   feedback is welcome to decide this one way or another. [ and, in any
   case, all of SD's logic could be added via a kernel/sched_sd.c module
   as well, if Con is interested in such an approach. ]

   CFS's design is quite radical: it does not use runqueues, it uses a
   time-ordered rbtree to build a 'timeline' of future task execution,
   and thus has no 'array switch' artifacts (by which both the vanilla
   scheduler and RSDL/SD are affected).

   CFS uses nanosecond granularity accounting and does not rely on any
   jiffies or other HZ detail. Thus the CFS scheduler has no notion of
   'timeslices' and has no heuristics whatsoever. There is only one
   central tunable:

         /proc/sys/kernel/sched_granularity_ns

   which can be used to tune the scheduler from 'desktop' (low
   latencies) to 'server' (good batching) workloads. It defaults to a
   setting suitable for desktop workloads. SCHED_BATCH is handled by the
   CFS scheduler module too.

   Due to its design, the CFS scheduler is not prone to any of the
   'attacks' that exist today against the heuristics of the stock
   scheduler: fiftyp.c, thud.c, chew.c, ring-test.c and massive_intr.c
   all work fine, do not impact interactivity, and produce the expected
   behavior.

   The CFS scheduler has a much stronger handling of nice levels and
   SCHED_BATCH: both types of workloads should be isolated much more
   aggressively than under the vanilla scheduler.

   ( another detail: due to nanosec accounting and timeline sorting,
     sched_yield() support is very simple under CFS, and in fact under
     CFS sched_yield() behaves much better than under any other
     scheduler I have tested so far. )

 - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
   way than the vanilla scheduler does. It uses 100 runqueues (for all
   100 RT priority levels, instead of 140 in the vanilla scheduler)
   and it needs no expired array.

 - reworked/sanitized SMP load-balancing: the runqueue-walking
   assumptions are gone from the load-balancing code now, and
   iterators of the scheduling modules are used. The balancing code got
   quite a bit simpler as a result.
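The sched_rt.c scheme from the bullet above (one runqueue per RT priority level, no expired array) can be sketched as follows. The class and method names are made up, and the kernel tracks non-empty levels with a bitmap plus a find-first-bit operation, which this model mimics in Python:

```python
from collections import deque

# Toy model of the 100-level RT runqueue scheme (illustrative names).
# A bitmap records which priority levels have runnable tasks, so picking
# the next task is a find-lowest-set-bit plus a queue head lookup.

class RTRunqueues:
    NR_RT_PRIO = 100

    def __init__(self):
        self.queues = [deque() for _ in range(self.NR_RT_PRIO)]
        self.bitmap = 0                          # bit p set => queue p non-empty

    def enqueue(self, prio, task):               # prio 0 = highest priority
        self.queues[prio].append(task)
        self.bitmap |= 1 << prio

    def pick_next(self):
        if not self.bitmap:
            return None
        prio = (self.bitmap & -self.bitmap).bit_length() - 1  # lowest set bit
        return self.queues[prio][0]

    def dequeue(self, prio):
        task = self.queues[prio].popleft()
        if not self.queues[prio]:
            self.bitmap &= ~(1 << prio)          # level emptied: clear its bit
        return task
```

SCHED_RR round-robin would additionally rotate a queue within its own priority level; SCHED_FIFO simply leaves the queue order alone.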

When I have time I'll read *Linux Kernel Development* and *Understanding the Linux Kernel*; both cover scheduling and include code examples. Of course, CFS is certainly not the only scheduling algorithm Linux has used. To be continued.

:)
Date: 2024-10-08 07:18:48
