Linux之epoll实现

  1. /*
  2. * fs/eventpoll.c (Efficient event retrieval implementation)
  3. * Copyright (C) 2001,...,2009 Davide Libenzi
  4. *
  5. * This program is free software; you can redistribute it and/or modify
  6. * it under the terms of the GNU General Public License as published by
  7. * the Free Software Foundation; either version 2 of the License, or
  8. * (at your option) any later version.
  9. *
  10. * Davide Libenzi <[email protected]>
  11. *
  12. */
  13. /*
  14. * 在深入了解epoll的实现之前, 先来了解内核的3个方面.
  15. * 1. 等待队列 waitqueue
  16. * 我们简单解释一下等待队列:
  17. * 队列头(wait_queue_head_t)往往是资源生产者,
  18. * 队列成员(wait_queue_t)往往是资源消费者,
  19. * 当头的资源ready后, 会逐个执行每个成员指定的回调函数,
  20. * 来通知它们资源已经ready了, 等待队列大致就这个意思.
  21. * 2. 内核的poll机制
  22. * 被Poll的fd, 必须在实现上支持内核的Poll技术,
  23. * 比如fd是某个字符设备,或者是个socket, 它必须实现
  24. * file_operations中的poll操作, 给自己分配有一个等待队列头.
  25. * 主动poll fd的某个进程必须分配一个等待队列成员, 添加到
  26. * fd的等待队列里面去, 并指定资源ready时的回调函数.
  27. * 用socket做例子, 它必须有实现一个poll操作, 这个Poll是
  28. * 发起轮询的代码必须主动调用的, 该函数中必须调用poll_wait(),
  29. * poll_wait会将发起者作为等待队列成员加入到socket的等待队列中去.
  30. * 这样socket发生状态变化时可以通过队列头逐个通知所有关心它的进程.
  31. * 这一点必须很清楚的理解, 否则会想不明白epoll是如何
  32. * 得知fd的状态发生变化的.
  33. * 3. epollfd本身也是个fd, 所以它本身也可以被epoll,
  34. * 可以猜测一下它是不是可以无限嵌套epoll下去...
  35. *
  36. * epoll基本上就是使用了上面的1,2点来完成.
  37. * 可见epoll本身并没有给内核引入什么特别复杂或者高深的技术,
  38. * 只不过是已有功能的重新组合, 达到了超过select的效果.
  39. */
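为了把上面第2点说得更具体一点, 下面给出一个极简的字符设备poll实现示意(demo_dev、demo_poll这些名字纯属假设, 并非内核里真实存在的代码), 可以看到poll_wait()正是"把发起poll的人挂到fd的等待队列上"的那一步:

    #include <linux/fs.h>
    #include <linux/poll.h>
    #include <linux/wait.h>

    /* 假设的设备结构: "资源生产者"持有等待队列头 */
    struct demo_dev {
        wait_queue_head_t inq;   /* 等待队列头, 设备初始化时用init_waitqueue_head()初始化 */
        int data_ready;          /* 是否有数据可读 */
    };

    static unsigned int demo_poll(struct file *file, poll_table *wait)
    {
        struct demo_dev *dev = file->private_data;
        unsigned int mask = 0;

        /* 关键一步: 发起poll的一方(select/poll/epoll)通过poll_table里的回调
         * 把自己的等待队列成员挂到dev->inq上, 之后设备一wake_up就能通知到它 */
        poll_wait(file, &dev->inq, wait);

        if (dev->data_ready)
            mask |= POLLIN | POLLRDNORM;
        return mask;
    }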
  40. /*
  41. * 相关的其它内核知识:
  42. * 1. fd我们知道是文件描述符, 在内核态, 与之对应的是struct file结构,
  43. * 可以看作是内核态的文件描述符.
  44. * 2. spinlock, 自旋锁, 必须要非常小心使用的锁,
  45. * 尤其是调用spin_lock_irqsave()的时候, 中断关闭, 不会发生进程调度,
  46. * 被保护的资源其它CPU也无法访问. 这个锁是很强力的, 所以只能锁一些
  47. * 非常轻量级的操作.
  48. * 3. 引用计数在内核中是非常重要的概念,
  49. * 内核代码里面经常有些release, free释放资源的函数几乎不加任何锁,
  50. * 这是因为这些函数往往是在对象的引用计数变成0时被调用,
  51. * 既然没有进程在使用在这些对象, 自然也不需要加锁.
  52. * struct file 是持有引用计数的.
  53. */
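顺带给第2点一个直观的例子, 示意一下spin_lock_irqsave的典型配对写法(纯属示意, 与后文ep->lock的用法一致), 临界区里只做链表摘链/挂链这类轻量级工作:

    spinlock_t lock;            /* 用spin_lock_init()初始化 */
    unsigned long flags;

    spin_lock_irqsave(&lock, flags);      /* 关本地中断并拿锁 */
    /* ... 只做一些很轻量的操作, 比如链表的摘链/挂链 ... */
    spin_unlock_irqrestore(&lock, flags); /* 放锁并恢复中断状态 */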
  54. /* --- epoll相关的数据结构 --- */
  55. /*
  56. * This structure is stored inside the "private_data" member of the file
  57. * structure and rapresent the main data sructure for the eventpoll
  58. * interface.
  59. */
  60. /* 每创建一个epollfd, 内核就会分配一个eventpoll与之对应, 可以说是
  61. * 内核态的epollfd. */
  62. struct eventpoll {
  63. /* Protect the this structure access */
  64. spinlock_t lock;
  65. /*
  66. * This mutex is used to ensure that files are not removed
  67. * while epoll is using them. This is held during the event
  68. * collection loop, the file cleanup path, the epoll file exit
  69. * code and the ctl operations.
  70. */
  71. /* 添加, 修改或者删除监听fd的时候, 以及epoll_wait返回, 向用户空间
  72. * 传递数据时都会持有这个互斥锁, 所以在用户空间可以放心的在多个线程
  73. * 中同时执行epoll相关的操作, 内核级已经做了保护. */
  74. struct mutex mtx;
  75. /* Wait queue used by sys_epoll_wait() */
  76. /* 调用epoll_wait()时, 我们就是"睡"在了这个等待队列上... */
  77. wait_queue_head_t wq;
  78. /* Wait queue used by file->poll() */
  79. /* 这个用于epollfd本身被poll的时候... */
  80. wait_queue_head_t poll_wait;
  81. /* List of ready file descriptors */
  82. /* 所有已经ready的epitem都在这个链表里面 */
  83. struct list_head rdllist;
  84. /* RB tree root used to store monitored fd structs */
  85. /* 所有要监听的epitem都在这里 */
  86. struct rb_root rbr;
  87. /*
  88. * This is a single linked list that chains all the "struct epitem" that
  89. * happened while transfering ready events to userspace w/out
  90. * holding ->lock.
  91. */
  92. struct epitem *ovflist;
  93. /* The user that created the eventpoll descriptor */
  94. /* 这里保存了一些用户变量, 比如fd监听数量的最大值等等 */
  95. struct user_struct *user;
  96. };
  97. /*
  98. * Each file descriptor added to the eventpoll interface will
  99. * have an entry of this type linked to the "rbr" RB tree.
  100. */
  101. /* epitem 表示一个被监听的fd */
  102. struct epitem {
  103. /* RB tree node used to link this structure to the eventpoll RB tree */
  104. /* rb_node, 当使用epoll_ctl()将一批fds加入到某个epollfd时, 内核会分配
  105. * 一批的epitem与fds们对应, 而且它们以rb_tree的形式组织起来, tree的root
  106. * 保存在epollfd, 也就是struct eventpoll中.
  107. * 在这里使用rb_tree的原因我认为是提高查找,插入以及删除的速度.
  108. * rb_tree对以上3个操作都具有O(lgN)的时间复杂度 */
  109. struct rb_node rbn;
  110. /* List header used to link this structure to the eventpoll ready list */
  111. /* 链表节点, 所有已经ready的epitem都会被链到eventpoll的rdllist中 */
  112. struct list_head rdllink;
  113. /*
  114. * Works together "struct eventpoll"->ovflist in keeping the
  115. * single linked chain of items.
  116. */
  117. /* 这个在代码中再解释... */
  118. struct epitem *next;
  119. /* The file descriptor information this item refers to */
  120. /* epitem对应的fd和struct file */
  121. struct epoll_filefd ffd;
  122. /* Number of active wait queue attached to poll operations */
  123. int nwait;
  124. /* List containing poll wait queues */
  125. struct list_head pwqlist;
  126. /* The "container" of this item */
  127. /* 当前epitem属于哪个eventpoll */
  128. struct eventpoll *ep;
  129. /* List header used to link this item to the "struct file" items list */
  130. struct list_head fllink;
  131. /* The structure that describe the interested events and the source fd */
  132. /* 当前的epitem关心哪些events, 这个数据是调用epoll_ctl时从用户态传递过来的 */
  133. struct epoll_event event;
  134. };
  135. struct epoll_filefd {
  136. struct file *file;
  137. int fd;
  138. };
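epitem在红黑树里就是按这个ffd排序的: 先比file指针, 相同时再比fd. 设置和比较用的两个小函数大致如下(凭印象给出, 与真实源码基本一致, 细节以具体内核版本为准):

    /* 把要监听的file和fd记录到epitem->ffd中 */
    static inline void ep_set_ffd(struct epoll_filefd *ffd,
                                  struct file *file, int fd)
    {
        ffd->file = file;
        ffd->fd = fd;
    }

    /* 红黑树的比较函数: 先比较file指针, 相同时再比较fd */
    static inline int ep_cmp_ffd(struct epoll_filefd *p1,
                                 struct epoll_filefd *p2)
    {
        return (p1->file > p2->file ? +1 :
                (p1->file < p2->file ? -1 : p1->fd - p2->fd));
    }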
  139. /* Wait structure used by the poll hooks */
  140. struct eppoll_entry {
  141. /* List header used to link this structure to the "struct epitem" */
  142. struct list_head llink;
  143. /* The "base" pointer is set to the container "struct epitem" */
  144. struct epitem *base;
  145. /*
  146. * Wait queue item that will be linked to the target file wait
  147. * queue head.
  148. */
  149. wait_queue_t wait;
  150. /* The wait queue head that linked the "wait" wait queue item */
  151. wait_queue_head_t *whead;
  152. };
  153. /* Wrapper struct used by poll queueing */
  154. struct ep_pqueue {
  155. poll_table pt;
  156. struct epitem *epi;
  157. };
  158. /* Used by the ep_send_events() function as callback private data */
  159. struct ep_send_events_data {
  160. int maxevents;
  161. struct epoll_event __user *events;
  162. };
  163. /* --- 代码注释 --- */
  164. /* 你没看错, 这就是epoll_create()的真身, 基本啥也不干直接调用epoll_create1了,
  165. * 另外你也可以发现, size这个参数除了被要求必须大于0之外, 数值本身没有任何用处... */
  166. SYSCALL_DEFINE1(epoll_create, int, size)
  167. {
  168. if (size <= 0)
  169. return -EINVAL;
  170. return sys_epoll_create1(0);
  171. }
  172. /* 这才是真正的epoll_create啊~~ */
  173. SYSCALL_DEFINE1(epoll_create1, int, flags)
  174. {
  175. int error;
  176. struct eventpoll *ep = NULL;//主描述符
  177. /* Check the EPOLL_* constant for consistency. */
  178. /* 这是一个编译期检查, 确保EPOLL_CLOEXEC与O_CLOEXEC的值保持一致, 运行期没有任何开销 */
  179. BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
  180. /* 对于epoll来讲, 目前唯一有效的FLAG就是CLOEXEC */
  181. if (flags & ~EPOLL_CLOEXEC)
  182. return -EINVAL;
  183. /*
  184. * Create the internal data structure ("struct eventpoll").
  185. */
  186. /* 分配一个struct eventpoll, 分配和初始化细节我们随后深聊~ */
  187. error = ep_alloc(&ep);
  188. if (error < 0)
  189. return error;
  190. /*
  191. * Creates all the items needed to setup an eventpoll file. That is,
  192. * a file structure and a free file descriptor.
  193. */
  194. /* 这里是创建一个匿名fd, 说起来就话长了...长话短说:
  195. * epollfd本身并不存在一个真正的文件与之对应, 所以内核需要创建一个
  196. * "虚拟"的文件, 并为之分配真正的struct file结构, 而且有真正的fd.
  197. * 这里2个参数比较关键:
  198. * eventpoll_fops, fops就是file operations, 就是当你对这个文件(这里是虚拟的)进行操作(比如读)时,
  199. * fops里面的函数指针指向真正的操作实现, 类似C++里面虚函数和子类的概念.
  200. * epoll只实现了poll和release(就是close)操作, 其它文件系统操作都由VFS全权处理了.
  201. * ep, ep就是struct eventpoll, 它会作为私有数据保存在struct file的private_data指针里面.
  202. * 其实说白了, 就是为了能通过fd找到struct file, 通过struct file能找到eventpoll结构.
  203. * 如果懂一点Linux下字符设备驱动开发, 这里应该是很好理解的,
  204. * 推荐阅读 <Linux device driver 3rd>
  205. */
  206. error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
  207. O_RDWR | (flags & O_CLOEXEC));
  208. if (error < 0)
  209. ep_free(ep);
  210. return error;
  211. }
  212. /*
  213. * 创建好epollfd后, 接下来我们要往里面添加fd咯
  214. * 来看epoll_ctl
  215. * epfd 就是epollfd
  216. * op ADD,MOD,DEL
  217. * fd 需要监听的描述符
  218. * event 我们关心的events
  219. */
  220. SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
  221. struct epoll_event __user *, event)
  222. {
  223. int error;
  224. struct file *file, *tfile;
  225. struct eventpoll *ep;
  226. struct epitem *epi;
  227. struct epoll_event epds;
  228. error = -EFAULT;
  229. /*
  230. * 错误处理以及从用户空间将epoll_event结构copy到内核空间.
  231. */
  232. if (ep_op_has_event(op) &&
  233. copy_from_user(&epds, event, sizeof(struct epoll_event)))
  234. goto error_return;
  235. /* Get the "struct file *" for the eventpoll file */
  236. /* 取得struct file结构, epfd既然是真正的fd, 那么内核空间
  237. * 就会有与之对应的一个struct file结构
  238. * 这个结构在epoll_create1()中, 由函数anon_inode_getfd()分配 */
  239. error = -EBADF;
  240. file = fget(epfd);
  241. if (!file)
  242. goto error_return;
  243. /* Get the "struct file *" for the target file */
  244. /* 我们需要监听的fd, 它当然也有个struct file结构, 上下2个不要搞混了哦 */
  245. tfile = fget(fd);
  246. if (!tfile)
  247. goto error_fput;
  248. /* The target file descriptor must support poll */
  249. error = -EPERM;
  250. /* 如果监听的文件不支持poll, 那就没辙了.
  251. * 你知道什么情况下, 文件会不支持poll吗?
  252. */
  253. if (!tfile->f_op || !tfile->f_op->poll)
  254. goto error_tgt_fput;
  255. /*
  256. * We have to check that the file structure underneath the file descriptor
  257. * the user passed to us _is_ an eventpoll file. And also we do not permit
  258. * adding an epoll file descriptor inside itself.
  259. */
  260. error = -EINVAL;
  261. /* epoll不能自己监听自己... */
  262. if (file == tfile || !is_file_epoll(file))
  263. goto error_tgt_fput;
  264. /*
  265. * At this point it is safe to assume that the "private_data" contains
  266. * our own data structure.
  267. */
  268. /* 取到我们的eventpoll结构, 来自于epoll_create1()中的分配 */
  269. ep = file->private_data;
  270. /* 接下来的操作有可能修改数据结构内容, 锁之~ */
  271. mutex_lock(&ep->mtx);
  272. /*
  273. * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
  274. * above, we can be sure to be able to use the item looked up by
  275. * ep_find() till we release the mutex.
  276. */
  277. /* 对于每一个被监听的fd, 内核都会分配一个epitem结构,
  278. * 而且我们也知道, epoll是不允许重复添加fd的,
  279. * 所以我们首先查找该fd是不是已经存在了.
  280. * ep_find()其实就是RBTREE查找, 跟C++STL的map差不多一回事, O(lgn)的时间复杂度.
  281. */
  282. epi = ep_find(ep, tfile, fd);
  283. error = -EINVAL;
  284. switch (op) {
  285. /* 首先我们关心添加 */
  286. case EPOLL_CTL_ADD:
  287. if (!epi) {
  288. /* 之前的find没有找到有效的epitem, 证明是第一次插入, 接受!
  289. * 这里我们可以知道, POLLERR和POLLHUP事件内核总是会关心的
  290. * */
  291. epds.events |= POLLERR | POLLHUP;
  292. /* rbtree插入, 详情见ep_insert()的分析
  293. * 其实我觉得这里有insert的话, 之前的find应该
  294. * 是可以省掉的... */
  295. error = ep_insert(ep, &epds, tfile, fd);
  296. } else
  297. /* 找到了!? 重复添加! */
  298. error = -EEXIST;
  299. break;
  300. /* 删除和修改操作都比较简单 */
  301. case EPOLL_CTL_DEL:
  302. if (epi)
  303. error = ep_remove(ep, epi);
  304. else
  305. error = -ENOENT;
  306. break;
  307. case EPOLL_CTL_MOD:
  308. if (epi) {
  309. epds.events |= POLLERR | POLLHUP;
  310. error = ep_modify(ep, epi, &epds);
  311. } else
  312. error = -ENOENT;
  313. break;
  314. }
  315. mutex_unlock(&ep->mtx);
  316. error_tgt_fput:
  317. fput(tfile);
  318. error_fput:
  319. fput(file);
  320. error_return:
  321. return error;
  322. }
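对照上面的内核实现, 用户态调用epoll_ctl的典型写法大致如下(最小示意, 这里拿标准输入当被监听的fd, 省略了大部分错误处理):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/epoll.h>

    int main(void)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev;

        ev.events = EPOLLIN;         /* 内核还会自动帮你加上POLLERR|POLLHUP */
        ev.data.fd = STDIN_FILENO;   /* data完全由用户自定义, 内核会原样带回 */

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) < 0)
            perror("EPOLL_CTL_ADD");

        /* 修改关心的事件用EPOLL_CTL_MOD, 不再监听用EPOLL_CTL_DEL */
        epoll_ctl(epfd, EPOLL_CTL_DEL, STDIN_FILENO, NULL);

        close(epfd);
        return 0;
    }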
  323. /* 分配一个eventpoll结构 */
  324. static int ep_alloc(struct eventpoll **pep)
  325. {
  326. int error;
  327. struct user_struct *user;
  328. struct eventpoll *ep;
  329. /* 获取当前用户的一些信息, 比如是不是root啦, 最大监听fd数目啦 */
  330. user = get_current_user();
  331. error = -ENOMEM;
  332. ep = kzalloc(sizeof(*ep), GFP_KERNEL);
  333. if (unlikely(!ep))
  334. goto free_uid;
  335. /* 这些都是初始化啦 */
  336. spin_lock_init(&ep->lock);
  337. mutex_init(&ep->mtx);
  338. init_waitqueue_head(&ep->wq);
  339. init_waitqueue_head(&ep->poll_wait);
  340. INIT_LIST_HEAD(&ep->rdllist);
  341. ep->rbr = RB_ROOT;
  342. ep->ovflist = EP_UNACTIVE_PTR;
  343. ep->user = user;
  344. *pep = ep;
  345. return 0;
  346. free_uid:
  347. free_uid(user);
  348. return error;
  349. }
  350. /*
  351. * Must be called with "mtx" held.
  352. */
  353. /*
  354. * ep_insert()在epoll_ctl()中被调用, 完成往epollfd里面添加一个监听fd的工作
  355. * tfile是fd在内核态的struct file结构
  356. */
  357. static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
  358. struct file *tfile, int fd)
  359. {
  360. int error, revents, pwake = 0;
  361. unsigned long flags;
  362. struct epitem *epi;
  363. struct ep_pqueue epq;
  364. /* 查看是否达到当前用户的最大监听数 */
  365. if (unlikely(atomic_read(&ep->user->epoll_watches) >=
  366. max_user_watches))
  367. return -ENOSPC;
  368. /* 从著名的slab中分配一个epitem */
  369. if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
  370. return -ENOMEM;
  371. /* Item initialization follow here ... */
  372. /* 这些都是相关成员的初始化... */
  373. INIT_LIST_HEAD(&epi->rdllink);
  374. INIT_LIST_HEAD(&epi->fllink);
  375. INIT_LIST_HEAD(&epi->pwqlist);
  376. epi->ep = ep;
  377. /* 这里保存了我们需要监听的文件fd和它的file结构 */
  378. ep_set_ffd(&epi->ffd, tfile, fd);
  379. epi->event = *event;
  380. epi->nwait = 0;
  381. /* 这个指针的初值不是NULL哦... */
  382. epi->next = EP_UNACTIVE_PTR;
  383. /* Initialize the poll table using the queue callback */
  384. /* 好, 我们终于要进入到poll的正题了 */
  385. epq.epi = epi;
  386. /* 初始化一个poll_table
  387. * 其实就是指定调用poll_wait(注意不是epoll_wait!!!)时的回调函数,和我们关心哪些events,
  388. * ep_ptable_queue_proc()就是我们的回调啦, 初值是所有event都关心 */
  389. init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
  390. /*
  391. * Attach the item to the poll hooks and get current event bits.
  392. * We can safely use the file* here because its usage count has
  393. * been increased by the caller of this function. Note that after
  394. * this operation completes, the poll callback can start hitting
  395. * the new item.
  396. */
  397. /* 这一步很关键, 也比较难懂, 完全是内核的poll机制导致的...
  398. * 首先, f_op->poll()一般来说只是个wrapper, 它会调用真正的poll实现,
  399. * 拿UDP的socket来举例, 这里就是这样的调用流程: f_op->poll(), sock_poll(),
  400. * udp_poll(), datagram_poll(), sock_poll_wait(), 最后调用到我们上面指定的
  401. * ep_ptable_queue_proc()这个回调函数...(好深的调用路径...).
  402. * 完成这一步, 我们的epitem就跟这个socket关联起来了, 当它有状态变化时,
  403. * 会通过ep_poll_callback()来通知.
  404. * 最后, 这个函数还会查询当前的fd是不是已经有啥event已经ready了, 有的话
  405. * 会将event返回. */
  406. revents = tfile->f_op->poll(tfile, &epq.pt);
  407. /*
  408. * We have to check if something went wrong during the poll wait queue
  409. * install process. Namely an allocation for a wait queue failed due
  410. * high memory pressure.
  411. */
  412. error = -ENOMEM;
  413. if (epi->nwait < 0)
  414. goto error_unregister;
  415. /* Add the current item to the list of active epoll hook for this file */
  416. /* 每个struct file都会把所有监听自己的epitem链到自己的f_ep_links链表上 */
  417. spin_lock(&tfile->f_lock);
  418. list_add_tail(&epi->fllink, &tfile->f_ep_links);
  419. spin_unlock(&tfile->f_lock);
  420. /*
  421. * Add the current item to the RB tree. All RB tree operations are
  422. * protected by "mtx", and ep_insert() is called with "mtx" held.
  423. */
  424. /* 都搞定后, 将epitem插入到对应的eventpoll中去 */
  425. ep_rbtree_insert(ep, epi);
  426. /* We have to drop the new item inside our item list to keep track of it */
  427. spin_lock_irqsave(&ep->lock, flags);
  428. /* If the file is already "ready" we drop it inside the ready list */
  429. /* 到达这里后, 如果我们监听的fd已经有事件发生, 那就要处理一下 */
  430. if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
  431. /* 将当前的epitem加入到ready list中去 */
  432. list_add_tail(&epi->rdllink, &ep->rdllist);
  433. /* Notify waiting tasks that events are available */
  434. /* 谁在epoll_wait, 就唤醒它... */
  435. if (waitqueue_active(&ep->wq))
  436. wake_up_locked(&ep->wq);
  437. /* 谁在epoll当前的epollfd, 也唤醒它... */
  438. if (waitqueue_active(&ep->poll_wait))
  439. pwake++;
  440. }
  441. spin_unlock_irqrestore(&ep->lock, flags);
  442. atomic_inc(&ep->user->epoll_watches);
  443. /* We have to call this outside the lock */
  444. if (pwake)
  445. ep_poll_safewake(&ep->poll_wait);
  446. return 0;
  447. error_unregister:
  448. ep_unregister_pollwait(ep, epi);
  449. /*
  450. * We need to do this because an event could have been arrived on some
  451. * allocated wait queue. Note that we don't care about the ep->ovflist
  452. * list, since that is used/cleaned only inside a section bound by "mtx".
  453. * And ep_insert() is called with "mtx" held.
  454. */
  455. spin_lock_irqsave(&ep->lock, flags);
  456. if (ep_is_linked(&epi->rdllink))
  457. list_del_init(&epi->rdllink);
  458. spin_unlock_irqrestore(&ep->lock, flags);
  459. kmem_cache_free(epi_cache, epi);
  460. return error;
  461. }
  462. /*
  463. * This is the callback that is used to add our wait queue to the
  464. * target file wakeup lists.
  465. */
  466. /*
  467. * 该函数在调用f_op->poll()时会被调用.
  468. * 也就是epoll主动poll某个fd时, 用来将epitem与指定的fd关联起来的.
  469. * 关联的办法就是使用等待队列(waitqueue)
  470. */
  471. static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
  472. poll_table *pt)
  473. {
  474. struct epitem *epi = ep_item_from_epqueue(pt);
  475. struct eppoll_entry *pwq;
  476. if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
  477. /* 初始化等待队列, 指定ep_poll_callback为唤醒时的回调函数,
  478. * 当我们监听的fd发生状态改变时, 也就是队列头被唤醒时,
  479. * 指定的回调函数将会被调用. */
  480. init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
  481. pwq->whead = whead;
  482. pwq->base = epi;
  483. /* 将刚分配的等待队列成员加入到头中, 头是由fd持有的 */
  484. add_wait_queue(whead, &pwq->wait);
  485. list_add_tail(&pwq->llink, &epi->pwqlist);
  486. /* nwait记录了当前epitem加入到了多少个等待队列中,
  487. * 我认为这个值最大也只会是1... */
  488. epi->nwait++;
  489. } else {
  490. /* We have to signal that an error occurred */
  491. epi->nwait = -1;
  492. }
  493. }
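顺便说一下, 稍后的ep_poll_callback()拿到的只是wait这个等待队列成员, 它之所以能找回epitem, 是因为wait内嵌在eppoll_entry里, 用container_of倒推即可. ep_item_from_wait()大致就是这样实现的(凭印象给出, 以具体内核版本为准):

    /* 由等待队列成员反推出所属的epitem */
    static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
    {
        return container_of(p, struct eppoll_entry, wait)->base;
    }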
  494. /*
  495. * This is the callback that is passed to the wait queue wakeup
  496. * machanism. It is called by the stored file descriptors when they
  497. * have events to report.
  498. */
  499. /*
  500. * 这个是关键性的回调函数, 当我们监听的fd发生状态改变时, 它会被调用.
  501. * 参数key被当作一个unsigned long整数使用, 携带的是events.
  502. */
  503. static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
  504. {
  505. int pwake = 0;
  506. unsigned long flags;
  507. struct epitem *epi = ep_item_from_wait(wait);//从等待队列成员反推出对应的epitem, 也就是知道是哪个被监听的fd就绪了
  508. struct eventpoll *ep = epi->ep;//再取得该epitem所属的eventpoll
  509. spin_lock_irqsave(&ep->lock, flags);
  510. /*
  511. * If the event mask does not contain any poll(2) event, we consider the
  512. * descriptor to be disabled. This condition is likely the effect of the
  513. * EPOLLONESHOT bit that disables the descriptor when an event is received,
  514. * until the next EPOLL_CTL_MOD will be issued.
  515. */
  516. if (!(epi->event.events & ~EP_PRIVATE_BITS))
  517. goto out_unlock;
  518. /*
  519. * Check the events coming with the callback. At this stage, not
  520. * every device reports the events in the "key" parameter of the
  521. * callback. We need to be able to handle both cases here, hence the
  522. * test for "key" != NULL before the event match test.
  523. */
  524. /* 没有我们关心的event... */
  525. if (key && !((unsigned long) key & epi->event.events))
  526. goto out_unlock;
  527. /*
  528. * If we are trasfering events to userspace, we can hold no locks
  529. * (because we're accessing user memory, and because of linux f_op->poll()
  530. * semantics). All the events that happens during that period of time are
  531. * chained in ep->ovflist and requeued later on.
  532. */
  533. /*
  534. * 这里看起来可能有点费解, 其实干的事情比较简单:
  535. * 如果该callback被调用的同时, epoll_wait()已经返回了,
  536. * 也就是说, 此刻应用程序有可能已经在循环获取events,
  537. * 这种情况下, 内核将此刻发生event的epitem用一个单独的链表
  538. * 链起来, 不发给应用程序, 也不丢弃, 而是在下一次epoll_wait
  539. * 时返回给用户.
  540. */
  541. if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
  542. if (epi->next == EP_UNACTIVE_PTR) {
  543. epi->next = ep->ovflist;
  544. ep->ovflist = epi;
  545. }
  546. goto out_unlock;
  547. }
  548. /* If this file is already in the ready list we exit soon */
  549. /* 将当前的epitem放入ready list */
  550. if (!ep_is_linked(&epi->rdllink))
  551. list_add_tail(&epi->rdllink, &ep->rdllist);
  552. /*
  553. * Wake up ( if active ) both the eventpoll wait list and the ->poll()
  554. * wait list.
  555. */
  556. /* 唤醒epoll_wait... */
  557. if (waitqueue_active(&ep->wq))
  558. wake_up_locked(&ep->wq);
  559. /* 如果epollfd也在被poll, 那就唤醒队列里面的所有成员. */
  560. if (waitqueue_active(&ep->poll_wait))
  561. pwake++;
  562. out_unlock:
  563. spin_unlock_irqrestore(&ep->lock, flags);
  564. /* We have to call this outside the lock */
  565. if (pwake)
  566. ep_poll_safewake(&ep->poll_wait);
  567. return 1;
  568. }
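作为参照, "资源生产者"那一侧触发这个回调的方式, 概念上就是对自己持有的等待队列头做一次带key的唤醒. 下面沿用前面demo_dev的假设写个示意(并非某个驱动的真实代码; wake_up_interruptible_poll是内核提供的宏, key里带上就绪的events):

    /* 设备/协议栈在数据就绪时大致这么做:
     * 唤醒自己的等待队列头, 队列上每个成员的回调
     * (对epoll来说就是ep_poll_callback)随之被调用 */
    dev->data_ready = 1;
    wake_up_interruptible_poll(&dev->inq, POLLIN | POLLRDNORM);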
  569. /*
  570. * Implement the event wait interface for the eventpoll file. It is the kernel
  571. * part of the user space epoll_wait(2).
  572. */
  573. SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
  574. int, maxevents, int, timeout)
  575. {
  576. int error;
  577. struct file *file;
  578. struct eventpoll *ep;
  579. /* The maximum number of event must be greater than zero */
  580. if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
  581. return -EINVAL;
  582. /* Verify that the area passed by the user is writeable */
  583. /* 这个地方有必要说明一下:
  584. * 内核对应用程序采取的策略是"绝对不信任",
  585. * 所以内核跟应用程序之间的数据交互大都是copy, 不允许(有时候也是不能...)指针引用.
  586. * epoll_wait()需要内核返回数据给用户空间, 内存由用户程序提供,
  587. * 所以内核会用一些手段来验证这一段内存空间是不是有效的.
  588. */
  589. if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event))) {
  590. error = -EFAULT;
  591. goto error_return;
  592. }
  593. /* Get the "struct file *" for the eventpoll file */
  594. error = -EBADF;
  595. /* 获取epollfd的struct file, epollfd也是文件嘛 */
  596. file = fget(epfd);
  597. if (!file)
  598. goto error_return;
  599. /*
  600. * We have to check that the file structure underneath the fd
  601. * the user passed to us _is_ an eventpoll file.
  602. */
  603. error = -EINVAL;
  604. /* 检查一下它是不是一个真正的epollfd... */
  605. if (!is_file_epoll(file))
  606. goto error_fput;
  607. /*
  608. * At this point it is safe to assume that the "private_data" contains
  609. * our own data structure.
  610. */
  611. /* 获取eventpoll结构 */
  612. ep = file->private_data;
  613. /* Time to fish for events ... */
  614. /* OK, 睡觉, 等待事件到来~~ */
  615. error = ep_poll(ep, events, maxevents, timeout);
  616. error_fput:
  617. fput(file);
  618. error_return:
  619. return error;
  620. }
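对应地, 用户态epoll_wait的典型事件循环大致长这样(最小示意, 假设epfd已经按前面epoll_ctl的例子添加好了要监听的fd):

    #include <stdio.h>
    #include <sys/epoll.h>

    #define MAX_EVENTS 64

    static void event_loop(int epfd)
    {
        struct epoll_event events[MAX_EVENTS];

        for (;;) {
            /* timeout为-1表示一直睡到有事件; 返回值n是就绪fd的个数 */
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            if (n < 0) {
                perror("epoll_wait"); /* 被信号打断时errno为EINTR, 通常选择继续 */
                continue;
            }
            for (int i = 0; i < n; i++)
                printf("fd %d ready, events 0x%x\n",
                       events[i].data.fd, events[i].events);
        }
    }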
  621. /* 这个函数真正将执行epoll_wait的进程带入睡眠状态... */
  622. static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
  623. int maxevents, long timeout)
  624. {
  625. int res, eavail;
  626. unsigned long flags;
  627. long jtimeout;
  628. wait_queue_t wait;//等待队列
  629. /*
  630. * Calculate the timeout by checking for the "infinite" value (-1)
  631. * and the overflow condition. The passed timeout is in milliseconds,
  632. * that why (t * HZ) / 1000.
  633. */
  634. /* 计算睡觉时间, 毫秒要转换为HZ */
  635. jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
  636. MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000;
  637. retry:
  638. spin_lock_irqsave(&ep->lock, flags);
  639. res = 0;
  640. /* 如果ready list不为空, 就不睡了, 直接干活... */
  641. if (list_empty(&ep->rdllist)) {
  642. /*
  643. * We don't have any available event to return to the caller.
  644. * We need to sleep here, and we will be wake up by
  645. * ep_poll_callback() when events will become available.
  646. */
  647. /* OK, 初始化一个等待队列, 准备直接把自己挂起,
  648. * 注意current是一个宏, 代表当前进程 */
  649. init_waitqueue_entry(&wait, current);//初始化等待队列,wait表示当前进程
  650. __add_wait_queue_exclusive(&ep->wq, &wait);//挂载到ep结构的等待队列
  651. for (;;) {
  652. /*
  653. * We don't want to sleep if the ep_poll_callback() sends us
  654. * a wakeup in between. That's why we set the task state
  655. * to TASK_INTERRUPTIBLE before doing the checks.
  656. */
  657. /* 将当前进程设置为睡眠, 但是可以被信号唤醒的状态,
  658. * 注意这个设置是"将来时", 我们此刻还没睡! */
  659. set_current_state(TASK_INTERRUPTIBLE);
  660. /* 如果这个时候, ready list里面有成员了,
  661. * 或者睡眠时间已经过了, 就直接不睡了... */
  662. if (!list_empty(&ep->rdllist) || !jtimeout)
  663. break;
  664. /* 如果有信号产生, 也起床... */
  665. if (signal_pending(current)) {
  666. res = -EINTR;
  667. break;
  668. }
  669. /* 啥事都没有,解锁, 睡觉... */
  670. spin_unlock_irqrestore(&ep->lock, flags);
  671. /* jtimeout这个时间后, 会被唤醒,
  672. * ep_poll_callback()如果此时被调用,
  673. * 那么我们就会直接被唤醒, 不用等时间了...
  674. * 再次强调一下ep_poll_callback()的调用时机是由被监听的fd
  675. * 的具体实现, 比如socket或者某个设备驱动来决定的,
  676. * 因为等待队列头是他们持有的, epoll和当前进程
  677. * 只是单纯的等待...
  678. **/
  679. jtimeout = schedule_timeout(jtimeout);//睡觉
  680. spin_lock_irqsave(&ep->lock, flags);
  681. }
  682. __remove_wait_queue(&ep->wq, &wait);
  683. /* OK 我们醒来了... */
  684. set_current_state(TASK_RUNNING);
  685. }
  686. /* Is it worth to try to dig for events ? */
  687. eavail = !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
  688. spin_unlock_irqrestore(&ep->lock, flags);
  689. /*
  690. * Try to transfer events to user space. In case we get 0 events and
  691. * there's still timeout left over, we go trying again in search of
  692. * more luck.
  693. */
  694. /* 如果一切正常, 有event发生, 就开始准备数据copy给用户空间了... */
  695. if (!res && eavail &&
  696. !(res = ep_send_events(ep, events, maxevents)) && jtimeout)
  697. goto retry;
  698. return res;
  699. }
  700. /* 这个简单, 我们直奔下一个... */
  701. static int ep_send_events(struct eventpoll *ep,
  702. struct epoll_event __user *events, int maxevents)
  703. {
  704. struct ep_send_events_data esed;
  705. esed.maxevents = maxevents;
  706. esed.events = events;
  707. return ep_scan_ready_list(ep, ep_send_events_proc, &esed);
  708. }
  709. /**
  710. * ep_scan_ready_list - Scans the ready list in a way that makes possible for
  711. * the scan code, to call f_op->poll(). Also allows for
  712. * O(NumReady) performance.
  713. *
  714. * @ep: Pointer to the epoll private data structure.
  715. * @sproc: Pointer to the scan callback.
  716. * @priv: Private opaque data passed to the @sproc callback.
  717. *
  718. * Returns: The same integer error code returned by the @sproc callback.
  719. */
  720. static int ep_scan_ready_list(struct eventpoll *ep,
  721. int (*sproc)(struct eventpoll *,
  722. struct list_head *, void *),
  723. void *priv)
  724. {
  725. int error, pwake = 0;
  726. unsigned long flags;
  727. struct epitem *epi, *nepi;
  728. LIST_HEAD(txlist);
  729. /*
  730. * We need to lock this because we could be hit by
  731. * eventpoll_release_file() and epoll_ctl().
  732. */
  733. mutex_lock(&ep->mtx);
  734. /*
  735. * Steal the ready list, and re-init the original one to the
  736. * empty list. Also, set ep->ovflist to NULL so that events
  737. * happening while looping w/out locks, are not lost. We cannot
  738. * have the poll callback to queue directly on ep->rdllist,
  739. * because we want the "sproc" callback to be able to do it
  740. * in a lockless way.
  741. */
  742. spin_lock_irqsave(&ep->lock, flags);
  743. /* 这一步要注意, 首先, 所有监听到events的epitem都链到rdllist上了,
  744. * 但是这一步之后, 所有的epitem都转移到了txlist上, 而rdllist被清空了,
  745. * 要注意哦, rdllist已经被清空了! */
  746. list_splice_init(&ep->rdllist, &txlist);
  747. /* ovflist, 在ep_poll_callback()里面我解释过, 此时此刻我们不希望
  748. * 有新的event加入到ready list中了, 保存后下次再处理... */
  749. ep->ovflist = NULL;
  750. spin_unlock_irqrestore(&ep->lock, flags);
  751. /*
  752. * Now call the callback function.
  753. */
  754. /* 在这个回调函数里面处理每个epitem
  755. * sproc 就是 ep_send_events_proc, 下面会注释到. */
  756. error = (*sproc)(ep, &txlist, priv);
  757. spin_lock_irqsave(&ep->lock, flags);
  758. /*
  759. * During the time we spent inside the "sproc" callback, some
  760. * other events might have been queued by the poll callback.
  761. * We re-insert them inside the main ready-list here.
  762. */
  763. /* 现在我们来处理ovflist, 这些epitem都是我们在传递数据给用户空间时
  764. * 监听到了事件. */
  765. for (nepi = ep->ovflist; (epi = nepi) != NULL;
  766. nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
  767. /*
  768. * We need to check if the item is already in the list.
  769. * During the "sproc" callback execution time, items are
  770. * queued into ->ovflist but the "txlist" might already
  771. * contain them, and the list_splice() below takes care of them.
  772. */
  773. /* 将这些直接放入readylist */
  774. if (!ep_is_linked(&epi->rdllink))
  775. list_add_tail(&epi->rdllink, &ep->rdllist);
  776. }
  777. /*
  778. * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
  779. * releasing the lock, events will be queued in the normal way inside
  780. * ep->rdllist.
  781. */
  782. ep->ovflist = EP_UNACTIVE_PTR;
  783. /*
  784. * Quickly re-inject items left on "txlist".
  785. */
  786. /* 这一次没有处理完的epitem(比如超出maxevents的那些), 重新插入到ready list */
  787. list_splice(&txlist, &ep->rdllist);
  788. /* ready list不为空, 直接唤醒... */
  789. if (!list_empty(&ep->rdllist)) {
  790. /*
  791. * Wake up (if active) both the eventpoll wait list and
  792. * the ->poll() wait list (delayed after we release the lock).
  793. */
  794. if (waitqueue_active(&ep->wq))
  795. wake_up_locked(&ep->wq);
  796. if (waitqueue_active(&ep->poll_wait))
  797. pwake++;
  798. }
  799. spin_unlock_irqrestore(&ep->lock, flags);
  800. mutex_unlock(&ep->mtx);
  801. /* We have to call this outside the lock */
  802. if (pwake)
  803. ep_poll_safewake(&ep->poll_wait);
  804. return error;
  805. }
  806. /* 该函数作为callback在ep_scan_ready_list()中被调用
  807. * head是一个链表, 包含了已经ready的epitem,
  808. * 这个不是eventpoll里面的ready list, 而是上面函数中的txlist.
  809. */
  810. static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
  811. void *priv)
  812. {
  813. struct ep_send_events_data *esed = priv;
  814. int eventcnt;
  815. unsigned int revents;
  816. struct epitem *epi;
  817. struct epoll_event __user *uevent;
  818. /*
  819. * We can loop without lock because we are passed a task private list.
  820. * Items cannot vanish during the loop because ep_scan_ready_list() is
  821. * holding "mtx" during this call.
  822. */
  823. /* 扫描整个链表... */
  824. for (eventcnt = 0, uevent = esed->events;
  825. !list_empty(head) && eventcnt < esed->maxevents;) {
  826. /* 取出第一个成员 */
  827. epi = list_first_entry(head, struct epitem, rdllink);
  828. /* 然后从链表里面移除 */
  829. list_del_init(&epi->rdllink);
  830. /* 读取events,
  831. * 注意events我们ep_poll_callback()里面已经取过一次了, 为啥还要再取?
  832. * 1. 我们当然希望能拿到此刻的最新数据, events是会变的~
  833. * 2. 不是所有的poll实现, 都通过等待队列传递了events, 有可能某些驱动压根没传
  834. * 必须主动去读取. */
  835. revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
  836. epi->event.events;
  837. /*
  838. * If the event mask intersect the caller-requested one,
  839. * deliver the event to userspace. Again, ep_scan_ready_list()
  840. * is holding "mtx", so no operations coming from userspace
  841. * can change the item.
  842. */
  843. if (revents) {
  844. /* 将当前的事件和用户传入的数据都copy给用户空间,
  845. * 就是epoll_wait()后应用程序能读到的那一堆数据. */
  846. if (__put_user(revents, &uevent->events) ||
  847. __put_user(epi->event.data, &uevent->data)) {
  848. /* 如果copy过程中发生错误, 会中断链表的扫描,
  849. * 并把当前发生错误的epitem重新插入到ready list.
  850. * 剩下的没处理的epitem也不会丢弃, 在ep_scan_ready_list()
  851. * 中它们也会被重新插入到ready list */
  852. list_add(&epi->rdllink, head);
  853. return eventcnt ? eventcnt : -EFAULT;
  854. }
  855. eventcnt++;
  856. uevent++;
  857. if (epi->event.events & EPOLLONESHOT)
  858. epi->event.events &= EP_PRIVATE_BITS;
  859. else if (!(epi->event.events & EPOLLET)) {
  860. /*
  861. * If this file has been added with Level
  862. * Trigger mode, we need to insert back inside
  863. * the ready list, so that the next call to
  864. * epoll_wait() will check again the events
  865. * availability. At this point, noone can insert
  866. * into ep->rdllist besides us. The epoll_ctl()
  867. * callers are locked out by
  868. * ep_scan_ready_list() holding "mtx" and the
  869. * poll callback will queue them in ep->ovflist.
  870. */
  871. /* 嘿嘿, EPOLLET和非ET的区别就在这一步之差呀~
  872. * 如果是ET, epitem是不会再进入到ready list,
  873. * 除非fd再次发生了状态改变, ep_poll_callback被调用.
  874. * 如果是非ET, 不管你还有没有有效的事件或者数据,
  875. * 都会被重新插入到ready list, 再下一次epoll_wait
  876. * 时, 会立即返回, 并通知给用户空间. 当然如果这个
  877. * 被监听的fds确实没事件也没数据了, epoll_wait会返回一个0,
  878. * 空转一次.
  879. */
  880. list_add_tail(&epi->rdllink, &ep->rdllist);
  881. }
  882. }
  883. }
  884. return eventcnt;
  885. }
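上面ET和LT的差别落到用户态就是: ET模式下拿到可读事件后必须把数据读干净(一直读到EAGAIN为止), 否则在fd再次发生状态变化之前不会再收到通知; LT模式则会反复提醒你. ET下常见的读法示意如下(假设fd已设为非阻塞):

    #include <errno.h>
    #include <unistd.h>

    /* ET模式的常见读法: 循环read直到EAGAIN, 把本次就绪的数据全部取走 */
    static void drain_fd(int fd)
    {
        char buf[4096];

        for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n > 0)
                continue;               /* 处理buf中的n字节数据后继续读 */
            if (n == 0)
                break;                  /* 对端关闭 */
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;                  /* 数据已读完, 等下一次通知 */
            if (errno == EINTR)
                continue;               /* 被信号打断, 重试 */
            break;                      /* 其它错误 */
        }
    }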
  886. /* ep_free在epollfd被close时调用,
  887. * 释放一些资源而已, 比较简单 */
  888. static void ep_free(struct eventpoll *ep)
  889. {
  890. struct rb_node *rbp;
  891. struct epitem *epi;
  892. /* We need to release all tasks waiting for these file */
  893. if (waitqueue_active(&ep->poll_wait))
  894. ep_poll_safewake(&ep->poll_wait);
  895. /*
  896. * We need to lock this because we could be hit by
  897. * eventpoll_release_file() while we're freeing the "struct eventpoll".
  898. * We do not need to hold "ep->mtx" here because the epoll file
  899. * is on the way to be removed and no one has references to it
  900. * anymore. The only hit might come from eventpoll_release_file() but
  901. * holding "epmutex" is sufficent here.
  902. */
  903. mutex_lock(&epmutex);
  904. /*
  905. * Walks through the whole tree by unregistering poll callbacks.
  906. */
  907. for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
  908. epi = rb_entry(rbp, struct epitem, rbn);
  909. ep_unregister_pollwait(ep, epi);
  910. }
  911. /*
  912. * Walks through the whole tree by freeing each "struct epitem". At this
  913. * point we are sure no poll callbacks will be lingering around, and also by
  914. * holding "epmutex" we can be sure that no file cleanup code will hit
  915. * us during this operation. So we can avoid the lock on "ep->lock".
  916. */
  917. /* 之所以在关闭epollfd之前不需要调用epoll_ctl移除已经添加的fd,
  918. * 是因为这里已经做了... */
  919. while ((rbp = rb_first(&ep->rbr)) != NULL) {
  920. epi = rb_entry(rbp, struct epitem, rbn);
  921. ep_remove(ep, epi);
  922. }
  923. mutex_unlock(&epmutex);
  924. mutex_destroy(&ep->mtx);
  925. free_uid(ep->user);
  926. kfree(ep);
  927. }
  928. /* File callbacks that implement the eventpoll file behaviour */
  929. static const struct file_operations eventpoll_fops = {
  930. .release = ep_eventpoll_release,
  931. .poll = ep_eventpoll_poll
  932. };
  933. /* Fast test to see if the file is an eventpoll file */
  934. static inline int is_file_epoll(struct file *f)
  935. {
  936. return f->f_op == &eventpoll_fops;
  937. }
  938. /* OK, eventpoll我认为比较重要的函数都注释完了... */
