Marek's totally not insane idea of the day

Epoll is fundamentally broken 1/2

I/O multiplexing part #3

20 February 2017

https://idea.popcount.org/

Previous articles in this series covered select(2). This time we'll focus on its Linux successor - the epoll(2) I/O multiplexing syscall.

Epoll is relatively young. It was created by Davide Libenzi in 2002. For comparison: Windows did IOCP in 1994 and FreeBSD's kqueue was introduced in July 2000. Unfortunately, even though epoll is the youngest in the advanced I/O multiplexing family, it's the worst of the bunch.

Comparison with /dev/poll

Bryan Cantrill of Joyent is known for bashing epoll(). Here's one of the more entertaining interviews (the video is embedded in the original post).

He mentions two defects.

First he describes "a fatal flaw, that is subtle" in the Solaris /dev/poll model. He starts by describing the "thundering herd" problem (which we discussed earlier). Then he moves on to the real issue: in a multithreaded scenario, when the /dev/poll descriptor is shared, it is impossible to deliver events on one file descriptor to precisely one worker thread. He explains that neither band-aids on the level-triggered /dev/poll model nor naive edge-triggering will work in the multithreaded case.

This argument is indeed subtle, but since epoll has semantics close to /dev/poll, it's safe to say it wasn't designed to work in multithreaded scenarios.

In the video Mr Cantrill raises a second argument against epoll: the events registered in epoll aren't associated with a file descriptor, but with the underlying kernel object referred to by the file descriptor (let's call this the file description). He mentions the "stunning" effect of forking and closing an fd. We will leave this problem for now and describe it in another blog post.

Why the critique?

Most of the epoll critique is based on two fundamental design issues:

1) Sometimes it is desirable to scale an application by using multithreading. Early implementations of epoll did not support this; the EPOLLONESHOT and EPOLLEXCLUSIVE flags were added later to address it.

2) Epoll registers the file description, the kernel data structure, not the file descriptor, the userspace handle pointing to it.

The debate is heated because it's technically possible to avoid both pitfalls with careful defensive programming. If you can, avoid using epoll for load balancing across threads. Avoid sharing an epoll file descriptor across threads. Avoid sharing epoll-registered file descriptors. Avoid forking, and if you must, close all epoll-registered file descriptors before calling execve. Explicitly deregister affected file descriptors from the epoll set before calling dup/dup2/dup3 or close.

If you have simple code and follow the advice above you might be fine. The problem starts when your epoll program gets complex.

Let's dig deeper. In this blog post I'll focus on the load balancing argument.

Load balancing

There are two distinct load balancing scenarios:

  • scaling out accept() calls for a single bound TCP socket
  • scaling usual read() calls for large number of connected sockets

Scaling out accept()

Sometimes it's necessary to serve lots of very short TCP connections. A high throughput HTTP 1.0 server is one such example. Since the rate of inbound connections is high, you want to distribute the work of accept()ing connections across multiple CPUs.

This is a real problem happening in large deployments. Tom Herbert reported an application handling 40k connections per second. With such a volume it does make sense to spread the work across cores.

But it's not that simple. Up until kernel 4.5 it wasn't possible to use epoll to scale out accepts.

Level triggered - unnecessary wake-up

A naive solution is to have a single epoll file descriptor shared across worker threads. This won't work well. Neither will sharing the bound socket's file descriptor and registering it in a separate epoll instance in each thread.

This is because "level triggered" (aka: normal) epoll inherits the "thundering herd" semantics from select(). Without special flags, in level-triggered mode, all the workers will be woken up on each and every new connection. Here's an example:

  1. Kernel: Receives a new connection.
  2. Kernel: Notifies two waiting threads A and B. Due to "thundering herd" behavior with level-triggered notifications the kernel must wake up both.
  3. Thread A: Finishes epoll_wait().
  4. Thread B: Finishes epoll_wait().
  5. Thread A: Performs accept(), this succeeds.
  6. Thread B: Performs accept(), this fails with EAGAIN.

Waking up "Thread B" was completely unnecessary and wastes precious resources. Epoll in level-triggered mode scales out poorly.
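The sequence above can be replayed in a few lines. This is a minimal sketch, assuming Linux and Python's select.epoll wrapper around the same syscalls; the function name is hypothetical, and the two "threads" are simulated sequentially with one epoll instance each so that the interleaving is deterministic.

```python
import select
import socket


def level_triggered_herd():
    """Replay the wake-up sequence: two epoll instances, one listener."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(16)
    listener.setblocking(False)

    # "Thread A" and "Thread B" each own an epoll instance watching the listener.
    ep_a, ep_b = select.epoll(), select.epoll()
    ep_a.register(listener.fileno(), select.EPOLLIN)  # level-triggered
    ep_b.register(listener.fileno(), select.EPOLLIN)

    client = socket.create_connection(listener.getsockname())

    events_a = ep_a.poll(1.0)  # A is told the listener is readable
    events_b = ep_b.poll(1.0)  # so is B, for the very same connection

    conn, _ = listener.accept()  # A performs accept(), this succeeds
    try:
        listener.accept()  # B performs accept() on an empty queue
        b_result = "accepted"
    except BlockingIOError:  # this is how Python reports EAGAIN
        b_result = "EAGAIN"

    for sock in (conn, client, listener):
        sock.close()
    ep_a.close()
    ep_b.close()
    return len(events_a), len(events_b), b_result


if __name__ == "__main__":
    print(level_triggered_herd())
```

Both instances report the event, but only one accept() can succeed; the other wake-up is pure overhead.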

Edge triggered - unnecessary wake-up and starvation

Okay, since we ruled out the naive level-triggered setup, maybe "edge triggered" could do better?

Not really. Here is a possible pessimistic run:

  1. Kernel: Receives first connection. Two threads, A and B, are waiting. Due to "edge triggered" behavior only one is notified - let's say thread A.
  2. Thread A: Finishes epoll_wait().
  3. Thread A: Performs accept(), this succeeds.
  4. Kernel: The accept queue is empty, the edge-triggered socket moved from "readable" to "non readable", so the kernel must re-arm it.
  5. Kernel: Receives a second connection.
  6. Kernel: Only one thread is currently waiting on the epoll_wait(). Kernel wakes up Thread B.
  7. Thread A: Must perform accept() since it does not know if kernel received one or more connections originally. It hopes to get EAGAIN, but gets another socket.
  8. Thread B: Performs accept(), receives EAGAIN. This thread is confused.
  9. Thread A: Must perform accept() again, gets EAGAIN.

The wake-up of Thread B was completely unnecessary and is confusing. Additionally, in edge-triggered mode it's hard to avoid starvation:

  1. Kernel: Receives two connections. Two threads, A and B, are waiting. Due to "edge triggered" behavior only one is notified - let's say thread A.
  2. Thread A: Finishes epoll_wait().
  3. Thread A: Performs accept(), this succeeds.
  4. Kernel: Receives third connection. The socket was "readable", still is "readable". Since we are in "edge triggered" mode, no event is emitted.
  5. Thread A: Must perform accept(), hopes to get EAGAIN, but gets another socket.
  6. Kernel: Receives fourth connection.
  7. Thread A: Must perform accept(), hopes to get EAGAIN, but gets another socket.

In this case the socket moved only once from the "non-readable" to the "readable" state. Since the socket is in edge-triggered mode, the kernel will wake up epoll exactly once. All the connections will be received by Thread A and no load balancing is achieved.
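A small sketch of why the woken thread has to keep calling accept() until EAGAIN. It assumes Linux and Python's select.epoll; the function name and the 0.1s settling delay are illustrative choices. Two connections are queued before anyone waits, so a single edge-triggered event has to cover both.

```python
import select
import socket
import time


def edge_triggered_single_event():
    """Two queued connections, one edge-triggered event."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(16)
    listener.setblocking(False)

    ep = select.epoll()
    ep.register(listener.fileno(), select.EPOLLIN | select.EPOLLET)

    # Both connections arrive before anyone calls epoll_wait().
    addr = listener.getsockname()
    clients = [socket.create_connection(addr) for _ in range(2)]
    time.sleep(0.1)  # assumption: ample time for both handshakes on loopback

    first = ep.poll(1.0)  # a single event is reported for both connections
    conn, _ = listener.accept()
    second = ep.poll(0.2)  # nothing new, although one connection is still queued

    # The woken thread therefore has to drain the queue until EAGAIN.
    accepted = [conn]
    while True:
        try:
            conn, _ = listener.accept()
        except BlockingIOError:
            break
        accepted.append(conn)

    for sock in accepted + clients + [listener]:
        sock.close()
    ep.close()
    return len(first), len(second), len(accepted)


if __name__ == "__main__":
    print(edge_triggered_single_event())
```

The thread that received the event ends up accepting everything, which is exactly the starvation pattern described above.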

Correct solution

There are two workarounds.

The best and the only scalable approach is to use a recent kernel (4.5+) and level-triggered events with the EPOLLEXCLUSIVE flag. This ensures only one thread is woken for an event, avoids the "thundering herd" issue and scales properly across multiple CPUs.

Without EPOLLEXCLUSIVE, similar behavior can be emulated with edge-triggered events and EPOLLONESHOT, at the cost of one extra epoll_ctl() syscall after each event. This distributes load across multiple CPUs properly, but at most one worker will call accept() at a time, limiting throughput.

  1. Kernel: Receives two connections. Two threads, A and B, are waiting. Due to "edge triggered" behavior only one is notified - let's say thread A.
  2. Thread A: Finishes epoll_wait().
  3. Thread A: Performs accept(), this succeeds.
  4. Thread A: Performs epoll_ctl(EPOLL_CTL_MOD), this will reset the EPOLLONESHOT and re-arm the socket.
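The steps above can be sketched as follows, assuming Linux and Python's select.epoll, where epoll.modify() issues the EPOLL_CTL_MOD call. The function name is hypothetical and the worker threads are simulated sequentially.

```python
import select
import socket


def oneshot_accept():
    """One event per arming: accept, then re-arm with EPOLL_CTL_MOD."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(16)
    listener.setblocking(False)

    mask = select.EPOLLIN | select.EPOLLET | select.EPOLLONESHOT
    ep = select.epoll()
    ep.register(listener.fileno(), mask)

    addr = listener.getsockname()
    clients = [socket.create_connection(addr) for _ in range(2)]

    woken = ep.poll(1.0)  # the listener fires once and is then disarmed
    conn1, _ = listener.accept()
    while_disarmed = ep.poll(0.2)  # no other waiter is woken in the meantime

    ep.modify(listener.fileno(), mask)  # EPOLL_CTL_MOD re-arms the listener
    after_rearm = ep.poll(1.0)  # the still-queued connection is reported now
    conn2, _ = listener.accept()

    for sock in [conn1, conn2, listener] + clients:
        sock.close()
    ep.close()
    return len(woken), len(while_disarmed), len(after_rearm)


if __name__ == "__main__":
    print(oneshot_accept())
```

The extra epoll.modify() per accepted connection is the additional epoll_ctl() syscall mentioned above.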

It's worth noting there are other ways to scale accept() without relying on epoll. One option is to use SO_REUSEPORT and create multiple listen sockets sharing the same port number. This approach has problems though - when one of the file descriptors is closed, the sockets already waiting in the accept queue will be dropped. Read more in this Yelp blog post and this LWN comment.
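For illustration, a minimal SO_REUSEPORT sketch, assuming Linux 3.9+ and Python's socket module. The helper names are hypothetical, and in a real server each listener would live in its own worker thread or process rather than being drained in turn.

```python
import socket
import time


def reuseport_listener(port=0):
    """Create a non-blocking listener that shares its port with its siblings."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(64)
    sock.setblocking(False)
    return sock


def demo_reuseport(n_clients=16):
    first = reuseport_listener()  # the kernel picks a free port
    port = first.getsockname()[1]
    second = reuseport_listener(port)  # a second accept queue on the same port

    clients = [socket.create_connection(("127.0.0.1", port)) for _ in range(n_clients)]
    time.sleep(0.1)  # assumption: ample time for all handshakes on loopback

    accepted = []
    for listener in (first, second):
        count = 0
        while True:
            try:
                conn, _ = listener.accept()
            except BlockingIOError:  # this listener's queue is empty
                break
            conn.close()
            count += 1
        accepted.append(count)

    for sock in clients + [first, second]:
        sock.close()
    return accepted


if __name__ == "__main__":
    print(demo_reuseport())
```

The kernel decides which listener's queue receives each connection, so the split between the two counts varies from run to run.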

Kernel 4.4 introduced SO_INCOMING_CPU to further improve locality of SO_REUSEPORT sockets. I wasn't able to find good documentation of this very new feature.

Even better, kernel 4.5 introduced SO_ATTACH_REUSEPORT_CBPF and SO_ATTACH_REUSEPORT_EBPF socket options. When used properly, with a bit of magic, it should be possible to substitute SO_INCOMING_CPU and overcome the usual SO_REUSEPORT dropped connections on rebalancing problem.

Scaling out read()

Apart from scaling accept() there is a second use case for scaling epoll across many cores. Imagine a situation when you have a large number of HTTP client connections and you want to serve them as quickly as the data arrives. Each connection may require some unpredictable processing, so sharding them into equal buckets across worker threads will worsen mean latency. It's better to use "the combined queue" queuing model - have one epoll set and use multiple threads to pull active sockets and perform the work.

Here's The Engineer Guy explaining the combined queue model (the video is embedded in the original post).

In our case the shared queue is an epoll descriptor, the tills are worker threads and the jobs are readable sockets.

Epoll level triggered

We don't want to use the level-triggered model due to the "thundering herd" behavior. Additionally, EPOLLEXCLUSIVE won't help since a race condition is still possible. Here's how it may materialize:

  • Kernel: receives 2047 bytes of data
  • Kernel: two threads are waiting on epoll, kernel wakes up only one of them due to EPOLLEXCLUSIVE behavior. Let's say kernel woke up Thread A.
  • Thread A: finishes epoll_wait()
  • Kernel: receives 2 bytes of data
  • Kernel: one thread is waiting on epoll, kernel wakes up Thread B.
  • Thread A: performs read(2048) and reads full buffer of 2048 bytes.
  • Thread B: performs read(2048) and reads remaining 1 byte of data

In this situation the data is split across two threads and without using mutexes the data may be reordered.
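The interleaving above can be replayed deterministically in a single thread. This is only a sketch: it assumes Linux and Python's select.epoll, uses a Unix socket pair as a stand-in for a connected TCP socket, and the function name is hypothetical. The two epoll.poll() calls play the roles of Thread A and Thread B.

```python
import select
import socket


def replay_level_triggered_split():
    """Replay: both 'threads' are woken and each reads part of the stream."""
    # Assumption: a Unix socket pair stands in for a connected TCP socket.
    writer, reader = socket.socketpair()
    reader.setblocking(False)

    ep = select.epoll()
    ep.register(reader.fileno(), select.EPOLLIN)  # level-triggered

    writer.sendall(b"a" * 2047)
    woke_a = ep.poll(1.0)  # "Thread A" returns from epoll_wait()
    writer.sendall(b"b" * 2)
    woke_b = ep.poll(1.0)  # "Thread B" is woken for the same socket
    data_a = reader.recv(2048)  # A reads up to 2048 bytes
    data_b = reader.recv(2048)  # B reads whatever is left over

    ep.close()
    writer.close()
    reader.close()
    return bool(woke_a), bool(woke_b), len(data_a), len(data_b)


if __name__ == "__main__":
    print(replay_level_triggered_split())
```

With real threads nothing orders the two recv() calls or the processing that follows them, which is where the reordering comes from.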

Epoll edge triggered

Okay, so maybe the edge-triggered model will do better? Not really. The same race condition occurs:

  • Kernel: receives 2048 bytes of data
  • Kernel: two threads are waiting for the data: A and B. Due to the "edge triggered" behavior only one is notified.
  • Thread A: finishes epoll_wait()
  • Thread A: performs a read(2048) and reads full buffer of 2048 bytes
  • Kernel: the buffer is empty so the kernel arms the file descriptor again
  • Kernel: receives 1 byte of data
  • Kernel: one thread is currently waiting in epoll_wait, wakes up Thread B
  • Thread B: finishes epoll_wait()
  • Thread B: performs read(2048) and gets 1 byte of data
  • Thread A: retries read(2048), which returns nothing, gets EAGAIN
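The same kind of single-threaded replay works for the edge-triggered sequence. Again this is a sketch under assumptions: Linux, Python's select.epoll, a Unix socket pair standing in for a TCP connection, and a hypothetical function name.

```python
import select
import socket


def replay_edge_triggered_split():
    """Replay: a second edge wakes 'Thread B' while 'Thread A' is still draining."""
    # Assumption: a Unix socket pair stands in for a connected TCP socket.
    writer, reader = socket.socketpair()
    reader.setblocking(False)

    ep = select.epoll()
    ep.register(reader.fileno(), select.EPOLLIN | select.EPOLLET)

    writer.sendall(b"a" * 2048)
    woke_a = ep.poll(1.0)  # "Thread A" is notified
    data_a = reader.recv(2048)  # A fills its buffer; the socket is now empty
    writer.sendall(b"b")
    woke_b = ep.poll(1.0)  # new data is a new edge: "Thread B" is notified
    data_b = reader.recv(2048)  # B gets the single new byte
    try:
        reader.recv(2048)  # A retries, as edge-triggered mode requires
        retry = "data"
    except BlockingIOError:  # this is how Python reports EAGAIN
        retry = "EAGAIN"

    ep.close()
    writer.close()
    reader.close()
    return bool(woke_a), bool(woke_b), len(data_a), len(data_b), retry


if __name__ == "__main__":
    print(replay_edge_triggered_split())
```

Two different workers end up holding consecutive pieces of the same stream, just as in the level-triggered case.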

Correct solution

The correct solution is to use EPOLLONESHOT and re-arm the file descriptor manually. This is the only way to guarantee that the data will be delivered to only one thread and avoid race conditions.
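A minimal sketch of the EPOLLONESHOT pattern for reads, assuming Linux and Python's select.epoll. The helper names are hypothetical and a Unix socket pair stands in for a client connection. While one worker owns the socket, further data does not wake anyone else; the owner drains the socket and then re-arms it.

```python
import select
import socket

ONESHOT_MASK = select.EPOLLIN | select.EPOLLONESHOT


def drain_and_rearm(ep, sock, bufsize=2048):
    """Read everything available, then re-arm the one-shot registration."""
    chunks = []
    while True:
        try:
            chunk = sock.recv(bufsize)
        except BlockingIOError:  # EAGAIN: nothing left to read
            break
        if not chunk:  # peer closed; a real server would unregister and close
            break
        chunks.append(chunk)
    ep.modify(sock.fileno(), ONESHOT_MASK)  # EPOLL_CTL_MOD re-arms the socket
    return b"".join(chunks)


def demo_oneshot_read():
    # Assumption: a Unix socket pair stands in for a connected TCP socket.
    writer, reader = socket.socketpair()
    reader.setblocking(False)

    ep = select.epoll()
    ep.register(reader.fileno(), ONESHOT_MASK)

    writer.sendall(b"a" * 2047)
    first = ep.poll(1.0)  # one worker now owns the socket
    writer.sendall(b"b" * 2)
    second = ep.poll(0.2)  # disarmed: no second worker is woken
    data = drain_and_rearm(ep, reader)  # the owner sees all bytes, in order
    writer.sendall(b"c")
    third = ep.poll(1.0)  # armed again, so new data is reported

    ep.close()
    writer.close()
    reader.close()
    return len(first), len(second), len(data), len(third)


if __name__ == "__main__":
    print(demo_oneshot_read())
```

If data arrives between the last recv() and the re-arm, the EPOLL_CTL_MOD call reports the socket as ready right away, so nothing is lost.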

Conclusion

Using epoll() correctly is hard. Understanding the extra flags EPOLLONESHOT and EPOLLEXCLUSIVE is necessary to achieve load balancing free of race conditions.

Considering that EPOLLEXCLUSIVE is a very new epoll flag, we may conclude that epoll was not originally designed for balancing load across multiple threads.

In the next blog post in this series we will describe the epoll "file descriptor vs file description" problem which occurs when epoll is used with close() and fork() calls.
