Epoll is fundamentally broken (2017) (popcount.org)
81 points by justin_ on Oct 23, 2022 | hide | past | favorite | 41 comments


Linux was the last to implement "events"; they had IO completion ports and kqueue to learn from. They decided to implement something broken and document it as such; it was right in the man page. I find that astonishing. That reminds me: dnotify and inotify are probably still there. I'm bitter because, as a younger programmer, I had to port a clean kqueue event loop to Linux and it was a dreadful experience.


Was your experience the same thing as the linked article? I mean, reading this as someone who knows epoll and not kqueue quite as well, it sounds very much like "I tried to use this API like something I already know and it didn't work, so I hate it and it's broken".

The linked article basically goes through a bunch of scenarios examining clumsy ways to chase a slightly-obscure requirement ("wake up exactly one thread per event using a single descriptor") just to land in the very last paragraph on the clearly correct solution ("just use ONESHOT, that's what it's for!").

I mean, there's literally a feature right there in the man page[1] that does exactly what the author wants. They just didn't want to learn about it and view their ignorance as a bug in the software.

Meh. This gets tiresome. As the article linked yesterday points out, epoll() is the "API that powers the internet", and has effectively solved the C10k problem for everyone that has it.

But yeah, you need to read the man page.

[1] No, seriously, it's right there in the man page, discussing exactly this scenario and how to avoid it.
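For concreteness, here is a minimal sketch of the one-shot pattern the man page describes, assuming Linux and a level-triggered read interest (EPOLLEXCLUSIVE, added in kernel 4.5, is the other option for the thundering-herd case):

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd for one-shot read readiness: the kernel reports the event
 * to exactly one epoll_wait() caller, then disarms the fd until re-armed. */
int add_oneshot(int epfd, int fd)
{
    struct epoll_event ev;
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Re-arm after the winning thread has finished handling the event. */
int rearm_oneshot(int epfd, int fd)
{
    struct epoll_event ev;
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
```

Exactly one worker sees each readiness event; the fd stays disarmed until that worker calls the re-arm, so no second thread can race in on the same event.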


> No, seriously, it's right there in the man page, discussing exactly this scenario and how to avoid it.

The point is, do all of the other configurations for epoll have legitimate use cases justifying the complexity and the need for those parameters? The kqueue design scales from single-threaded to multithreaded scenarios without issue and without all of these pitfalls, so why not just adopt that design? Why does the specific issue need a solution described in the man page at all?

> As the article linked yesterday points out, epoll() is the "API that powers the internet", and has effectively solved the C10k problem for everyone that has it.

Being able to solve a problem and doing it well are not the same. The latter arguably deserves criticism.


> The latter arguably deserves criticism.

Ahem. The headline under discussion is "fundamentally broken", which given the context you've granted, sounds closer to a lie than the truth to me.

Meh. OS advocates will always have their aesthetic wars, they have for decades and it's not going to stop. I'm just trying to draw a line at what constitutes appropriate criticism, and this went way over the line.


> The headline under discussion is "fundamentally broken", which given the context you've granted, sounds closer to a lie than the truth to me.

If an abstraction requires a series of special incantations to use correctly, and the other modes of use don't have legitimate use cases of their own, "fundamentally broken" sounds like a fair conclusion.


> a series of special incantations

These are the words people use to describe something they don't like but which they don't understand. Really, just read the man page. It's a low level OS facility; these aren't simple APIs (neither is kqueue, FWIW). If you can figure out file descriptor passing or socket options, you can figure out epoll. If you really can't figure out epoll, you probably want to be working at a different layer of the abstraction stack.


No, it's pretty well understood at this point that epoll is worse by comparison. You can see it quite clearly in the number of gotchas and the limitations when compared to kqueue:

https://habr.com/en/post/600123/#linux-and-epoll


Again it amuses me that you've now flipped twice between "worse by comparison / deserving of criticism" and outrageous hyperbole like "fundamentally broken". I can't tell if you want to debate the minutiae of OS synchronization abstractions or just yell at me. I'm happy to do the former if you'll just put the knives away.


I already addressed this. There's no flipping; they mean the same thing in this context. epoll was supposed to allow the expression of a certain class of programs, but it instead encouraged a bunch of broken, fragile programming patterns, and epoll itself required numerous revisions and extensions to actually get the behaviour it should have had to begin with, and it is also limited to a subset of the kernel abstractions kqueue supports. That's considerably worse than kqueue, to the point where it's broken. I don't see any further need to discuss it, frankly; we've each said our piece.


> [reasoned debate and outrageous hyperbole] mean the same thing in this context;

Then with all respect, I think your strategy for debate is fundamentally broken, not unlike your understanding of the epoll facility or ability to read its man page, so I take my leave.


> I think your strategy for debate is fundamentally broken

I find that pretty rich considering you provided literally no evidence of your claims, where I provided at least one detailed analysis of both epoll and kqueue in addition to the article for this thread, and yet you somehow feel entitled to make personal attacks on my technical abilities. But sure, you do you. Best of luck with that.


I've had to deal with forks and threads, epoll and inotify. It was beyond my skills at the time.


Is inotify broken? What should be used instead of it to watch for filesystem events?



Thanks!

Does fanotify fix those issues?


Lose events due to inotify buffer overflow, or provide backpressure on entire vfs and allow a watcher to DoS the entire system. Pick one.

Similar "minor implementation issues" occur when considering how to fix all of the other "bugs."

It's fine if you don't like the design trade-offs of a particular solution. But pretending that they aren't trade-offs, or they aren't valid choices, is just going to leave people shaking their heads at you.

Edit: and by "you" I mean the authors of these articles. Although I think Marek is a very talented guy and probably just a bit hyperbolic with his titles.
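The overflow trade-off above is visible right in the read path: when the fixed-size queue fills, the kernel drops events and inserts a single IN_Q_OVERFLOW marker, and the only safe recovery is a full rescan. A minimal sketch of detecting it (assuming Linux and `<sys/inotify.h>`; `saw_overflow` is a hypothetical helper name):

```c
#include <assert.h>
#include <string.h>
#include <sys/inotify.h>
#include <sys/types.h>

/* Walk a buffer returned by read() on an inotify fd and report whether
 * the kernel dropped events (queue overflowed). On overflow the watcher
 * must rescan the watched tree, since it no longer knows what changed. */
int saw_overflow(const char *buf, ssize_t len)
{
    const char *p = buf;
    while (p + sizeof(struct inotify_event) <= buf + len) {
        struct inotify_event ev;
        memcpy(&ev, p, sizeof ev);   /* copy: buf may be unaligned */
        if (ev.mask & IN_Q_OVERFLOW)
            return 1;
        p += sizeof ev + ev.len;
    }
    return 0;
}
```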


I don't know man, I've given up and handed it over to someone more senior. Their port was half the performance and the project was killed. We kept on using FreeBSD for the product.


For watching filesystem events, FreeBSD's kqueue is much more difficult to use correctly, though, and in my experience it certainly doesn't outperform inotify.

inotify (and now fanotify) are specifically designed to watch for filesystem events, whereas kqueue is a generic event system used for many things. In many ways that's better, but this lack of specialisation also comes with some drawbacks.


From what I can recall, the performance issues were on the network side. I suspected accept() calls not scaling right, but instrumentation on Linux was quite poor at that time. They had nothing like dtrace back then so I can't be certain.


Is it broken with regard to correctness or just performance? I mean, does inotify have some obscure race condition, or drop events, or duplicate events, or something?


> Is it broken with regard to correctness or just performance?

Yes.

This blog entry explains it much better than I could: https://wingolog.org/archives/2018/05/21/correct-or-inotify-...


Jesus, is it sad that when I hear "senior" now I don't dream of leading the design of big systems? I think of stakeholder meetings and fighting business requirements.

I had an inotify project in progress for creating a better developer experience. Seeing this hit Hacker News tells me it's a fucking political landmine.

Sigh.

Don't worry, I'll have the strength to argue the proper technical solution that still satisfies business needs. That's the important thing!


I don't understand what you mean.

Anyway I used inotify multiple times in various programming languages and it was always fine, I'm a bit surprised that it's considered "broken" in any way.

I mean, is it broken with regard to correctness or just performance?


I don't consider it broken, I was referring to political will to build the healthy base of a developer ecosystem.

But, you've inspired me to do a deeper historical and technical deep dive of the filesystem event space. Thank you.

Have a great weekend!


And I do understand. I am one who reached out. Thank you for showing me.

They understand who I am.

Have a good week!


My experience of epoll is that it's for single process, single thread network I/O mostly. If you want multithreaded network server use SO_REUSEPORT, and one epoll per thread. Everything else is either broken, overcomplicated or slow.


This is the correct answer (on Linux), and the article dismisses it after only lightly touching on its correctness.

In a multi-threaded environment someone needs to pay the cost of synchronization if the entire event queue is load-balanced. If you don't want the events to be trivially load-balanced and instead want the events from one fd delivered to a single worker, then it's way better to use SO_REUSEPORT and get it right from the get-go.

Expecting the kernel to solve a problem of user-space's making is asking for trouble: it can be done, but the edge cases will sink your project!


> Expecting kernel to solve a problem of user-space’s making is asking for trouble

These are problems the kernel has introduced. I'm not sure you've read the article carefully enough.


Or use IOCP on Windows or kqueue on FreeBSD. IOCP is particularly interesting as it will interact with the scheduler and release more threads if ones get blocked, trying to right-size CPU usage.


The Go runtime (and GHC runtime) use epoll from multiple threads and don't use SO_REUSEPORT and are highly efficient. I do believe that they use one epoll per thread.


Another correct solution, which avoids an issue that is entirely of userspace's creation :)


This is what I was thinking while reading this. Why are they reading the same buffer in 2 threads? This problem will always happen if you read the same buffer in 2 threads.


They are not. This happens for any fd, like a socket waiting for a connection.


I have used select and poll before and sort of skipped to io_uring for a recent project, which also doesn't do too well with multiple threads (you have to use multiple rings and do everything else yourself). It's a shame that there is no obvious, relatively easy to use async IO mechanism on Linux that you can use from multiple threads without getting yourself in trouble. Reading more about kqueue it looks a lot like something that would solve it. Why was it not ported to Linux? Would it have been too hard to integrate? The linked article is great btw. Very informative and concise!


> you have to use multiple rings and do everything else yourself

Since you are supposed to use liburing, not the kernel interface directly, I guess somebody could add multithreading "support" to it.

Or at least add documentation/examples of the most common/performant options: https://github.com/shuveb/loti/issues/4

AFAIR Windows IOCP handles multithreading by:

- Handling locking at kernel level, the syscall is thread safe

- Making it LIFO, to keep things in the same threads, to have a decent cache behaviour.

It's as simple as it gets.


> I have used select and poll before and sort of skipped to io_uring for a recent project, which also doesn't do too well with multiple threads (you have to use multiple rings and do everything else yourself

Wouldn't that be on purpose? Coordination requires more CPU cycles and so cuts into max performance.

> Why was it not ported to Linux?

We've been asking since epoll got introduced.

But here is some context: https://lwn.net/Articles/431297/


> Why was it not ported to Linux?

NIH


National Institutes of Health?


Not Invented Here, old chap


not invented here


I wrote an epoll echo server that multiplexes multiple network connections over threads (multiple users per thread)

https://github.com/samsquire/epoll-server

I also have a 1:M:N (1 scheduler thread, M kernel threads and N lightweight green threads) multithreaded userspace scheduler which multiplexes lightweight threads onto kernel threads and can preempt hot loops with minimal overhead. I rely on the fact that you can change the looping variable from another thread if you use a structure. Preemptive interruption is very useful for the illusion of multitasking. That's why I call it a userspace scheduler.

https://GitHub.com/samsquire/preemptible-thread
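That flipping-the-loop-variable trick can be sketched like this (a hypothetical minimal version under my own assumptions, not the linked project's actual code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* The hot loop re-reads a shared bound on every iteration, so a scheduler
 * thread can preempt it at any time by shrinking that bound. */
struct task {
    _Atomic long limit;   /* scheduler writes; worker reads */
    long done;            /* iterations actually completed */
};

static void *hot_loop(void *arg)
{
    struct task *t = arg;
    for (long i = 0; i < atomic_load(&t->limit); i++)
        t->done++;        /* the "work" */
    return NULL;
}

/* Scheduler side: force the loop condition false, interrupting the loop. */
static void preempt(struct task *t)
{
    atomic_store(&t->limit, 0);
}
```

Because the bound lives in a structure rather than a register-cached local, the store from the scheduler thread is visible to the worker on its next iteration, which is what makes the preemption overhead minimal.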

I think the epoll-server which is kind of similar to what libuv does and the userspace scheduler could be combined into an application server.

I also wrote a multithreaded actor implementation in Java. Threads can communicate with each other between 60 million - 100 million messages a second. The epoll-server uses a multiconsumer multiproducer lockless RingBuffer.

https://GitHub.com/samsquire/multicersion-concurrency-contro...

I think the core fundamentals of building a performant application server should be done once and reused for each application.

I also want to split the threading used by recv and send on a socket, so that each socket gets one assigned recv kernel thread and one assigned send kernel thread (1 scheduler thread, plus 1 recv thread and 1 send thread per socket). Then you can send while you receive and receive while you send. True multiplexing!

We can decouple CPU and IO completely with threading.



