
With the API being nearly the same, I keep thinking that virtual threads are basically identical to platform threads except that they use far less memory (so you can have lots more of them).

Are there any other actual differences? Better performance?



The relationship between throughput, latency, and concurrency in servers is expressed via Little's theorem. If your server is written in the thread-per-request style -- the only style for which the platform offers built-in language, VM, and tooling support -- then the most important factor affecting maximum throughput is the number of threads you can have (until, of course, the hardware is fully utilised). Being able to support many threads is the most effective improvement to server throughput you can offer.

See: Why User-Mode Threads Are Good for Performance https://youtu.be/07V08SB1l8c
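As a back-of-the-envelope illustration of that point (all numbers here are made up), Little's law, L = λ·W, ties the number of in-flight requests L to throughput λ and average latency W, so for a thread-per-request server the thread count caps throughput:

```java
// Little's law: L = lambda * W, so max throughput lambda = L / W.
// L = requests in flight (bounded by available threads),
// W = average request latency in seconds.
public class LittlesLaw {
    public static void main(String[] args) {
        double avgLatencySec = 0.1;        // 100 ms per request, mostly waiting on IO

        long platformThreads = 1_000;      // a plausible cap on OS threads
        long virtualThreads = 1_000_000;   // virtual threads are cheap enough for this

        // Max sustainable throughput with each thread model
        System.out.printf("platform threads: %.0f req/s%n", platformThreads / avgLatencySec);
        System.out.printf("virtual threads:  %.0f req/s%n", virtualThreads / avgLatencySec);
    }
}
```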


Thanks for the video. I feel like there's a bit of conflation between the terms "performance (latency)" and "throughput", but I see the point. I'd be interested to see that latency graph (time marker 15:38) between platform and virtual threads in the case where the server doesn't manufacture a 100ms delay (say, in the case of a caching reverse-proxy).

Also - millions of Java programmers thank you for not going with async/await. What an evil source-code virus that is (among other things).

I tried to watch it at 1.25x speed as I normally do, but you already talk at 1.25x speed, so no need!


To understand what happens when the server doesn't perform IO, apply Little's formula to the CPU only. Clearly, the maximum concurrency would be equal to the number of cores, which means that in that situation there would be no benefit to more threads than cores. What you would see in the graph would be that the server fails once L is equal to the number of cores. The average ratio between IO and CPU time as portions of the average duration would give you an upper limit on how much more throughput you gain by having more threads. That's what I explain at 11:34.
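A worked instance of that upper bound, with invented numbers: if the average request spends 5 ms on CPU and 95 ms blocked on IO, adding threads beyond the core count can multiply throughput by at most total/CPU time.

```java
// Sketch of the bound described above, with made-up numbers.
public class IoCpuRatio {
    public static void main(String[] args) {
        double cpuMs = 5;   // CPU time per request
        double ioMs = 95;   // time blocked on IO per request

        // Extra threads can only overlap the IO portion, so the gain
        // over one-thread-per-core is bounded by total / cpu.
        double maxGain = (cpuMs + ioMs) / cpuMs;
        System.out.println("max throughput gain from more threads: " + maxGain + "x");
        // prints "max throughput gain from more threads: 20.0x"
    }
}
```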

Also, both throughput and latency are performance metrics.


I watched the video and thoroughly enjoyed it, thank you for sharing it! I have a question that is perhaps not entirely related to the video, but it touches the topic of context switches. I've read this post [1] by Chris Hegarty, which explains that when calling the traditionally blocking network I/O APIs in the Java stdlib from a virtual thread, the runtime uses asynchronous/poll-based kernel syscalls (IOCP, kqueue, and epoll on Windows, macOS, and Linux respectively), which I assume is to avoid blocking the carrier threads. That post was written in 2021, does it still hold true today in Java 21?

Reading that, it also makes me wonder what happens for disk I/O. Many other runtimes, both "green thread" ones like Golang and asynchronous ones like libuv/tokio, use a blocking thread pool (static or elastic) to offload these kernel syscalls because, from what I've read, those syscalls are not easily made non-blocking like e.g. epoll is. Do Java's virtual threads do the same, or does disk I/O block the carrier threads? Out of curiosity, do Java's file APIs use io_uring on Linux if it is available? It is a fairly recently added kernel API for achieving truly non-blocking I/O, including disk I/O. It doesn't seem to bring much over epoll in terms of performance, but it has been a boon for disk I/O and in general can reduce context switches with the kernel by reducing the number of syscalls needed.

[1]: https://inside.java/2021/05/10/networking-io-with-virtual-th...
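A minimal sketch of what the linked post describes (requires Java 21+; the port and payload byte are arbitrary): a blocking socket read on a virtual thread parks only the virtual thread, not the underlying carrier.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;

// Blocking network IO on a virtual thread: the JDK registers the fd with a
// poller (epoll/kqueue/IOCP) and unmounts the virtual thread from its carrier
// until data arrives, instead of blocking an OS thread.
public class BlockingReadDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread vt = Thread.ofVirtual().start(() -> {
                try (Socket s = new Socket("localhost", server.getLocalPort());
                     InputStream in = s.getInputStream()) {
                    // Looks like a blocking read; only the virtual thread parks.
                    System.out.println("got byte: " + in.read());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            try (Socket peer = server.accept()) {
                peer.getOutputStream().write(42);
            }
            vt.join();
        }
    }
}
```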


> That post was written in 2021, does it still hold true today in Java 21?

Yes.

> Do Java's virtual threads do the same, or does disk I/O block the carrier threads? Out of curiosity, do Java's file APIs use io_uring on Linux if it is available?

We're working on using io_uring where available, especially for filesystem IO. For now, filesystem IO blocks OS threads but we temporarily compensate by increasing the size of the scheduler's worker thread pool.
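A sketch of that current behaviour (Java 21+; the file contents and task count here are arbitrary): file reads from virtual threads block carrier threads, and the default scheduler temporarily grows its worker pool to compensate, capped by the documented jdk.virtualThreadScheduler.maxPoolSize system property.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Filesystem reads from many virtual threads. Each read blocks its carrier
// (OS) thread for the duration; the scheduler compensates by adding workers,
// tunable via -Djdk.virtualThreadScheduler.maxPoolSize.
public class FileReadDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("vt-demo", ".txt");
        Files.writeString(tmp, "hello");
        try (var pool = Executors.newVirtualThreadPerTaskExecutor()) {
            var futures = new ArrayList<Future<String>>();
            for (int i = 0; i < 100; i++) {
                futures.add(pool.submit(() -> Files.readString(tmp))); // blocks a carrier
            }
            for (var f : futures) {
                if (!f.get().equals("hello")) throw new AssertionError();
            }
        } finally {
            Files.delete(tmp);
        }
    }
}
```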


In late 2021 I compared OS threads to io_uring for filesystem I/O, doing random-access reads from fast NVMe SSDs.

That measurement told me that io_uring isn't necessary for good disk I/O performance in some workloads.

It found no performance improvement from io_uring compared with a dynamic thread pool that tries to maintain enough I/O-blocked threads to keep the various kernel and device queues sufficiently busy.

This was a little surprising, because the read-syscall overhead when using threads was measurable. preadv2() was surprisingly much slower than pread(), so I used the latter. I used CLONE_IO and very small stacks for the I/O threads (less than a page; about 1kiB IIRC), but the performance was pretty good using only pthreads without those thread optimisations. Probably I had good thread pool and queue logic, as it surprised me that the result was much faster than "fio" benchmark results had led me to expect.

In principle, io_uring should be a little more robust to different scenarios with competing processes, compared with blocking I/O threads, because it has access to kernel scheduling in a way that userspace does not. I also expect io_uring to get a little faster with time, compared with the kernel I tested on.

However, on Linux, OS threads* have been the fastest way to do filesystem and block-device I/O for a long time. (* Except for CLONE_IO not being set by default, but that flag is ignored in most configurations of current kernels.)


> less than a page; about 1kiB IIRC

Interesting, didn't realize the kernel would let you do that. I guess it makes sense since it's up to user space to map pages for the stack. The kernel doesn't have much to do on clone except set the stack pointer.


That is an absolutely amazing video. From a brief, intuitive, and well-diagrammed explanation of non-trivial concepts of queuing theory, to practical examples, to connecting it all to real-world use cases and value, and all within a surprisingly short period of time, it is one of the most impressive technical videos I've ever seen. Thank you.


"With the currency being the same, I keep thinking a salary of $50k/year is the same as that of $500k/year. Are there any other actual differences?"

Just as with performance improvements [1][2][3][4], the actual impact on the user experience is non-linear and often hard to predict. In the case of virtual threads, you go from needing to consciously work around a limited amount of available threads to spawning one per request and moving on.

[1]: https://youtu.be/4XpnKHJAok8?t=3026

[2]: "These tests are fast enough that I can hit enter (my test-running keystroke) and have a response before I have time to think. It means that the flow of my thoughts never breaks." - https://news.ycombinator.com/item?id=7676948

[3]: https://news.ycombinator.com/item?id=37277885

[4]: "Go’s execution tracer has suffered from high overhead since its inception in 2014. Historically this has forced potential users to worry about up to 20% of CPU overhead when turning it on. Due to this, it's mostly been used in test environments or tricky situations rather than gaining adoption as a continuous profiling signal in production." - https://blog.felixge.de/waiting-for-go1-21-execution-tracing...
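Concretely (the handler here is a hypothetical stand-in; Java 21+), the shift is from sizing a shared pool and queueing behind it to just starting one virtual thread per request:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerRequest {
    // Stand-in for a request handler that would block on IO.
    static String handle(int requestId) {
        return "response-" + requestId;
    }

    public static void main(String[] args) throws Exception {
        // Before: a consciously sized pool that requests contend for.
        ExecutorService bounded = Executors.newFixedThreadPool(200);
        bounded.shutdown();

        // After: one cheap virtual thread per request, no pool tuning.
        try (var perTask = Executors.newVirtualThreadPerTaskExecutor()) {
            Future<String> f = perTask.submit(() -> handle(1));
            System.out.println(f.get()); // prints "response-1"
        }
    }
}
```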


The context switch time is much smaller, so yes, better performance.



