
Why is multithreaded performance important? What are the use cases where you cannot run multiple processes to spread your number crunching across CPUs?

I am praying for CPython to become faster. But I need faster single-threaded performance, so that web applications benefit from it.



Here's a use case: I was training a neural net and wanted to do some preprocessing (similar to image resizing, but without an existing C function). Inputs are batched, so the preprocessing is trivially parallelizable. I tried to multithread it in Python, and got no speedup at all.
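A minimal sketch of that situation (the `preprocess` function is a made-up stand-in for the per-sample work): CPU-bound, pure-Python code parallelizes poorly with threads because the GIL lets only one thread execute Python bytecode at a time, so the threaded version returns correct results but little to no speedup.

```python
# Pure-Python, CPU-bound preprocessing: threads give correct results
# but roughly no speedup under the GIL.
from concurrent.futures import ThreadPoolExecutor

def preprocess(sample):
    # Stand-in for per-sample work (e.g. a hand-rolled resize).
    return sum(x * x for x in sample)

batch = [list(range(1000)) for _ in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(preprocess, batch))

sequential = [preprocess(s) for s in batch]
assert threaded == sequential  # same answers, just not faster
```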

That was a really sad moment, and I've never felt good about python since.


Yeah, I had something similar.

I wanted "as you wait for the GPU to churn through this batch, start reading the next batch from disk & preprocessing it on the CPU"

Getting this to work turned out to be so ass-backwards that it made me sad.

Also I pity the fool who tries to connect a debugger to code using multiprocessing.Pool()...
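For what it's worth, the "read the next batch while the current one computes" pattern can at least be sketched with a bounded queue and one producer thread; `load_batch` and the consumer's `sum` are hypothetical stand-ins for disk I/O and the GPU step.

```python
# Overlap loading with compute: a producer thread prefetches batches
# into a bounded queue while the main thread consumes them.
import queue
import threading

def load_batch(i):
    return [i] * 4  # pretend this reads & preprocesses a batch from disk

def prefetcher(n_batches, q):
    for i in range(n_batches):
        q.put(load_batch(i))
    q.put(None)  # sentinel: no more batches

q = queue.Queue(maxsize=2)  # loader stays at most 2 batches ahead
threading.Thread(target=prefetcher, args=(3, q), daemon=True).start()

results = []
while (batch := q.get()) is not None:
    results.append(sum(batch))  # pretend this is the GPU step
```

This works even under the GIL when the loading side is I/O-bound, because the loader thread spends its time blocked in reads rather than executing bytecode.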


> Also I pity the fool who tries to connect a debugger to code using multiprocessing.Pool()

PyCharm works fine


Yeah, had the exact problem.

I solved it by using Python threads and a C function (cv2.resize).
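That trick works because many C-extension functions (cv2.resize, NumPy array ops, zlib, hashlib) release the GIL while they run, so plain Python threads really do execute them in parallel. A sketch with stdlib `zlib.compress` standing in for `cv2.resize`, to avoid a cv2 dependency:

```python
# Threads + a GIL-releasing C function: this actually scales across cores,
# unlike pure-Python work. zlib.compress releases the GIL internally.
import zlib
from concurrent.futures import ThreadPoolExecutor

data = [bytes(1_000_000) for _ in range(8)]  # highly compressible buffers

with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(zlib.compress, data))

assert all(len(c) < len(d) for c, d in zip(compressed, data))
```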


Communicating between processes is more expensive than communicating between threads.


Ok, but what is the use case where you need high bandwidth inter-thread/process communication?


Games. Simulations. Backend servers for shared editor experiences. Teleconferencing.


Except for embarrassingly parallel problems, the trade-off of generating more parallelism is usually needing finer-grained communication. Canonical examples in the literature are matrix multiplication, triangulation, and refinement.


Large shared state is basically always the answer. You can cop-out and say use a database or Redis if that’s fast enough but that’s just making someone else use many threads with shared memory.


Matrix multiply is an obvious one. Partial differential equations are another. Sorting is one if you don't care about math.


One thing that comes to mind is for shared connection pools when you have thousands of workers connecting to the same stateful service.


Which use case requires such a setup?


MySQL comes to mind. Unlike Postgres, where connections are expensive, MySQL encourages loading up the server with hundreds of simultaneous connections per server.

Like anything, it’s possible to split this work out to a separate process but the IPC overhead is a lot.


Most databases (unfortunately), and since a ton of web apps use databases for state...

That said, many such databases are rolling out (or already have rolled out) connection pool proxies for reasons like this, so meh.


Web front end backed by a database with an expensive login, so you cache connections. It would be called a two-tier architecture.
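The shared-pool pattern described in this subthread can be sketched in a few lines: N worker threads borrow from a small set of "connections" via a thread-safe queue. `Conn` here is a made-up stand-in for a real DB connection.

```python
# A tiny thread-shared connection pool: 4 connections, 16 workers.
import queue
import threading

class Conn:
    def __init__(self, i):
        self.i = i
    def query(self, q):
        return (self.i, q)  # pretend this round-trips to the server

pool = queue.Queue()
for i in range(4):               # only 4 expensive connections exist
    pool.put(Conn(i))

results = []
lock = threading.Lock()

def worker(n):
    conn = pool.get()            # borrow a connection (blocks if none free)
    try:
        r = conn.query(n)
    finally:
        pool.put(conn)           # always return it to the pool
    with lock:
        results.append(r)

threads = [threading.Thread(target=worker, args=(n,)) for n in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Doing the same across processes means either a pooling proxy process or serializing every query over IPC, which is exactly the overhead being complained about above.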


Multiple processes use a lot more memory than threads.


Can you quantify that?


There is such a problem on Windows. When you create a new process (spawn), the child can't share the parent's memory copy-on-write; it has to rebuild its state from scratch.
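Whether memory is shared with the parent depends on the multiprocessing start method: 'fork' shares pages copy-on-write, while 'spawn' (the only option on Windows) starts a fresh interpreter. A quick way to see what your platform offers:

```python
# Inspect the available multiprocessing start methods on this platform.
import multiprocessing as mp

methods = mp.get_all_start_methods()
# Linux typically offers ['fork', 'spawn', 'forkserver'];
# Windows offers only ['spawn'].
assert "spawn" in methods  # 'spawn' exists everywhere
```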


Yes, you can


Even with copy-on-write? Have any actual numbers?


What about context switches as well, much slower IPC, and basically "no native support"?


Depends on the workload if it matters or not. Some it matters a ton, some not a bit. And if you have n=cores processes, you can get some benefits with collocation and the like the OS can do for you.


But I also meant to highlight that you can’t simply lock on an object, can’t yield/notify other threads, etc.

While performance may not matter, productivity very much suffers from multi-process programming.


I’m not saying performance is not different between these types of solutions or the dev work is not sometimes different.

I’m saying that the penalties (including development effort) for work co-ordinated between processes compared to threads can vary from nearly zero (sometimes even net better) to terrible depending on the nature of the workload. Threads are very convenient programmatically, but also have some serious drawbacks in certain scenarios.

For instance, fork()'d sub-processes do a good job of avoiding segfaulting/crashing/deadlocking everything all at once if there is a memory issue or other problem while doing the work. It's very difficult in native threading models to do per-work resource management or quotas (like a maximum memory footprint, or a max number of open sockets or open files), since everything is grouped together at the OS level and it's a lot of work to reinvent that yourself (which you'd need to do with threads). Also, the same shared-memory convenience can cause some pretty crazy memory corruption in otherwise isolated areas of your system if you're not careful, which is not possible with separate processes.
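The isolation point above is easy to demonstrate on POSIX: a fork()'d child that dies from a signal takes down only itself, and the parent just observes the failure (a thread crashing the same way would kill the whole process).

```python
# POSIX-only sketch: a crashing child leaves the parent running.
import os

pid = os.fork()
if pid == 0:
    os.abort()                   # child "crashes" with SIGABRT
_, status = os.waitpid(pid, 0)   # parent reaps the dead child
crashed = os.WIFSIGNALED(status) # True: child died from a signal
```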

I do wish the Python concurrency model was better. Even with its warts, it is still possible to do a LOT with it in a performant way. Some workloads are definitely not worth the trouble right now, however.


Actual work is left as an exercise for the reader ;-)


Last time I did this, when the processes were fork()s of the parent (the typical way this was done), the memory overhead was minimal compared to threads: a couple percent. That was somewhat workload-dependent, however; if there is a lot of memory churn or data marshalling/unmarshalling happening as part of the workload, they'll quickly diverge and you'll burn a ton of CPU doing so.

Typical ways around that include mmap'ing things or various types of shared-memory IPC, but that is a lot of work.


Each Python process requires somewhere around 200 MB of memory and 0.1 s to do nothing. If you want libraries, it scales from there.


Really? That sounds like an awful lot.

When I execute this:

    python3 -c 'import time; time.sleep(60)'
And then run pmap on that process's ID, I get 26144K. That is 26 MB.

As for timing, when I execute this:

    time python3 -c ''
I get 0.02s


Also, generally no one spins up distinct new processes for the "co-ordinated distinct process work queue" when they can just fork(), which should be way faster. Pretty much every platform uses copy-on-write for fork, so it also has minimal memory overhead (at least initially).


The problem is (perhaps amusingly) with refcounting. As the processes run, they'll each be doing refcount operations on the same module/class/function/etc objects which causes the memory to be unshared.


Only where there is memory churn. If you're in a tight processing loop (checksumming a file? reading data in and computing something from it?), then the refcounts of most objects are never touched from the baseline.

Also, since copy-on-write generally happens memory page by memory page, even if you were doing a lot of that, if most of those refcounts sit in a small number of pages, it's not likely to really change much.

It would be good to get real numbers here of course. I couldn’t find anyone obviously complaining about obvious issues with it in Python after a cursory search though.


That was solved years ago by moving the refcounts into different pages: https://instagram-engineering.com/copy-on-write-friendly-pyt...
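For completeness: alongside the refcount relocation described in that article, CPython 3.7 added `gc.freeze()` for exactly this fork-then-share pattern. A POSIX-only sketch of the recommended sequence (the "build shared state" step is a placeholder):

```python
# Freeze long-lived objects before forking so GC never touches their pages.
import gc
import os

gc.disable()       # avoid GC churn while building shared state
# ... import modules / build read-only state to share here ...
gc.freeze()        # move everything allocated so far to the permanent generation

pid = os.fork()    # children now share the frozen pages copy-on-write
if pid == 0:
    os._exit(0)    # child: do its work, then exit without cleanup
os.waitpid(pid, 0)
gc.enable()
```

Frozen objects are never collected or traversed by the GC, so their pages stay shared between parent and children instead of being dirtied by collection passes.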


Interactive use of Python (plotting, working with data) would also benefit from better multithreading. It's interactive, so it's (a bit) frustrating to wait for a computation and see that it uses just one thread (the statistics ops are usually well threaded already, but plotting is not).



