If you use threads, green or otherwise, you don't have to "implement" special code for composing things together: you get the full set of tools for composing code, which includes, in passing, state machines, among many other things. The approach in the article, by contrast, basically implements an Inner Platform Effect: an internal, data-based language for concurrency that the exterior language interprets, which will A: forever be weaker than the exterior language (such is the nature of the Inner Platform Effect) and B: require a lot of work that essentially duplicates program control flow and all sorts of other things the exterior language already has.
There are some programming languages that have sufficiently impoverished semantics that this is their best effort that they can make towards concurrency.
But this is Rust. It's the language that fixes all the reasons to be afraid of threading in the first place. What's actually wrong with threads here? This isn't Java. And having written network servers with pretty much every abstraction so much as mentioned in the article, green threads are a dream for writing network servers. You can hardly believe how much accidental complexity you're fighting with every day in non-threaded solutions until you try something like Erlang or Go. Rust could be something that I mention in the same breath for network servers. But not with this approach.
There's plenty to debate in this post and I don't expect to go unchallenged. But I would remind repliers that we are explicitly in a Rust context. We must talk about Rust here, not 1998-C++. What's so wrong with Rust threads, excepting perhaps them being heavyweight? (Far better to solve that problem directly.)
> What's so wrong with Rust threads, excepting perhaps them being heavyweight? (Far better to solve that problem directly.)
Nothing. If threads work fine, use them! That's what most Rust network apps do, and they work fine and run fast. A modern Linux kernel is very good at making 1:1 threading fast these days.
> And having written network servers with pretty much every abstraction so much as mentioned in the article, green threads are a dream for writing network servers.
Green threads didn't provide performance benefits over native threads in Rust. That's why they were removed.
Based on my experience, goroutine-style green threads don't really provide a performance benefit over native threads. The overhead that really matters for threads is in stack management, not the syscalls, and M:N threading has a ton of practical disadvantages (which is why Java removed it, for example). It's worth going back to the debate around the time of NPTL and look at what the conclusions of the Linux community were around this time (which were, essentially, that 1:1 is the way to go and M:N is not worth it).
There are benefits to be had with goroutine-style threads in stack management. For example, if you have a GC that can relocate pointers, then you can start with small stacks. But note that this is a property of the stack, not the scheduler, and you could theoretically do the same with 1:1 threading. It also doesn't get you to the same level of performance as something like nginx, which forgoes the stack entirely.
If you really want to eliminate the overhead and go as fast as possible, you need to get rid of not only the syscall overhead but also the stack overhead. Eliminating the stack means that there is no "simple" solution at the runtime level: your compiler has to be heavily involved. (Look at async/await in C# for an example of one way to do this.) This is the approach that I'm most keen on for Rust, given Rust's focus on achieving maximum performance. To that end, I really like the "bottom-up" approach that the Rust community is taking: let's get the infrastructure working first as libraries, and then we'll figure out how to make it as ergonomic as possible, possibly (probably?) with language extensions.
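To make "stackless" concrete, here's a rough sketch of the kind of library-level building block I mean (names made up for illustration, not an actual Rust API): a task is just a value you poll, so the only per-task state is whatever the task type itself stores, and there is no dedicated stack.

    // Hypothetical sketch: a stackless "task" is just a pollable value.
    enum Poll<T> {
        Ready(T),
        NotReady,
    }

    trait Task {
        type Output;
        fn poll(&mut self) -> Poll<Self::Output>;
    }

    // A task that completes after being polled `n` times; its entire state is
    // one integer, not a stack.
    struct CountDown {
        remaining: u32,
    }

    impl Task for CountDown {
        type Output = ();
        fn poll(&mut self) -> Poll<()> {
            if self.remaining == 0 {
                Poll::Ready(())
            } else {
                self.remaining -= 1;
                Poll::NotReady
            }
        }
    }

    fn main() {
        // A trivial "executor": keep polling until the task is done.
        let mut task = CountDown { remaining: 3 };
        let mut polls = 0;
        loop {
            polls += 1;
            if let Poll::Ready(()) = task.poll() {
                break;
            }
        }
        println!("completed after {} polls", polls);
    }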
My overarching point is this: it's very tempting to just say "I/O is a solved problem, just use green threads". But it's not that simple. M:N is obviously a viable approach, but it leaves a lot of performance on the table by keeping around stacks and comes with a lot of disadvantages (slow FFI, for example, and fairness).
> Green threads didn't provide performance benefits over native threads in Rust. That's why they were removed.
This seems wrong. This shouldn't be the point of green threads -- green threads aren't a "faster" alternative to native threads. How would that work? How could it be the case that virtual threads implemented in user space are faster than native threads provided by the OS? I don't think anyone expects that to be true of green threads in general.
We don't have green threads as a useful abstraction because they're fast, we have them because they're flexible, and because they allow us to easily capture a lot of patterns. They let you easily and efficiently distribute work amongst multiple processors without having to think about the subtleties of the underlying threading model.
Of course, given Rust's other goals, shying away from green threads makes sense.
Rust's green threading was... unusual. It attempted to abstract green and native threads into a single API, which led to a lot of unnecessary overhead:
1. Stackless coroutines will probably require help from the compiler team (hint hint). It'll be worth it! Think web frameworks with Nginx performance and Rust memory safety.
2. Documentation, stability, portability, quality. I'll keep banging that drum until there's a standard. :)
3. The Mio API is subjectively weird, but I might be missing something.
The first point has been on the radar for quite a while (i.e. years), but it's really non-trivial, and not at all necessary for the core language (hence not a priority for stability) since it's basically just some syntactic niceties over existing functionality.
Yes, I know. That's exactly what I mean by syntactic sugar: it is an automated way to write something that could already be written, but is annoying to do by hand.
I'm confused. To me, green threads are just an implementation detail. The fact that green threading is being used shouldn't leak into the interface. To take Go for an example, you could perfectly well write a conforming implementation of Go that used 1:1 native threading: it would just have different performance characteristics and things like LockOSThread() would become a no-op.
Could you elaborate on what you consider the interface difference to be?
This is backwards. Green threading can be an implementation detail, but in languages where it's a key feature, green threads let you write code that would otherwise be incorrect. For example, in a native threading model, it is an awful idea to spawn a thread for every incoming connection on a server. It's the easy way to write it, but it's wrong. With a green threading model, though, that's easy and efficient.
> For example, in a native threading model, it is an awful idea to spawn a thread for every incoming connection on a server. It's the easy way to write it, but it's wrong. With a green threading model, though, that's easy and efficient.
A thread per connection isn't wrong, though. You're begging the question by assuming that green threads are faster than native threads. I'm specifically arguing against that.
Yeah, outside of Linux, threads are dog slow. But Linux's implementation demonstrates that they don't have to be, and if you have serious scalability problems you're going to know your architecture well in advance.
As I already said, green threads are not faster than native threads. That makes no sense. Green threads, due to smaller stacks and cheaper context switches, are more scalable and predictable at the cost of extra user land bookkeeping.
The point is: who cares? Do you really need 3 million threads (particularly, 3 million threads that aren't as efficient as ordinary threads for many tasks)? Ostensibly Google does about 3 billion queries a day; that's less than 35,000 queries per second. Obviously, there are many services underlying any Google search, but most of them will be distributed among multiple cores, machines, and datacenters. So what eventuality are you preparing for where regular threads won't be sufficient?
The point isn't that I need 3 million threads. The point is that I can easily write code that is correct and behaves predictably even if I end up throwing some arbitrary number of threads at the processor.
As I view them, green threads are a useful abstraction just like objects or functions. Do you need something computed, not necessarily now, but at some point? Just make a green thread to compute it and check on the result later. That's the platonic ideal of green threads, anyway: very few languages have gotten to that point (Erlang, maybe Go, on a good day; C# has this with its async functionality but none of it happens in parallel).
> The point is that I can easily write code that is correct and behaves predictably even if I end up throwing some arbitrary number of threads at the processor.
But you can't. Green threads have scalability limitations too, and their performance is not predictable (except possibly by the creators of the runtime) in any language or on any processor that does automatic preemption. That's like saying you use i64 instead of i32 because now you can do calculations without worrying about overflow. It may be true for some classes of problem, but it's hardly a different programming paradigm like you're making it out to be. They're definitely not a different abstraction from threads (which are themselves an abstraction!).
I was curious what the actual state was of the "modern Linux kernel" pcwalton mentioned, so I tried running a test program to create a million threads on a VM - x86-64 with 8GB of RAM, Linux 4.0. For comparison, Go apparently uses about 4KB per goroutine, so it should be possible to create somewhat under 2 million goroutines. To be fair, I allocated the stacks manually in one large allocation; otherwise it dies quite quickly running out of VM mappings. I set the stack size to the pthread minimum of 16KB (actually, I tried cheating and making it smaller, but it crashed, so I gave up - not a good idea anyway). The threads waited for an indication from the main thread, sent after thread creation was done, to exit; in an attempt to avoid the overhead associated with pthread conditions, I just used futex directly:
    while (!ready)
        syscall(SYS_futex, &ready, FUTEX_WAIT, 0, NULL, NULL, 0); /* returns 0 on wake-up, EAGAIN if ready was already set, EINTR on a signal; re-check the flag either way */
The program caused the kernel to hit OOM (rather ungracefully!) somewhere around 270,000 threads. To see how long it took while ensuring all the threads actually ran, I reduced the thread count to 200,000, had it join all the threads at the end, and timed this whole process: after the first run it took about 4 seconds. (The first run was considerably slower, but that isn't a big deal for a server, which is the most important use case for having such a large number of goroutines/threads.) Therefore, the C version uses about 20 microseconds and 32 KB of memory per thread.
For completeness, I also tested a similar Go program on Go 1.4 (the version available from Debian on the VM); it actually got up to 3,150,000 before OOM, and took 9 seconds to do 2 million - 4.5 microseconds and 2.7KB per thread.
In other words, Linux is about an order of magnitude slower at managing a large number of threads. That looks pretty bad, but on the other hand, it's not that much in absolute terms! I'm pretty sure most server programs don't need more than 250,000 simultaneous connections (or can afford to spend more than 8GB of RAM on them) and don't mind spending an extra 20 microseconds to initiate a connection, so if operating systems other than Linux aren't a concern, they could be written to create a thread per connection without too much trouble. It's not going to give you the absolute maximum performance (meaning it's not appropriate for a decent class of program - then again, I suspect Go isn't either), but it's not terrible either.
I'd like to see it improve. I wouldn't be surprised if there is (still) some low hanging fruit; do kernel developers actually care about this use case?
(And yes, I know this doesn't really test performance of the scheduler during sustained operation. That's its own can of worms.)
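(For reference in Rust terms: a rough sketch of the same kind of experiment using std::thread::Builder to set a small stack size. This is not the C program I actually timed, just an illustration, and the numbers are placeholders; the kernel/libc may round the stack size up.)

    use std::sync::{Arc, Condvar, Mutex};
    use std::thread;

    fn main() {
        let count = 1_000;                          // illustrative; crank this up to stress-test
        let ready = Arc::new((Mutex::new(false), Condvar::new()));
        let mut handles = Vec::with_capacity(count);

        for _ in 0..count {
            let ready = ready.clone();
            let handle = thread::Builder::new()
                .stack_size(32 * 1024)              // small stacks, as in the C test
                .spawn(move || {
                    let (ref lock, ref cvar) = *ready;
                    let mut go = lock.lock().unwrap();
                    while !*go {
                        go = cvar.wait(go).unwrap(); // block until the main thread flips the flag
                    }
                })
                .expect("thread creation failed");
            handles.push(handle);
        }

        // Release all threads at once, then join them.
        {
            let (ref lock, ref cvar) = *ready;
            *lock.lock().unwrap() = true;
            cvar.notify_all();
        }
        for handle in handles {
            handle.join().unwrap();
        }
        println!("spawned and joined {} threads", count);
    }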
> It's not going to give you the absolute maximum performance (meaning it's not appropriate for a decent class of program - then again, I suspect Go isn't either), but it's not terrible either.
Yeah, this matches our results when we did similar tests. It's definitely faster to use green threads if you're just spawning and shutting down the threads, but if you're actually doing I/O work on those threads, the relative overhead quickly becomes negligible.
It's not the fastest way to do I/O, but Go's approach isn't either. The fastest way to do I/O is to forego the thread management syscalls and the stack, like nginx does.
>It's not the fastest way to do I/O, but Go's approach isn't either
Though Go's is faster.
>forego the thread management syscalls and the stack, like nginx does.
The problem is nginx uses state machines, which aren't a general solution. Stackless coroutines allow you to write functions that the compiler (or a clever library) transforms into state machines.
> To be fair, I allocated the stacks manually in one large allocation; otherwise it dies quite quickly running out of VM mappings.
Okay, so the test you did doesn't actually reflect the use case in practice. Can I expect to reach 200,000 threads if the threads are not all created at exactly the same moment? What if (God forbid) they're doing memory allocation? And if it does work out, will everything be handled efficiently?
Hope comex replies to your question. Typical green thread usage is spawn-em-as-you-need-em, so if in order to spawn lots of 1:1 threads I need to do it all up front, that could be very limiting or complicating.
Yeah, I just made a mistake - you can increase the maximum number of mappings using /proc/sys/vm/max_map_count; I tried doing that and switching back to normal stack allocation (but still specifying the minimum size of 16KB using pthread_attr_setstacksize) and it doesn't change the number of threads I was able to create.
...in fact, neither did removing the setstacksize call and having default 8MB stacks. I guess this makes sense: of course the extra VM space reserved for the stacks doesn't require actual RAM to back it; there is some page table overhead, but I guess it's not enough to make a significant difference at this number of allocations. Of course, on 32-bit architectures this would quickly exhaust the address space.
If increasing max_map_count hadn't worked, it would still be possible to allocate stacks on the fly - but you would get a bunch of them in one mmap() call, and therefore in one VM mapping, and dole them out in userland. However, in this case guard pages wouldn't separate different threads' stacks, so you would have to generate code that manually checks the stack pointer to avoid security issues from stack overflows, rather than relying on crashing by hitting the guard page. Rust actually already does this, mostly unnecessarily; I'm not sure what Go is doing these days but I think it does too. Anyway, given the above result, I suspect this won't be an actual issue, at least until the number of threads goes up by an order of magnitude or something.
That doesn't seem that unrealistic — you could allocate your stacks using slab allocation, for example. I wonder why the Kernel allocator doesn't do a better job though.
The key number for server applications is latency — it doesn't matter if you can spawn 200,000 threads if it takes you four seconds to serve a request. And if a thread can't serve a request before its time slice is up, it gets sent to the back of the line and its cache entries get evicted by other threads.
Coroutines have three awesome advantages here:
* Really fast switches (it's basically a function call — no need to trap into the kernel)
* Scheduling in response to IO events (a coroutine can yield when making a "blocking" IO call and resume when it completes). You can write asynchronous code that looks synchronous!
* Better cache behavior (hugely important on modern processors). You can allocate all your coroutine structures together and don't have to suffer all the cache evictions of a context switch every time a coroutine yields.
It would be interesting to see this benchmark repeated with a latency requirement.
> Scheduling in response to IO events (a coroutine can yield when making a "blocking" IO call and resume when it completes). You can write asynchronous code that looks synchronous!
You can also write "actual" synchronous code which does the same thing in the kernel. :P And there are several CPU schedulers to choose from to fine-tune when things get woken up.
I agree it would be interesting to test latency; actually, the numbers would be much more useful than my four seconds. After all, if thread creation throughput or latency becomes an issue, you can keep a thread pool around, while still retaining the advantages of one connection per thread.
>You can also write "actual" synchronous code which does the same thing in the kernel. :P And there are several CPU schedulers to choose from to fine-tune when things get woken up.
You can. But no matter how good your kernel scheduler is, that trip through it is going to cost you. Why? Because it's going to wreck your cache and TLB entries. I think the TLB is a big part of the story, given it's small, specific to the process, and particularly expensive to miss.
Anyway, I'm sure you know this already. :) Out of curiosity, did you get a chance to take CS169 before you left Brown?
And I think the kernel already does pool/recycle stacks, but I guess not 200,000 of them given your results. An explicit thread pool would certainly work too.
Kernel stacks are allocated by Linux in alloc_thread_info_node; user stacks by glibc in allocate_stack, which does have some sort of cache. Overall there's a fair bit of work going on that could be skipped...
But yeah, certainly userland context switches are faster; it's just that your comment seemed to postulate "scheduling in response to IO events" as a benefit of coroutines over threads, as if the kernel had no way to distinguish between threads waiting for IO and threads needing to be woken up. I presumably misinterpreted it.
Interesting paper. It seems like it could be a useful step toward my ideal imagined environment where the distinction between coroutines and threads would be meaningless - because scheduling would be a job shared between userland and the kernel, so userland could do fast thread switches, batch system calls, etc., while the kernel would still give them PIDs and let them take signals or be ptraced, and native tools (debuggers) would treat them as threads.
Though the paper is from 2010; do you know if anyone has tried to implement anything along similar lines in production?
Oh OK, I definitely could have written that better.
From what I can tell, M:N threading has been thoroughly abandoned by Linux, but it might be worth revisiting given the horrendous I/O scaling of Linux kernel threads. I would imagine scheduling user threads cooperatively simplifies things a lot.
Google has a very interesting talk about cooperative native thread scheduling, but they haven't upstreamed their code:
In the meantime, people who actually need the performance (HPC and even some enterprise servers) have to bypass the kernel entirely. It looks like that's going to be possible even in virtualized environments, thanks to hardware support:
It was a terrible idea on old OSes, and engineers never went back to learn how modern schedulers behave. Many of your favorite systems serve high request rates on Varnish using one thread per session.
> To me, green threads are just an implementation detail. The fact that green threading is being used shouldn't leak into the interface.
I think that mentality, which I supported, may be what ruined green threads for Rust. Perhaps the perfect was the enemy of the good. On the other hand, maybe not having to worry about differences is the better benefit. I would rather be slower-but-acceptably-fast than have to think about 1:1 vs M:N threading in the case that they are not interchangeable.
On the third hand, it could be that Rust is too close to the metal for green threads to be of any significant benefit. If that's the case, Rust ought to trounce all the M:N implementations, subject to reasonable limits on spawn rate, memory pressure and inter-thread communication. Does that sound right?
Incidentally, I've never seen a cooperative threading implementation that didn't have a yield call of some kind available, and you can usually jam any of them up by writing an infinite computation-only loop, so even in the languages where the scheduler is automatically invoked, correct 1:1 code can deadlock in the M:N scenario. State machines can too, but a trivial example, translated, might read something like the following:
    for (;;);               /* spins forever and never yields, starving the cooperative scheduler */
    exit(EXIT_SUCCESS);     /* never reached */
I've seen code like this written by complete newbies, but threading constructs can hide that relationship even from experts. In parallel code it isn't a sequential relationship, but it is with cooperative threads. I guess this is all fairly pedantic (such a minor detail, mostly affecting only poorly-written programs), but in the end, one of: yield(), sched(), sync(), etc. does show up in practically all of the cooperative threading interfaces.
The interface is different because if you use a green thread within a GC'd language, you can do anything -- change anything, interact with anything -- and nothing will crash. No locking involved, and no hidden locking under the hood.
It's not meant to be a performance boost, but a way of writing code where "might this crash?" is a question you're never bothered to ask. It's wonderfully freeing.
Not true of native threads. If a goroutine were replaced with a native thread, it would swiftly crash on GC.
> a way of writing code where "might this crash?" is a question you're never bothered to ask.
Is that actually true? Yes, you won't get a partially written int, but you could still run into consistency issues if you don't lock. For example, if you need thread A to update both x and y to have a consistent state, and B is reading both x and y, A might update x, then the scheduler would switch to B, and B might read the new x and the old y. If the update A was making needed to change both x and y for the system to be in a consistent state, you have a problem.
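A toy, single-threaded illustration of what I mean (made-up code, not a real green-thread runtime): the writer is split at its would-be yield point and the "scheduler" runs the reader in between. Nothing crashes, but the reader sees the new x with the old y.

    use std::cell::RefCell;

    #[derive(Debug)]
    struct State { x: u64, y: u64 }

    fn main() {
        // Shared state; the intended invariant is x == y.
        let state = RefCell::new(State { x: 0, y: 0 });

        // "Thread A" starts its two-part update...
        state.borrow_mut().x += 1;

        // ...the cooperative scheduler switches to "thread B" at this yield point,
        // and B reads a half-updated pair: no crash, no torn int, just wrong data.
        println!("B observes {:?}", state.borrow());   // State { x: 1, y: 0 }

        // "Thread A" resumes and finishes the update.
        state.borrow_mut().y += 1;
        println!("B observes {:?}", state.borrow());   // State { x: 1, y: 1 }
    }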
It will do what you just described. And what you described won't crash.
If you use the int as an offset into raw memory, then sure, that will. But if you use it as an offset into a list, then it won't. At most it will cause an exception to be generated. That will cause the reader thread to die, because it was the code that caused the exception, but it still won't crash.
Right, but I don't really see how that's freeing. A crash isn't good, but having your program be incorrect might be even worse. If the program crashes, you know it. But it might take some time before you recognize that your logic is wrong, which might cause all kinds of damage.
So I don't think it frees you from thinking about concurrency at all.
Sure. But I might write my N:1 code more quickly, more robustly, and with better performance than your M:N code, since M:N forces the programmer to carefully worry about whether their code will break.
> Basically if you have green threads with only one core, you don't need to do as much work with locking.
This is half the basis for the original design of the Erlang BEAM VM. SMP was a grudging afterthought; the original concept was that each VM "node" was a single scheduler process bound to a single core, and then you tied those nodes together into a (machine-internal) multi-core distributed system, effectively creating a set of "threads" communicating over socket-based message-passing IPC, rather than shared memory.
M:N scheduling (where, critically, a task can be rescheduled from one core to another) is a much more complex and less predictable design, one that's not always worth the trouble. For Erlang's particular use-cases, it turned out to be worth the trouble enough to justify programming it, eventually. But the SMP VM is still only built via a configure flag, rather than displacing the M:1 VM as "the" VM.
The argument here is less about whether the performance difference exists, and more that the decision in question sounds like it was made for the lack of something we care less about.
I was responding to the first paragraph, which is technically all about the performance differences. (it does say "How could it be the case that virtual threads implemented in user space are faster than native threads provided by the OS?") Whether that matters and is a valid decision point is a completely different question as you say.
I'm down with all that. I'm really arguing in favor of threading here rather than any particular model of it.
Plus, if you write threaded code, you can change the runtime out. Rust guarantees all the hard parts, anyhow. I wouldn't even be surprised if, once Rust settles down further, it turns out there's some sort of hybrid solution that is better than either 1:1 or M:N on its own, because it has superior insight into what its functions are doing.
However, if you are sitting down in front of an editor, faced with the task of writing a network server, and you immediately reach for the overcomplicated solutions like this rather than starting with threads, you are paying an awfully stiff development price for a performance improvement that probably won't even manifest as any useful effect, because the window where this will actually save you is rather small. If you're sitting at 90% utilization, and this sort of tweak can get you to 80% utilization, that's essentially a no-op in the network server space, because it means you'd still better deploy a second server either way. If you're close to a capacity problem (and the code has been decently optimized to the point where that's not an option anymore), the solution is not to rewrite your code to squeeze out the 10% inefficiency in stack handling; the solution is to deploy another server, and if a rewrite is needed, rewrite to make that possible/effective/practical.
In the Servo design space, it makes perfect sense to be upset that you lost 10% of your performance to green threads, because that's time a real human user is directly waiting, and what does a web browser need with 10,000 threads anyhow? (I mean, sure, set one the deliberate task of using that many threads and they can, but in practice you'd rather be doing real work than any sort of scheduling of that mess.) In the network space it's way less clear... if your infrastructure is vulnerable to 10% variances in performance, your infrastructure is vulnerable to 10% variances in incoming request rate, too.
(10% is a bit of a made up number. I believe, if anything, it is an overestimate. Point holds even if it's a larger number anyhow.)
Are you talking about threads-the-programming-model (vs. events, callbacks, channels, futures/promises, dependency graphs), or are you talking about threads-the-implementation-technique (vs. processes or various select()-like mechanisms)? And if you are talking about threads-the-programming-model, which synchronization mechanism: locks, monitors, channels/queues, transactional memory?
If the various threads in your application don't share data, then you don't need to worry about all the pitfalls of threading. But if they don't share data and you don't care about small (~10%) performance differences, why not use processes? Then you can use a really straightforward blocking I/O model, and you get full memory isolation & security provided by the OS.
The interesting questions happen when you a.) want to squeeze as much performance out of the machine as possible or b.) need to share data between concurrent activities. Then all of the different programming models have pros and cons, and I'm not sure you can define a "best" approach without knowing your particular problem. (Which, IMHO, validates Rust's "provide the barest primitives you can for the problem, and let libraries provide the abstractions until it becomes clear that one library is a clear winner" approach.)
FWIW, my experience in distributed systems is that threads+locks is a terrible model for writing robust systems, and that once you're operating at scale, you really want some sort of dependency-graph dataflow system where you specify what inputs are required for each bit of computation and then the system walks the graph as RPCs come back and input becomes available. This lets you attach all sorts of other information to nodes - timeouts, latency statistics, error statistics, whether or not this node is required and what defaults to substitute if it fails, tracing, logging, load-balancing, etc. It also adds a huge amount of cognitive overhead for someone who just wants to make a couple database queries. I wouldn't use this for prototyping a webapp, but it's also invaluable when you have a production system and ops people who need to be able to shut down a misbehaving service at a moment's notice while still keeping your overall product up.
But it wasn't necessarily the threads, so much as the fine-grained locks.
I'm talking about threads-the-programming-model, where we don't throw away the gains we made with structured programming. I suppose I ought to get off my duff and finish the blog post that explains exactly what that means, but if you know what structured programming is, that's actually enough to explain what I mean. (It's just that nowadays most people don't actually know what it is, not because they have no familiarity with it, but because they have too much, and don't realize that it was not actually the "ground state" of programming but actually a hard-fought advance, which we are all collectively on the verge of spending the next, oh, 5-10 years painfully rediscovering the value of.)
> I am talking about threads the programming model
If parallelism is off the table, I would be surprised by any disagreement. I doubt anyone finds a control-inverted style of programming (i.e., callbacks) more pleasurable than code that looks synchronous but executes asynchronously. One thing that does require watching out for is making sure there are no blocking calls. Green threads, coroutines, fibres seem so much nicer as a programming abstraction. Is this view even contentious?
This does not address parallelism, though. These are about exchanging control between stacks of execution; only one such stack is active in a unit. There could be multiple such units, though, and the rub lies in whether they synchronize by messages (less buggy) or shared mutable state (at times faster if serialization is expensive; think Python).
A model that I like is different units of execution (threads) in a common address space, each hosting coroutines or their ilk, with the units syncing exclusively by message passing, but without the need for serialization because they live in a common address space.
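A rough sketch of that model using plain OS threads and channels (made-up types, nothing fancy): the payload is moved between threads, not serialized, because they share an address space.

    use std::sync::mpsc;
    use std::thread;

    // A job is handed off by ownership; no copy of `payload`, no encoding step.
    struct Job { id: u32, payload: Vec<u8> }

    fn main() {
        let (tx, rx) = mpsc::channel::<Box<Job>>();

        let worker = thread::spawn(move || {
            for job in rx {
                // The Box is transferred between the units; nothing is serialized.
                println!("worker got job {} ({} bytes)", job.id, job.payload.len());
            }
        });

        for id in 0..3u32 {
            tx.send(Box::new(Job { id: id, payload: vec![0u8; 1024] })).unwrap();
        }
        drop(tx);                 // close the channel so the worker's loop ends
        worker.join().unwrap();
    }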
Do you have any links on the "dependency-graph dataflow system" that you are talking about? Sounds a little bit similar to what I'm trying to do, except at higher scale.
Imagine injecting Futures into your code. Now imagine injecting them, but having the Provider record a lot of metadata about who you were calling, how you were calling it, how long it took, whether an error occurred, letting an SRE turn off the call, etc.
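Something like this, as a blocking stand-in (all names made up; the real thing wraps futures rather than closures):

    use std::time::Instant;

    // Hypothetical sketch: a wrapper the framework injects around every downstream call
    // so it can record call counts, errors, and latency, and let an operator disable a
    // misbehaving dependency entirely.
    struct CallStats { calls: u64, errors: u64, micros: u64 }

    fn instrumented<T, F>(name: &str, enabled: bool, stats: &mut CallStats, call: F) -> Option<T>
        where F: FnOnce() -> Result<T, String>
    {
        if !enabled {
            // The dependency has been switched off; the caller substitutes a default.
            return None;
        }
        let start = Instant::now();
        let result = call();
        let elapsed = start.elapsed();
        stats.calls += 1;
        stats.micros += elapsed.as_secs() * 1_000_000 + (elapsed.subsec_nanos() / 1_000) as u64;
        match result {
            Ok(value) => Some(value),
            Err(err) => {
                stats.errors += 1;
                println!("{} failed: {}", name, err);
                None
            }
        }
    }

    fn main() {
        let mut stats = CallStats { calls: 0, errors: 0, micros: 0 };
        // Pretend this closure is an RPC or database query.
        let user = instrumented("user-service.lookup", true, &mut stats, || Ok::<_, String>(42u32));
        println!("result = {:?}; {} calls, {} errors, {}us total",
                 user, stats.calls, stats.errors, stats.micros);
    }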
> I'm down with all that. I'm really arguing in favor of threading here rather than any particular model of it.
Oh, from an ergonomic point of view I'm in total agreement that imperative-looking code is the best. I've never liked explicit node.js-like continuation passing style or what have you.
I'm not sure what you mean by the "syscall overhead", but I 100% agree that stackless coroutines in Rust would be amazing. But I'm also skeptical that they can be implemented as a library without compiler support. C++ only manages with some very unhygienic macros, switches, and gotos:
Most of my work involves kernel programming (device drivers for hardware or virtualisation companies etc.); almost all of the rest of it is writing networking code (custom protocol implementations). So my reasons for evaluating[1] Rust are that I think C and C++ are pretty awful languages for writing code that's as security and reliability critical as the code I touch on a day to day basis, while memory-managed languages aren't even on the menu for me - kernel hacking is a fairly unpopular niche, and I'm used to it being ignored in terms of programming language and library design. My opinions/statements below are primarily centred around this set of use cases:
* There is little practical difference between regular threads and green threads in an OS kernel. Most OSes can't handle it if you manually mess with the stack in a kernel context anyway. (thread records are typically accessed by rounding the current stack pointer) So if you want to do synchronous-style I/O with real stacks, you'll need an OS thread for each task.
* Kernel thread stacks are fairly small; 8-16KiB are typical. The reason for this is of course that kernel stacks must use wired (non-pageable) memory, or interrupts will irrecoverably page fault. 8-16KiB is much too small for buffers of course, but also enormous compared to the actual amount of non-buffer state for most I/O tasks. In any case, you never ever want to get too close to utilising the theoretical maximum as you risk crashing the system. So for every kernel thread you create, you know you're wasting precious wired memory. Obviously, this is true for threads created from userspace too, but in the kernel, people usually don't have a choice about running your code, so you try to be as good a citizen as possible, and not fire off hundreds of threads.
* In many contexts, dynamic memory allocations are not reliable, and allocation failure must not impede progress. (I.e. I can't wait for the system to page some memory out to disk if my code is on the critical path for disk I/O.) So typically, it is desirable to pre-allocate enough memory to keep all the state required for a sequence of operations from start to finish. It's even less likely you can just spin up a new thread; so the threaded approach implies keeping a pool of threads around which is guaranteed to be big enough. A.k.a. a waste of resources.
* The chain-of-callbacks approach to I/O is even more awful in languages and environments with manually managed memory and resources.
* If I'm going to pick a fancy new language to write my drivers in, I'm still going to have to use the custom alloc/free functions for each type of kernel object (network packet, etc.) that the OS I'm writing against happens to use.
So typically, you end up either splitting your code into a bunch of callbacks, or you create a complicated explicit state machine with a giant dispatch switch() statement. In both cases, all state is tracked in a giant struct. Keeping track of control flow is tricky. Maintaining invariants is tricky. Making sure you don't leak (or over-free, or use-after-free) any of the resources you touch is tricky.
It'd be really, really nice if the language could help you out with this. Write it as synchronous code, and the compiler turns each location where execution can be suspended into a callback function. The locals that are used across suspension points are stored in an automatically generated struct (bonus points: unions for state which is guaranteed not to have overlapping lifetimes) which can be preallocated before firing off the "task" in question. As far as I'm aware, such a code transformation would effectively be a CPS transform (continuation-passing style), which has at least been researched quite a bit in theory, if not so much in non-GC-language practice.
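To give a feel for roughly the shape of thing I'd want generated, here's a hand-rolled sketch (made-up names): "read a length byte, then that many payload bytes", written as an explicit state machine. Only the local that lives across the suspension point ends up in the task struct, which can be preallocated, and there is no stack.

    enum ReadState {
        AwaitHeader,
        AwaitBody { expected_len: usize },  // the one local that lives across a suspension point
        Done,
    }

    enum Step {
        Suspended,            // not enough input yet; the event loop will call resume() again
        Finished(Vec<u8>),
    }

    struct ReadTask {
        state: ReadState,
    }

    impl ReadTask {
        fn new() -> ReadTask {
            ReadTask { state: ReadState::AwaitHeader }
        }

        // Called by the event loop whenever more of `input` is available.
        fn resume(&mut self, input: &[u8]) -> Step {
            loop {
                match self.state {
                    ReadState::AwaitHeader => {
                        if input.is_empty() {
                            return Step::Suspended;            // suspension point #1
                        }
                        let expected_len = input[0] as usize;
                        self.state = ReadState::AwaitBody { expected_len: expected_len };
                    }
                    ReadState::AwaitBody { expected_len } => {
                        if input.len() < 1 + expected_len {
                            return Step::Suspended;            // suspension point #2
                        }
                        let body = input[1..1 + expected_len].to_vec();
                        self.state = ReadState::Done;
                        return Step::Finished(body);
                    }
                    ReadState::Done => return Step::Suspended, // already finished
                }
            }
        }
    }

    fn main() {
        let mut task = ReadTask::new();
        // First wake-up: only the header has arrived, so the task suspends.
        match task.resume(&[3]) {
            Step::Suspended => println!("suspended, waiting for the body"),
            Step::Finished(_) => unreachable!(),
        }
        // Second wake-up: the full message is available.
        if let Step::Finished(body) = task.resume(&[3, b'a', b'b', b'c']) {
            println!("finished: {:?}", body);
        }
    }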
I haven't been able to invest large amounts of time into really learning Rust and applying it to my use case in earnest. It's tough without a remotely compliant standard C library around, and I've struggled a bit with trying to pick and choose bits out of Rust's 'core' library without bringing on an avalanche of dependencies. (any kernel module code is kept in wired memory, so unused code is a waste of resources in the kernel) I'm determined to overcome that though and use Rust for something other than a toy kernel module, and see if and how it improves things over C. If Rust does start supporting some kind of advanced I/O pattern that works in that sort of constrained environment, that seems like a significant competitive advantage.
The first's library implementation is based on boost.coroutine which as far as I'm aware uses stack-switching, so this is more of a syntax demo than a demo of the final implementation details. The fact that you can apply sizeof() to the resumable type is important for the kernel-typical pool allocators.
The second seems to have preview support in VS2015's compiler. Too bad I'm not much of a Windows programmer these days… It's not quite clear what the memory-allocation story is from that article.
Of course the downside is it's yet another graft onto the C++ language. After ~15 years of using it, my point of view is mainly that C++'s type system and unsafe-by-default behaviour inherited from C are a mistake…
I think you should take a look at C#'s async/await infrastructure.
I am also from a low-level C background and I think this is the approach I would like to see in Rust.
It can be implemented in a way that would be safe to run in kernel space and overall provides a pretty nice programming model. It can also be very performant if it's integrated into the compiler.
Isn't this a rehash of the C10K problem[1] from a decade ago? That was pretty much resolved in favour of single-threading and asynchronous IO, with Nginx and Node.js replacing Apache and Ruby as the platforms that the cool kids use.
So, if threads are the way to go today, what has changed in the last 10 years to turn the conventional wisdom on its head? 64-bit processors and servers with more memory? Hypervisors/containers? Are threads in Rust more practical than they are in C?
(BTW, these aren't rhetorical questions, genuinely interested.)
Disclaimer: I'm not an expert in this area, merely an interested bystander.
So, AFAIU the C10K approach is still the correct approach. Although perhaps with current hw and Linux being able to decently handle quite a lot of threads since NPTL, one should nowadays call it the C100K problem?
What has changed is a change in focus. Few people do entirely static sites anymore, and for static assets (images etc.) everybody uses nginx or other nonblocking approaches anyway. Perhaps also nginx/etc. as a reverse proxy. So there is no argument that nonblocking approaches are better for handling a lot of potentially slow connections.
But for the "core" functionality of a network application, which people are actually spending time programming rather than using an off-the-shelf solution like nginx, nonblocking vs. threads doesn't matter that much; there are all kinds of DB calls, CPU-intensive work to do, etc. So people are (correctly) asking whether we can create nonblocking code but with an easier-to-use programming model (to the extent it does matter for performance) or whether to just use threads.
No, backwards: the C10k techniques are the solution to the problem "how do I achieve my performance and scaling goals given the characteristics of the OS, languages, tools available to me today?". The current thread (sic) is asking the question "if we can change the language, tools (perhaps the OS too), what's the best approach?".
And: I have to counterbalance the assertion that Node.js is in any way good for anything besides "I need to code in JS, but not in the browser". I'd rather use VAX assembler and the $QIO syscall, typing on a VT100.
That's exactly the point. So my question remains: if threads are the right answer today, either the C10K folks were wrong or something has changed since then. What?
The question is a different question, hence the answer is different. C10k wasn't considering the option of new languages, this thread is. The same question could have been asked back then (although most folks wouldn't have considered using anything other than C/C++ for high-scale servers then). Perhaps the thing that's changed is more memory and CPU power has made the option of new (and less efficient than C) languages a practical choice for new projects. I know that if, back then, I'd proposed the solution to my product's scaling issues with network I/O to be : "re-write all 500k loc in a different language", I'd have been laughed at.
Well, I believe that it's almost impossible to make Rust threads lightweight, because every green thread needs a stack anyway. This may be fixed with some stuff like `async/await`.
But let's talk about what's wrong with threads:
1. Timeout handling is ugly: you need to account for a timeout in each read and write operation. At the very least, timeout handling makes coroutine/threaded code no better than state machine code. But in the state machine approach I can actually set a deadline once and update it only when it changes (note to myself: should add such an example to the documentation). In Python, it's usually fixed by having another coroutine sleep and then throw an exception into the one doing I/O. It works well, but Rust will never get exceptions (I hope).
2. When you make a server that receives a request, looks in the DB, then responds, there is an incentive for each thread to own (create or acquire from the pool) a DB connection. This is a sad trick. The better approach is for the DB connection to be handled by a coroutine of its own, because then you can pipeline multiple requests over a connection, monitor whether the connection is still alive, reconnect to the DB while no requests are active, or, conversely, shut down idle connections. By pipelining you can keep a smaller number of connections to the DB and make the load on the database a little lower. (When I talk about the DB in this paragraph, I of course mean everything for which this application is a client.) Sure, you can do that in threaded code too, but it's much harder to get right: you need two threads per connection (because one reads the network and the other watches the queue and does the writes), you need to synchronize both sides, connection cleanup code is complex, there is more than one level of timeouts now, and so on. (See the sketch below.)
3. You need to avoid deadlocks. Rust takes care to avoid data races, but deadlocks are possible. And they are not always simple or reproducible, so you will have a hard time debugging them. In single-threaded async code, you are the only user. Even if you have an async thread per processor, you are more likely to own resources instead of locking on them. You may duplicate many things for every thread. You can have more coarse-grained locks, so you never hold two of them. But it's almost impossible to write a lock-free threaded server.
All of the issues above are neither fixed with async/await nor with any M:N or 1:1 threading approaches.
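To illustrate point 2, here is a rough sketch of the connection-owner shape using plain threads and channels (made-up names; the "DB" is faked): each request carries its own reply channel, so one task owns the connection and is free to pipeline, reconnect, or shut it down between requests.

    use std::sync::mpsc;
    use std::thread;

    struct Request {
        query: String,
        reply: mpsc::Sender<String>,
    }

    fn main() {
        let (tx, rx) = mpsc::channel::<Request>();

        // The "connection owner": the only code that ever touches the (pretend) DB socket.
        let owner = thread::spawn(move || {
            for req in rx {
                // Real code would write the query to the socket and match responses back
                // to requests; here we just fake the round trip.
                let answer = format!("result of `{}`", req.query);
                let _ = req.reply.send(answer);     // the requester may have gone away
            }
            // Channel closed: no requests are active, so the connection could be shut down here.
        });

        // Several request handlers sharing the single connection.
        let mut handlers = Vec::new();
        for i in 0..3 {
            let tx = tx.clone();
            handlers.push(thread::spawn(move || {
                let (reply_tx, reply_rx) = mpsc::channel();
                tx.send(Request { query: format!("SELECT {}", i), reply: reply_tx }).unwrap();
                println!("handler {} got: {}", i, reply_rx.recv().unwrap());
            }));
        }

        for h in handlers { h.join().unwrap(); }
        drop(tx);                // last sender gone; the owner's loop ends
        owner.join().unwrap();
    }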
I've been writing in this model for nearly ten years now. In practice, what you cite as problems aren't.
1. In either approach, somewhere in your event loop you're setting yourself a timeout to fire. Haskell & Erlang do use exceptions, but Go does not; it simply makes this a first-class concern of the core event loop. This is only a problem in languages where the threading was bolted on after the fact. Which is a lot of languages, and they matter because they have a lot of code. I don't mean to dismiss those real problems. But it's not a fundamental problem, only accidental.
2. In practice, this is not a problem I ever worry about. You get a DB library, it provides pools, unless you're talking to a very, very fast DB (like, memcached on localhost fast) this is one of those cases where IO really does dominate any minor price of thread scheduling.
3. This has been solved for a long time. Go has the nicest little catch phrase with "share memory by communicating instead of communicating by sharing memory", but each of Haskell, Erlang, and Go have their own quite distinct solutions to their problems, and in practice, all of them work. There's other solutions I merely haven't used, but I hear Clojure works, too. (Perhaps arguably a subset of the several approaches Haskell can use. Haskell kind of supports darned near everything, and you can use it all at once.)
This is part of why I write this sort of thing... at its usual glacial pace (despite how much we like to flatter ourselves that we move quickly), the programming community is finally getting around to being really seriously pissed off about how bad threading was in the 1990s. Good. We should be. It sucked. Let us never forget that. But what has not been so well noticed is that the problems with threading have basically been fixed, and in production for a long time now (i.e., not just in theory, but shipping systems; go ask Erlang how long it's been around). You just have to go use the solutions. Don't mistake debates about the minutia of 1:1 OS threading vs. M:N threading and which is single-digit percent points faster than the other for thinking that threading doesn't work.
Lest I sound too pollyannaish about what is still a hard domain, the way I like to put this is that threading has moved from an exponentially complex problem to a polynomially complex problem. (And Rust is leading the way on making even the polynomial have a small number in the exponent.) There's still a certain amount of complexity in making a threaded program go zoom, and it does require some adjustments to how you program, it's not "free", but rather than requiring wizards, it merely requires competent programmers who take a bit of care and use good tools and best practices now.
Somewhat depressing that the earliest "heated discussion" I remember about this was nearly 20 years ago. The outcome of that discussion (with Alan Freier about the M:N and cooperative threading implemented in NSPR at that time) was "if only we could just make threads work". Seems like progress has been pretty slow toward that goal, although Go has picked up the pace recently.
Thread 1 sends message A to thread 2 and waits for a response.
As part of processing message A, thread 2 sends message B to thread 1 and waits for a response... forever, since thread 1 is blocked waiting for thread 2...
A simple example of this would be two threads with two channels to each other. A is in channel A, B is in channel B. While blocked on channel A, thread 1 can't handle messages in channel B.
Certainly a bit more contrived than deadlock with locks, though (which you can easily do with even a single thread, even in Rust).
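A runnable sketch of that scenario (made-up message names); it just sleeps a second and reports that both threads are still stuck:

    use std::sync::mpsc;
    use std::thread;
    use std::time::Duration;

    // Thread 1 sends A and blocks waiting for a reply; thread 2, while handling A, sends B
    // and blocks waiting for a reply to B, which thread 1 can never produce because it is
    // still blocked. No locks involved, yet nobody makes progress.
    fn main() {
        let (to_1, inbox_1) = mpsc::channel::<&'static str>();
        let (to_2, inbox_2) = mpsc::channel::<&'static str>();

        let _t1 = thread::spawn(move || {
            to_2.send("message A").unwrap();
            let _ = inbox_1.recv();            // wait (forever) for the reply to A
        });

        let _t2 = thread::spawn(move || {
            let _a = inbox_2.recv().unwrap();  // got "message A"
            to_1.send("message B").unwrap();   // ask thread 1 for something first
            let _ = inbox_2.recv();            // wait (forever) for the reply to B
        });

        thread::sleep(Duration::from_secs(1));
        println!("one second later, both threads are still blocked: deadlock");
    }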
One nice thing about Rust is how well suited it can be for embedded use. It is something like a smaller, safer, easier C++. For embedded you need to bound resources. Small network appliances are a good candidate for async IO. Many threads and stacks with a lot of dynamic allocation are not a good fit for a resource-constrained device.
The dynamically-allocating portions of the stdlib may panic on OOM, but in an embedded context you're not even linking that code into your program. And you're free to provide an alternative stdlib that bubbles up OOM for those rare occasions where you want to dynamically allocate and you want to be able to do something sane in the face of OOM and you're on a platform that doesn't overcommit.
Minorish point: a true OOM won't panic, it'll abort the program completely. Some APIs may identify that the requested amount of memory is impossible (would overflow) and panic instead, though.
Aborting once the allocator has actually been invoked is important for perf and exception safety, though we have briefly mused allowing it to panic: https://github.com/rust-lang/rust/issues/26951
OOM is an abort, not a panic (contrary to the official docs, interestingly)
> you're free to provide an alternative stdlib
But like... should you have to?
> for those rare occasions
In kernel programming, allocation failures are common and handling them is essential. To quote my (very talented) classmate, who actually wrote part of a kernel in Rust:
"The only really major issue is with how allocation failure works, which makes rust a (very) poor choice for real kernel development"
This is the status quo (in all languages) if you're doing embedded work: you lose most of the standard library. Rust goes slightly beyond that default by bundling `core`, which is designed for these environments and serves as a building block on which embedded/bare-metal development can flourish. (Cargo works fine with such things, so one can even distribute alternate standard libraries on crates.io.)
> "The only really major issue is with how allocation failure works, which makes rust a (very) poor choice for real kernel development"
I think this was the result of bravely trying very hard to shoe-horn `std` and the inflexible `box` syntax into a kernel environment, somewhere both are 100% not designed for. The recommended approach is to just use `core` and layer non-`std` libs on top of that.
1. In practice most embedded toolchains will give you some of the standard library. A standards compliant C++ freestanding library provides new and delete, for example.
2. The libcore allocator API is marked unstable, so even if you go through the trouble of implementing it, how long will it last?
That's what libcore is. A lot of the stdlib is reexports over libcore.
new and delete are C++ isms, their Rust counterparts are `Box::new` (and destructors are automatic). `Box::new`, like `new`, has the OOM issue. If you're okay with that, you are free to link to the `alloc` crate, which gives you `Box` without pulling in additional deps.
There are plans for in-place boxing, and AFAICT that API will be available to embedded users.
Just to be clear, the C++ equivalent of Box is unique_ptr. I don't see equivalents of new and delete, but I might be missing them.
I see Box::from_raw and core::ops::Placer. Is that what you mean by in-place boxing? But now there are two problems:
You have to manually allocate a buffer of the right size, check if the pointer is null, and box it. For every allocation. Forget a null check? Allocate too small a buffer? Guess what, you've got undefined behavior!
It also doesn't generalize to other parts of the standard library. Did you want to use Rust's built in containers in your kernel? Well, sorry, you're going to have to write your own ones that don't panic on allocation failure.
> Just to be clear, the C++ equivalent of Box is unique_ptr.
Yeah, I mean that it's used like `new` is. We don't have the notion of constructors.
> You have to manually allocate a buffer of the right size, check if the pointer is null, and box it. For every allocation.
No. With the placement API, you'll be able to use the `box` syntax (which at the moment just supports `Box`, which has OOM issues) for your custom wrapper (i.e., `Box` with OOM support, or whatever).
What `box` gives us is that `let x = box make_inner()` would have the returned value of `make_inner()` written directly to the heap location. (Unlike `let x = Box::new(make_inner())`, which may be a move of `make_inner()`)
> Did you want to use Rust's built in containers in your kernel? Well, sorry, you're going to have to write your own ones that don't panic on allocation failure.
When you're writing a kernel you're not supposed to be using the standard library at all, just libcore (which exists for this purpose). The same issue exists in C++.
>When you're writing a kernel you're not supposed to be using the standard library at all, just libcore (which exists for this purpose). The same issue exists in C++.
Not quite. C++ standard containers are polymorphic on the allocator. That means you're welcome to use std::list or std::string in kernel mode. Just plug in a kernel allocator and you're good to go.
There's no technical reason why every Rust core project must reinvent the linked list and the hash map, like every C project.
1. Yes, as I just said, that's essentially libcore.
It doesn't literally offer working dynamic allocations, but I don't see how a cross-platform and cross-use-case freestanding library could do that: there are too many different environments/constraints to encode a built-in allocator. In fact, core says nothing about how allocation has to work or even whether it needs to exist.
The rustc distribution has the additional `alloc` and `collections` crates layered on top of `core`, which offers some extra functionality when allocation does exist (without assuming other OS support, like IO), and the allocator used is pluggable.
(Do you have an example of a standards-compliant C++ freestanding standard library?)
2. There will be some way to do this long term. I can offer no guarantees that it will be what's there right now, but something will exist. This is also something that has been on the radar for a long time, but isn't core to the language. In any case, what is in core is essentially just optimising stack usage; all of the semantics it offers can be done with normal function calls, e.g. `Box::new` is literally just a normal function and perfectly implementable in any environment, if you have some allocation routine (the compiler/language doesn't need to know about either). So the unstable libcore functionality is important and desirable, but not a killer.
There's also the meaning of "allocators" as in allowing `Vec` (etc.) to use some other algorithm than what Rust does by default; again something desirable that has been on the radar for a while, but not key to the language. I don't think this needs any new language features, just people to explore the library space. (NB. this is different to what libcore offers now, which is basically what is called "placement new" in C++.)
In other words, freestanding C++ allows you to use new and delete normally (after providing malloc and free). Note that this is a minimum requirement. There's nothing stopping you from using STL containers in embedded systems, and people do (especially now that STL containers can take user-defined allocators).
It seems like Rust is trying to achieve the same result with libcore. That's good! Thanks for clarifying.
But now you lose all the nice parts of the standard library, and probably much of the safety. If you want to use Box, you have to manually do the allocation, check the pointer, and construct the box. Buffer too small? Forget the pointer check? Welcome to undefined behavior.
If you want a container, you have to roll your own. If you want a container that takes user-defined allocators, I'm not even sure what'd you do.
As you say, the libsupc++ requires you to provide malloc and free, just like libcore.
> probably much of the safety
Only if you want to make life hard for yourself. The standard library is not at all special in its ability to create safe abstractions for `unsafe` code, and doing this makes things so much smoother: you don't have to scatter `unsafe` all over your code, and you let the compiler help you as much as possible. Also, I think there's a lot of nice stuff in `core` too; even string formatting works.
You can still implement equally safe version of them tuned for your environment or, more likely (as the ecosystem develops), use a crate that someone else has written that does it for you. Here's the start of an implementation of a version of `Box` that returns `Result` instead of crashing on allocation failure:
    extern crate core;

    use core::{mem, ops, ptr};

    // The environment provides the raw allocator; libcore assumes nothing about it.
    extern {
        fn malloc(size: usize) -> *const ();
        fn free(p: *const ());
    }

    pub struct MyBox<T> {
        p: *const T,
    }

    impl<T> MyBox<T> {
        // Returns Err(()) instead of aborting when the allocation fails.
        fn new(x: T) -> Result<MyBox<T>, ()> {
            unsafe {
                let p = malloc(mem::size_of::<T>()) as *const T;
                if p.is_null() {
                    Err(())
                } else {
                    ptr::write(p as *mut T, x);
                    Ok(MyBox { p: p })
                }
            }
        }
    }

    impl<T> Drop for MyBox<T> {
        fn drop(&mut self) {
            unsafe {
                // Run the destructor of the boxed value, then release the memory.
                drop(ptr::read(self.p));
                free(self.p as *const _);
            }
        }
    }

    // allow `*` to work
    impl<T> ops::Deref for MyBox<T> {
        type Target = T;
        fn deref(&self) -> &T { unsafe { &*self.p } }
    }
This is a safe interface, and is a simplified version of how the built-in box is defined in `std` (well, `alloc`). I've completed the rest of the core functionality `Box` offers, including a demonstration at the bottom: https://play.rust-lang.org/?gist=77f2b15bcc4273846140&versio...
The whole of Rust's standard library (including `Box`, all the containers, all of IO) is built via this process: wrap the raw `unsafe` functionality into safe interfaces that manage things like checking `malloc`'s return value, bounds checking buffer accesses.
> If you want a container that takes user-defined allocators, I'm not even sure what you'd do.
Do exactly what C++ does: define some interface (a trait, in Rust) that an allocator needs to satisfy, and have the collections take a type parameter implementing that interface. We "already" know, broadly speaking, what this looks like in Rust; a rough sketch follows.
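Purely as illustration, and not the interface that `std` will necessarily end up with (the names `RawAllocator` and `RawVec` are hypothetical), an allocator-parameterised collection might be shaped like this:

// Hypothetical sketch: an allocator is just a trait...
pub unsafe trait RawAllocator {
    /// Try to allocate `size` bytes with the given alignment; null signals failure.
    unsafe fn allocate(&mut self, size: usize, align: usize) -> *mut u8;
    /// Return a block previously handed out by `allocate`.
    unsafe fn deallocate(&mut self, ptr: *mut u8, size: usize, align: usize);
}

// ...and a collection is generic over which allocator it uses.
pub struct RawVec<T, A: RawAllocator> {
    ptr: *mut T,
    cap: usize,
    len: usize,
    alloc: A,
}

An embedded-focused crate could then plug in, say, a bump allocator backed by a static buffer, with no dependency on `malloc` at all.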
The hardest part, and the sticking point for including this in `std`/`core`, is working out a good interface: it will live forever, it needs to cover as much of the problem space as possible, and it has to accommodate possible future expansions to the language and library. This isn't such a problem for a container library outside `std`, which is much more flexible because it can be versioned independently, at whatever pace is necessary.
An out-of-tree version aimed at a single use-case (e.g. embedded development) is likely even easier, since the problem space is smaller.
But doesn't that seem kind of redundant? Box and std containers are great as is, except for the one line that aborts on OOM. Would it be possible to use traits to choose OOM behavior at compile time?
To be clear, this isn't just important for embedded. A production database or web server should be able to handle an allocation failure without blowing up the whole process.
And I'm glad there's some talk about a standard allocator trait. I'll look for the relevant issue.
There could definitely be an alternate method on `Box` that returns a `Result`, but I don't think this is as feasible for containers, which allocate in many places and so would require duplicating most of the API; however, there may be some tricks that work.
Right, duplicating every method that allocates also isn't very elegant.
But could Box and containers be polymorphic on an OOM handling trait?
Alternatively, one could imagine an OOM-safe container that returns results, and a convenience wrapper class that unwraps them. That might harm efficient code generation though.
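As a minimal sketch of that "fallible core plus unwrapping wrapper" idea (the names `FallibleVec` and `AbortingVec` are made up, and today's `Vec::try_reserve`, which postdates this discussion, stands in for a genuinely fallible container):

use std::collections::TryReserveError;

// The "real" container: allocation failure is surfaced as a Result.
struct FallibleVec<T> {
    inner: Vec<T>,
}

impl<T> FallibleVec<T> {
    fn new() -> Self {
        FallibleVec { inner: Vec::new() }
    }
    fn try_push(&mut self, x: T) -> Result<(), TryReserveError> {
        self.inner.try_reserve(1)?; // may fail, but never aborts
        self.inner.push(x);         // guaranteed not to reallocate now
        Ok(())
    }
}

// The convenience wrapper: restores the infallible API by unwrapping.
struct AbortingVec<T>(FallibleVec<T>);

impl<T> AbortingVec<T> {
    fn new() -> Self {
        AbortingVec(FallibleVec::new())
    }
    fn push(&mut self, x: T) {
        self.0.try_push(x).expect("out of memory");
    }
}

fn main() {
    let mut v = AbortingVec::new();
    v.push(1u32);
}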
The docs (which are generally very good, especially for a language this young) should:
1. Distinguish between abort and panic.
2. Explain that allocation failure is an abort.
3. Explain that a vec can reserve up to roughly twice its nominal size, which is a common cause of OOMs. To their credit, the docs do explain that vec over-allocates, but not that its capacity can be double what you actually stored. They could also explain that you don't get that memory back unless you ask for it, even if you pop() or remove() elements (see the sketch after this list).
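A small demonstration of both points; the exact capacities depend on `Vec`'s growth strategy, which isn't guaranteed, so the doubling in the comments is just the common behaviour:

fn main() {
    let mut v = Vec::new();
    for i in 0..1025u32 {
        v.push(i);
    }
    // Pushing one element past a power of two typically doubles the capacity,
    // so len is 1025 but the capacity is around 2048.
    println!("len = {}, capacity = {}", v.len(), v.capacity());

    v.truncate(16);
    // Removing elements does not return the memory...
    println!("len = {}, capacity = {}", v.len(), v.capacity());

    v.shrink_to_fit();
    // ...until you explicitly ask for it.
    println!("len = {}, capacity = {}", v.len(), v.capacity());
}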
Scenario:
I'm a new systems programmer excited to try Rust (fast and safe? wow!). I have 8GB of memory in my computer. I push ~4GB of data into a vector so I can process it, something I'm used to doing in Python.
POW! My program exits with no error message. What happened? I look at the error handling docs:
It would be much nicer to use coroutines instead of state machines. This means that you could write code as if it were doing i/o in a "blocking" fashion, but behind the scenes it is actually asynchronous.
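For a concrete picture of that style, here is roughly what it looks like in present-day async Rust, using the tokio crate as one possible runtime (none of this existed at the time of this comment; it's included only to illustrate the "looks blocking, runs asynchronously" idea):

use tokio::io::{AsyncReadExt, AsyncWriteExt};

// An echo handler written as if it blocked; the task actually yields at each .await.
async fn handle(mut socket: tokio::net::TcpStream) -> std::io::Result<()> {
    let mut buf = [0u8; 1024];
    let n = socket.read(&mut buf).await?;   // "blocking" read, asynchronous underneath
    socket.write_all(&buf[..n]).await?;     // same for the write
    Ok(())
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = tokio::net::TcpListener::bind("127.0.0.1:8080").await?;
    loop {
        let (socket, _) = listener.accept().await?;
        tokio::spawn(handle(socket)); // one lightweight task per connection
    }
}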
I think a lot of folks think of network I/O when they say asynchronous I/O. That's only half the story: unless you're just building proxies and caches, you have to deal with disk I/O at some point. And async disk I/O is horrible in every OS and language.
Well, apparently not on Windows. But this is a kernel interface issue, not a language issue, and it clearly needs fixing now that we have SSDs, where having thousands of outstanding requests is useful, if not required, for performance (unless the hardware APIs change; if drives get memory interfaces, maybe that does happen).
> But this is a kernel interface issue, not a language issue,
Absolutely agree! The thing that surprises me most about Rust and Go in particular is that they're trying to solve an OS problem with new language abstractions. You don't need a new language -- what you really want is thread-agnostic I/O primitives, thread-agnostic "threadpool" primitives, and completion-oriented APIs that provide both synchronous and asynchronous primitives. Basically, you want what Windows has.
The async story on Windows is better, but not great. In many cases the async calls on Windows silently block, due to a number of special cases (that are not so special).
Those are all entirely reasonable things to block for if you understand the mechanics behind why those calls may block. And none of it should be fast-path stuff; it's easy to architect a solution where you avoid those things except in corner cases.
The other thing you're missing is that when you architect around Windows completion ports and threadpool I/O facilities, it doesn't matter if one thread blocks every so often. Windows will detect this and schedule another one to run, such that there is one active thread scheduled on every core.
I/O completion ports facilitate thread-agnostic I/O completion; the thread that blocked because it was extending an encrypted file won't impede the latency of other clients because there is no thread-specific association.
AFAIK, the kernel APIs only deal with abstract handles and descriptors - they aren't written specifically for disk IO or network IO. And C#'s async/await and BeginXXX also work fine with all kinds of IO.
On Unix, the APIs for async disk I/O are completely different from the APIs for async network I/O. This is because, on Unix, the APIs for network I/O are "readiness"-based (they tell you when data is available in the read buffer or space is available in the write buffer), but a disk file handle is always ready (except at EOF). So, for disk I/O you need a "completion" model, where you queue an operation and then get a callback when it is done. This is a very different kind of API, and as a result it does not fit well with the network I/O APIs, but of course if you are doing async you probably need to do both disk and network at the same time, so you need the APIs to play nice with each other (you need a single event loop that waits for both kinds of events).
(On Windows, as I understand it, you can do completion-based I/O on all handle types.)
And then, even if you manage to use async disk I/O, you can only really use it for reads and writes. Filesystem calls that manipulate the directory tree usually don't have async versions. At some point you have to give up and do things in threads.
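To make the difference in shape concrete, here's a hypothetical sketch of the two styles as Rust traits (the names are invented; neither corresponds to a real library's API):

// Readiness model (epoll/kqueue style): the OS tells you a handle is ready,
// then you perform a non-blocking read yourself.
trait ReadinessSource {
    fn wait_until_readable(&mut self);
    fn read_now(&mut self, buf: &mut [u8]) -> usize;
}

// Completion model (POSIX AIO / IOCP style): you hand over a buffer up front,
// and the OS tells you later that the operation finished.
trait CompletionSource {
    fn submit_read(&mut self, buf: Vec<u8>) -> u64;               // returns an operation id
    fn wait_for_completion(&mut self) -> (u64, Vec<u8>, usize);    // (id, buffer back, bytes read)
}

// A single event loop serving both disk and network needs the second shape, or an
// adapter that makes readiness-based sources look completion-based.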