> I fully expect AMD to release 128-core CPUs once they're on that process. A quad-socket server with those will have 512 cores or 1,024 threads. Completely lock-free algorithms will be the only way to scale up to just one server at that scale!
How often will we need that scale in a single address space?
It's certainly desirable for hypervisors/kernels, database servers, perhaps L4 load balancers and reverse HTTP proxies. Those aren't nothing, but far more people work on web application servers than on all of them put together.
Web application servers are often written to be stateless (with the possible exception of caches) so they can scale to multiple machines. That's important for high-reliability sites even if they aren't large enough to fully saturate a single huge machine like that. As long as you need to load-balance between machines, it's not a big problem to also run multiple instances per machine. If the application scales well to 32 cores, run 16 of them per 512-core machine. Seems a lot easier than going to extraordinary lengths to make one address space scale...
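A minimal sketch of that "many instances per machine" approach, assuming a Linux host and a hypothetical `./app-server` binary: fork one worker per CPU slice and pin each with `sched_setaffinity` so the 16 instances don't fight over the same cores. The 16 × 32 split is just the example numbers from above.

```c
// Sketch: launch NWORKERS copies of a server, each pinned to its own
// CORES_PER_WORKER-wide CPU slice (16 x 32 = 512 cores, as above).
// Linux-specific; APP_PATH is a placeholder, not a real binary.
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <sys/wait.h>

#define NWORKERS         16
#define CORES_PER_WORKER 32
#define APP_PATH         "./app-server"   // hypothetical

int main(void) {
    for (int w = 0; w < NWORKERS; w++) {
        pid_t pid = fork();
        if (pid == 0) {
            // Child: restrict ourselves to CPUs [w*32 .. w*32+31],
            // then exec the unmodified, 32-core-scalable server.
            cpu_set_t set;
            CPU_ZERO(&set);
            for (int c = 0; c < CORES_PER_WORKER; c++)
                CPU_SET(w * CORES_PER_WORKER + c, &set);
            sched_setaffinity(0, sizeof(set), &set);
            execl(APP_PATH, APP_PATH, (char *)NULL);
            _exit(1);  // exec failed
        }
    }
    while (wait(NULL) > 0)  // reap workers; a real launcher would restart them
        ;
    return 0;
}
```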
Even if you have 1024 separate processes not sharing much of anything, there are still locks inside the kernel that runs them.
For example, a pair of threads (inside one of those 1024 processes) synchronising with each other will often go through the kernel to do so. On Linux this uses the futex syscall; Windows and others have similar primitives. If the kernel takes a lock that is shared with other processes to do that, even if just for a moment, even if it's hashed on address and memory space, even if it's a spinlock with little contention, that lock causes memory traffic between cores running separate processes.
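To make that concrete, here's a minimal futex sketch: one thread blocks in FUTEX_WAIT on a shared 32-bit word, another flips the word and wakes it with FUTEX_WAKE. glibc has no wrapper for futex, so the raw syscall is used; the kernel hashes the word's address to pick an internal bucket lock, which is exactly the shared state described above.

```c
// Minimal futex wait/wake sketch (Linux only). Error handling elided.
// Compile with: gcc futex.c -pthread
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int flag = 0;  // the 32-bit futex word both threads share

static long futex(atomic_int *uaddr, int op, int val) {
    // The kernel hashes uaddr into a bucket protected by a spinlock --
    // the briefly-held shared lock the comment above is talking about.
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void *waiter(void *arg) {
    (void)arg;
    // Sleep in the kernel for as long as flag is still 0.
    while (atomic_load(&flag) == 0)
        futex(&flag, FUTEX_WAIT, 0);
    printf("waiter: woken, flag=%d\n", atomic_load(&flag));
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    sleep(1);                      // let the waiter block
    atomic_store(&flag, 1);        // publish the state change
    futex(&flag, FUTEX_WAKE, 1);   // wake one waiter
    pthread_join(t, NULL);
    return 0;
}
```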
Same for processes that are reading the same files as other processes, or (for example) running in the same directory when doing path lookups. There's a lot of work done in Linux to keep this scalable (RCU), but it's easy to hit scaling barriers that nobody has tested or designed for yet. Once 1024-core CPUs are common, the kernel will of course be optimised for them.
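The read-side pattern RCU gives the kernel for things like path lookup can be illustrated in userspace with liburcu (the urcu package), assuming it's installed. Readers take no locks and write no shared memory, so they scale across cores; an updater publishes a new version and waits out a grace period before freeing the old one. The `config` struct here is just an invented stand-in for shared read-mostly state.

```c
// Read-mostly sharing via userspace RCU (liburcu).
// Compile with: gcc rcu.c -lurcu -pthread
#include <urcu.h>      // default liburcu flavor
#include <stdlib.h>
#include <stdio.h>

struct config { int max_conns; };

static struct config *current_cfg;  // shared pointer, RCU-protected

// Hot path: can run concurrently on many cores with no lock traffic,
// no atomic read-modify-writes, no cache-line bouncing.
static int reader_get_max_conns(void) {
    rcu_read_lock();
    struct config *c = rcu_dereference(current_cfg);
    int v = c ? c->max_conns : 0;
    rcu_read_unlock();
    return v;
}

// Slow path: swap in a new version, then free the old one only after
// every reader that might still see it has finished.
static void update_max_conns(int max_conns) {
    struct config *fresh = malloc(sizeof(*fresh));
    fresh->max_conns = max_conns;
    struct config *old = rcu_xchg_pointer(&current_cfg, fresh);
    synchronize_rcu();   // wait for a grace period
    free(old);
}

int main(void) {
    rcu_register_thread();   // every RCU-using thread registers itself
    update_max_conns(1024);
    printf("max_conns = %d\n", reader_get_max_conns());
    rcu_unregister_thread();
    return 0;
}
```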
Yes, I included the kernel in my list of things that are desirable to scale well for that reason.
That said, in some cases I don't think it's strictly necessary for even the kernel to scale well as long as you have a hypervisor that does. It's not unusual to deploy software in VMs on a cluster. Having more, smaller VMs per machine is a way to handle poor kernel scalability, just as I suggested for the web application server. VMs are higher-overhead than multiple containers on a single kernel, so this wouldn't be my first choice, but many people use VMs anyway.