The nice thing about virtual memory is that it's, well, virtual. It costs you almost nothing until you've touched it. (Fun exercise for the reader: measure the kernel overhead for an unused 1 TiB VMA.) But creating huge address spaces--that terabyte mmap wasn't theoretical--and leaving them untouched is algorithmically very useful, especially for things like malloc implementations.
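If you actually want to do that exercise, here's a minimal sketch (assuming Linux on a 64-bit machine; MAP_NORESERVE sidesteps the overcommit accounting):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1ULL << 40;                 /* 1 TiB */
        /* PROT_NONE + MAP_NORESERVE: reserve address space only.
           Nothing is faulted in and no commit charge is taken. */
        void *p = mmap(NULL, len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        char cmd[64];
        snprintf(cmd, sizeof cmd,
                 "grep -E 'VmSize|VmRSS' /proc/%d/status", (int)getpid());
        system(cmd);    /* VmSize jumps by ~1 TiB; VmRSS stays tiny */
        return 0;
    }

The kernel-side cost is roughly one vm_area_struct, which is what makes the untouched-terabyte trick essentially free.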
Why does it bother people? Two reasons. First is mlock to avoid swap. This is solvable in much better ways--I'm a fan of disabling swap in many cases anyway. Second is that, absent cgroups, it's difficult to put hard limits on memory usage in Linux. So people, looking under the streetlight, put limits on virtual usage, even though that's not what they care about limiting! Then they get angry when you break it. My refrain here, as in many cases (see for example measuring process CPU time spent in kernel mode): "X is impossible" doesn't justify Y unless Y correctly solves the problem X does.
(I spent years in charge of a major memory allocator so this is a battle I've fought too many times.)
Thirdly, the kernel will kill you if your overcommit ratio is too high. I had this argument with the Go folks several years ago (when Docker would crash after starting 1000 containers because the Go runtime had allocated 8 GB of virtual memory while only tens of MB were in use, and the kernel freaked out).
You're right that it doesn't cost anything, other than the risk that a process can cripple your machine using its overcommitted memory mapping. And so the kernel has protections against this, which should deter language runtime developers from doing this.
And let's not forget that MADV_DONTNEED is both incorrectly implemented on Linux and ridiculously expensive compared to freeing memory and reallocating it when you need it. Bryan Cantrill ranted about this for a solid half an hour in a podcast a year or two ago.
So… does that mean the Linux kernel will blow a gasket if I mmap actually-large files to play with them but have almost no resident memory? That doesn't seem reasonable.
> What do you mean by “free” memory? Actually unmap it?
Sorry, I didn't phrase it well. MADV_DONTNEED is significantly more expensive than most ways that memory allocators would "free" memory. This includes just zeroing it out in userspace when necessary (so no need for a TLB modification), or simply unmapping it and remapping it when needed.
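To make the comparison concrete, the three "free" strategies look roughly like this (a sketch, not a benchmark):

    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Three ways an allocator might "free" a region p of len bytes. */

    void free_by_dontneed(void *p, size_t len) {
        /* Synchronous on Linux: page tables are torn down and the TLB
           flushed now; the next touch takes a fresh zero-fill fault. */
        madvise(p, len, MADV_DONTNEED);
    }

    void free_by_zeroing(void *p, size_t len) {
        /* No syscall, no TLB modification; the pages stay resident and
           get handed back out of a userspace pool later. */
        memset(p, 0, len);
    }

    void *free_by_remapping(void *p, size_t len) {
        /* Hands the memory back entirely; remap when needed again. */
        munmap(p, len);
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }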
> Also, I assume the crippling you’re talking about here is just the ability to rapidly apply memory pressure?
Right, and if the memory is overcommitted then you can cause an OOM very trivially, because you already have more mapped pages than there is physical memory -- writing a byte into each page will cause intense memory pressure. Now, this doesn't mean it would kernel-panic the machine, it just means it would cause issues (the OOM killer would figure out which process is the culprit fairly easily).
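Spelled out (don't run this on a box you care about; the 64 GiB figure is arbitrary, just pick something larger than RAM plus swap):

    #include <unistd.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64ULL << 30;   /* 64 GiB: more than RAM + swap */
        unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                                -1, 0);
        if (p == MAP_FAILED) return 1;
        long pagesz = sysconf(_SC_PAGESIZE);
        /* Each write faults in one physical page. Once the faulted-in
           total exceeds RAM + swap, the OOM killer picks a victim. */
        for (size_t i = 0; i < len; i += (size_t)pagesz)
            p[i] = 1;
        return 0;
    }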
This is why vm.overcommit_ratio exists (which is what I was talking about when it comes to killing a machine) -- though I just figured out that not all Linux machines ship with vm.overcommit_memory=2 (I'm pretty sure SUSE and maybe some other distros do ship it, because this is definitely an issue we've hit for several years...).
There's also RLIMIT_AS, which applies regardless of overcommit_memory.
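Which is the streetlight problem from upthread in miniature: an RLIMIT_AS cap makes the reservation itself fail, not the eventual use. A sketch, assuming Linux:

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void) {
        /* Cap this process's address space at 1 GiB. */
        struct rlimit rl = { .rlim_cur = 1UL << 30, .rlim_max = 1UL << 30 };
        if (setrlimit(RLIMIT_AS, &rl) != 0) { perror("setrlimit"); return 1; }

        /* A 2 GiB reservation now fails with ENOMEM, even though an
           untouched PROT_NONE mapping costs almost no physical memory. */
        void *p = mmap(NULL, 2ULL << 30, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) perror("mmap");    /* expected: ENOMEM */
        return 0;
    }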
Right. I’m very familiar with all these mechanisms, I guess I just don’t agree that the ability to cause an OOM, particularly if applications are isolated in cgroups appropriately, is a big deal. On balance, not allowing applications to use virtual memory for useful things (such as the Go case of future heap reservation) or underutilizing physical memory seems worse.
As an aside, it seems like an apples and oranges comparison to compare “freeing” by zeroing (which doesn’t free at all) to MADV_DONTNEED. I’m also pretty sure that munmap will be much slower than MADV_DONTNEED, or at least way less scalable, given that it needs to acquire a write lock on mmap_sem, which tends to be a bottleneck. It does seem like there’s a lot of opportunity for a better interface than MADV_DONTNEED though (e.g. something asynchronous, so you can batch the TLB flush and avoid the synchronous kernel transition).
> particularly if applications are isolated in cgroups appropriately
Once the cgroup OOM bugs get fixed, amirite? :P
> It does seem like there’s a lot of opportunity for a better interface than MADV_DONTNEED though (e.g. something asynchronous, so you can batch the TLB flush and avoid the synchronous kernel transition).
The original MADV_DONTNEED interface, as implemented on Solaris, FreeBSD, and basically every other Unix-like, does exactly this -- it tells the operating system that it is free to free the memory whenever it likes. Linux is the only modern operating system with the "FREE THIS RIGHT NOW" interface (and it's arguably a bug or a misunderstanding of the semantics -- or it was copied from some really fruity Unix flavour).
In fact, when jemalloc was ported to Solaris it would crash, because it had been written against Linux's incorrect MADV_DONTNEED and assumed the pages would always come back zeroed -- which is not the case outside Linux.
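That assumption is easy to demonstrate. On Linux this prints 0; under lazy (advisory) semantics it may print 42, because the kernel is merely permitted to reclaim the page later:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        unsigned char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;
        p[0] = 42;
        madvise(p, 4096, MADV_DONTNEED);
        /* Linux drops the page immediately, so this access faults in
           a fresh zero page. BSD/Solaris treat the hint as advisory,
           so the old contents may still be there. */
        printf("%d\n", p[0]);
        return 0;
    }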
> As an aside, it seems like an apples and oranges comparison to compare “freeing” by zeroing (which doesn’t free at all) to MADV_DONTNEED. [...] I’m also pretty sure that munmap will be much slower than MADV_DONTNEED.
This is fair, I was sort of alluding to writing a memory allocator where you would prefer to have a memory pool rather than constantly doing MADV_DONTNEED (which is sort of what Go does -- or at least used to do). If you're using a memory pool, then zeroing out the memory on "allocation" in userspace is probably quite a bit cheaper than MADV_DONTNEED.
But you're right that it's not really an apt comparison -- I was pointing out that there are better memory management setups than just spamming MADV_DONTNEED.
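A toy version of that pool, just to make the trade concrete (the names are made up; a real allocator tracks dirty runs rather than a one-slot free list):

    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    #define CHUNK (2UL << 20)   /* hypothetical 2 MiB pool chunks */

    struct pool { void *free_chunk; };   /* toy one-slot free list */

    void pool_free(struct pool *pl, void *p) {
        /* No madvise: just park the chunk for reuse. */
        pl->free_chunk = p;
    }

    void *pool_alloc(struct pool *pl) {
        if (pl->free_chunk) {
            void *p = pl->free_chunk;
            pl->free_chunk = NULL;
            /* Zero on reuse, in userspace: no syscall, no TLB churn. */
            memset(p, 0, CHUNK);
            return p;
        }
        /* Pool empty: fresh (already-zeroed) pages from the kernel. */
        void *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }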
The thing is, people want a way to measure and control the amount of memory that a process uses or is likely to use. Resident memory is one way to measure actually-used memory, but per man 3 vlimit, RLIMIT_RSS is only effective on Linux 2.4.x, x < 30, which nobody in their right mind is still running. So we have RLIMIT_AS, which limits virtual memory, or we have the default policy of hoping the OOM killer kills the right thing when you run out of RAM.
That you have to keep fighting this battle is an indication that people's needs (or desires) aren't being well met.
There's a third reason: trying to allocate too much virtual memory on machines with limited physical memory will fail on Linux with the default setting vm.overcommit_memory=0. See for instance https://bugs.chromium.org/p/webm/issues/detail?id=78
Great points. A third reason is core files: That 1 TB of unused virtual memory will be written out to the core file, which will take forever and/or run out of disk. This is part of the problem of running with the address sanitizer: you don't get core files on crashing, because they'd be too big.
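For what it's worth, Linux (3.4+) has a knob for exactly this, though allocators have to remember to use it:

    #include <sys/mman.h>

    int main(void) {
        size_t len = 1ULL << 40;    /* the 1 TB reservation again */
        void *p = mmap(NULL, len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) return 1;
        /* MADV_DONTDUMP: exclude the range from core dumps, so an
           untouched reservation doesn't balloon the core file. */
        madvise(p, len, MADV_DONTDUMP);
        return 0;
    }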
Not sure whether anyone is writing iOS apps in Go, but iOS refuses to allocate more than a relatively small amount of address space to each process (a few gigs, even on 64-bit devices).
I used to think this. Then I deployed on Windows. Committed virtual memory can't exceed total physical memory plus the pagefile, or else malloc() will fail. I am currently having an issue where memory is "allocated" but not used, causing software crashes; actual used memory is 60% of that.
The page file in Windows can grow and the max size, I believe, is 3 times the amount of physical memory in the machine. So, if you're trying to commit more than [Physical Memory x 4] bytes, then yes, it will fail. But, more than likely, you'll get malloc failures long before that due to address space fragmentation (unless you're doing one huge chunk).
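The reserve/commit split is explicit in the API, for what it's worth; a sketch (Windows-only, and the 1 TiB number is just for illustration):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        SIZE_T len = (SIZE_T)1 << 40;   /* 1 TiB of address space */
        /* MEM_RESERVE consumes address space only: it does not count
           against the commit limit (physical memory + pagefile). */
        void *p = VirtualAlloc(NULL, len, MEM_RESERVE, PAGE_NOACCESS);
        if (!p) { printf("reserve failed: %lu\n", GetLastError()); return 1; }
        /* MEM_COMMIT is what charges the pagefile; allocations fail
           when system-wide commit would exceed that limit. */
        void *q = VirtualAlloc(p, 1 << 20, MEM_COMMIT, PAGE_READWRITE);
        if (!q) printf("commit failed: %lu\n", GetLastError());
        return 0;
    }

malloc() commits, which is why it can hit the limit even when most of the "allocated" memory is never touched.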