I was always disappointed by the performance of fork()/clone().
CompSci class told me it was a very cheap operation, because all the actual memory is copy-on-write, so it's a great way to do all kinds of things.
But the reality is that duplicating huge page tables and hundreds of file handles is very slow: tens of milliseconds for a big process.
And then the process runs slowly for a long time afterwards, because every write to a shared page faults and triggers a page copy.
I think my CompSci class lied to me... it might seem cheap and a neat thing to do, but in reality there are very few use cases where it makes sense.
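For anyone who wants to measure this themselves, here's a rough sketch of the kind of benchmark I mean (the 8 GiB heap size and the timing approach are just my choices, adjust to taste):

```c
/* Time fork() from a process with a large, fully-dirtied heap.
 * Sketch only: sizes are illustrative assumptions, not a standard benchmark. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    size_t bytes = (size_t)8 << 30;      /* 8 GiB; shrink if your box is small */
    char *heap = malloc(bytes);
    if (!heap) { perror("malloc"); return 1; }
    memset(heap, 1, bytes);              /* dirty every page */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pid_t pid = fork();                  /* the operation under test */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (pid == 0) _exit(0);              /* child exits immediately */
    waitpid(pid, NULL, 0);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("fork() took %.2f ms with %zu GiB dirtied\n", ms, bytes >> 30);
    return 0;
}
```

Dirtying every page first matters: it forces the kernel to actually populate the page tables that fork() then has to walk and duplicate.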
CS classes (and, far too often, professional programmers) talk about computers like they're just faster PDP-11s with fundamentally the same performance characteristics.
Agreed that these costs can be larger than is perhaps implied in compsci classes (though it's possible that they've changed their message since I took them!)
I suppose it is still essentially free for some common uses - e.g. if a shell uses `fork()` rather than one of the alternatives it's unlikely to have a very big address space, so it'll still be fast.
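For reference, the alternative I have in mind is posix_spawn(), which skips duplicating the parent's address space entirely. A minimal sketch (error handling abbreviated; the run() helper is just for illustration):

```c
#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

/* Launch argv[0] via PATH and wait for it; NULL file_actions/attrp
 * means the child inherits our stdio and signal dispositions. */
int run(char *const argv[]) {
    pid_t pid;
    int err = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    if (err != 0) { fprintf(stderr, "spawn failed: %d\n", err); return -1; }
    int status;
    waitpid(pid, &status, 0);
    return status;
}

int main(void) {
    char *argv[] = { "echo", "hello from a spawned child", NULL };
    return run(argv);
}
```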
My experience has been that big processes - 100+ GB, which is pretty reasonable these days - really do show human-perceptible latency when forking. At least tens of milliseconds matches my experience (I wouldn't be surprised to see higher). This is really jarring when you're used to thinking of it as cost-free.
The slowdown afterwards, resulting from copy-on-write, is especially noticeable if (for instance) your process has a high memory dirtying rate. Simulators that rapidly write to a large array in memory are a good example here.
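Here's a sketch of that effect, assuming a made-up 4 GiB "simulator state" array: time a write sweep over it before and after fork(), while the child keeps the shared pages alive. The second sweep pays a fault and a page copy per 4 KiB page touched.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

/* Touch one byte per 4 KiB page and report how long it took. */
static double sweep_ms(char *buf, size_t n) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i += 4096)
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void) {
    size_t n = (size_t)4 << 30;              /* illustrative 4 GiB of state */
    char *state = malloc(n);
    if (!state) return 1;
    memset(state, 0, n);                     /* fault everything in once */

    printf("sweep before fork: %.1f ms\n", sweep_ms(state, n));

    pid_t pid = fork();
    if (pid == 0) { sleep(5); _exit(0); }    /* child keeps pages shared */

    /* every write in this sweep now triggers a COW copy */
    printf("sweep after fork:  %.1f ms\n", sweep_ms(state, n));
    waitpid(pid, NULL, 0);
    return 0;
}
```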
When you really need `fork()` semantics this could all still be acceptable - but I think some projects do ban the use of `fork()` within a program to avoid unexpected costs. If you really have a big process that needs to start workers I guess it might be worth having a small daemon specifically for doing that.
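Roughly this shape: fork the helper once at startup, while the address space is still small, then ask it over a pipe to launch workers later. The one-command-per-line protocol here is made up purely for illustration:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int req[2];
    if (pipe(req) != 0) return 1;       /* parent -> helper command pipe */

    if (fork() == 0) {                  /* helper: forked while we're tiny */
        close(req[1]);
        FILE *in = fdopen(req[0], "r");
        char line[1024];
        while (fgets(line, sizeof line, in)) {
            line[strcspn(line, "\n")] = '\0';
            if (fork() == 0) {          /* cheap: the helper stayed small */
                execlp("/bin/sh", "sh", "-c", line, (char *)NULL);
                _exit(127);
            }
            wait(NULL);
        }
        _exit(0);
    }
    close(req[0]);

    /* ... the main process grows to many GB here ... */

    /* starting a worker no longer forks the big process: */
    dprintf(req[1], "echo worker started\n");
    close(req[1]);
    wait(NULL);
    return 0;
}
```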
Right, shells are not threaded and they tend to have small resident set sizes. Even in shells, though, there's no reason not to use vfork(), and if you have a tight loop starting a bunch of child processes you might as well use it. Though, in a shell, you do need fork() in order to trivially implement sub-shells.
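For reference, the safe vfork() pattern: the child borrows the parent's address space (the parent is suspended meanwhile), so the child may only exec or _exit. A minimal sketch:

```c
#include <unistd.h>
#include <sys/wait.h>

int spawn(char *const argv[]) {
    pid_t pid = vfork();
    if (pid == 0) {
        execvp(argv[0], argv);   /* child: exec immediately... */
        _exit(127);              /* ...and _exit() (never exit()) on failure */
    }
    int status = -1;
    if (pid > 0)
        waitpid(pid, &status, 0);
    return status;
}
```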
Also, mandating copy-on-write as an implementation strategy is a huge burden to place on the host. Now you’ve made the amount of memory a process is using unquantifiable.
It's not necessarily unquantifiable -- the kernel can count the not-yet-copied pages pessimistically as allocated memory, triggering OOM allocation failures if the amount of potential memory usage is greater than RAM. IIUC, this is how Linux vm.overcommit_memory[1] mode 2 works, if overcommit_ratio = 100.
However, if an application is written to assume that it can fork a ton and rely on COW to not trigger OOM, it obviously won't work under mode 2.
> 2 - Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable amount (default is 50%) of physical RAM.
> Depending on the amount you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate.
> Useful for applications that want to guarantee their memory allocations will be available in the future without having to initialize every page.
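The practical upside of mode 2 is that the failure shows up at allocation time, where a program can actually react, rather than as an OOM kill later. A sketch (the 1 TiB request is deliberately absurd; enabling strict accounting would be roughly `sysctl vm.overcommit_memory=2 vm.overcommit_ratio=100` as root):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t want = (size_t)1 << 40;       /* 1 TiB, far past any commit limit */
    void *p = malloc(want);
    if (p == NULL) {
        /* with strict accounting this branch is genuinely reachable */
        fprintf(stderr, "commit limit hit, degrading gracefully\n");
        return 1;
    }
    free(p);
    return 0;
}
```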
You're right, "unquantifiable" was the wrong word here. I meant, a program has no real way of predicting/reacting to OOM. I didn't realize mode 2 with overcommit_ratio = 100 behaved that way, thanks for sharing.
Yeah, I think in a practical sense you're right, since AFAIK using mode 2 is fairly rare because most software assumes overcommit, and even if a program is written with an understanding that malloc can return NULL, it's in the sense of one huge allocation failing up front, not of any small allocation being able to fail at any moment.
POSIX doesn't require that fork() be implemented using copy-on-write techniques. An implementation is free to copy all of the parent's writable address space.
If the parent is a JVM, for sure. But a copy-on-write fork() still doesn't perform well. The point isn't to just copy the whole parent. The point is to stop copying at all.
Copy-on-write is supposed to be cheap, but in fact it's not. MMU/TLB manipulations are very slow. Page faults are slow. So the common thing now is to just copy the entire resident set (well, the writable pages in it), and if that is large, that too is slow.
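If you're on Linux and still need fork() itself, one way to literally stop copying a given region is madvise(MADV_DONTFORK), which leaves it out of the child entirely - only safe if the child never touches that memory. A sketch:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

int main(void) {
    size_t n = (size_t)4 << 30;       /* the big region we don't want forked */
    char *big = mmap(NULL, n, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (big == MAP_FAILED) return 1;

    madvise(big, n, MADV_DONTFORK);   /* child's mappings will skip this */

    pid_t pid = fork();               /* `big` is neither copied nor COW-tracked */
    if (pid == 0) {
        /* touching `big` here would fault: it doesn't exist in the child */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}
```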