It’s possible to write many other bump algorithms, including ones that bump upwards but generate much much better code than OP’s forward upward bump. In particular, no overflow checks are needed if you write it carefully enough.
My favorite is one where I only use subtraction but the direction of allocation is in the positive direction (I subtract a `remaining` counter, and the returned address is `end - remaining`).
But I have seen many others. A few of my colleagues have similarly geeked out on this and come up with splendid bumpers (ggaren wrote a great one in bmalloc, and I remember the MMTk folks put hella thought into theirs).
In most situations, the allocation size is a compile-time constant, and bumping upwards can be done without an overflow check: the OS will never map the region close enough to the top of the address space for `bump + size` to wrap around.
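A minimal sketch of that constant-size fast path (the `Arena` struct and `SIZE` constant are mine, for illustration, not from OP):

```c
#include <stddef.h>

typedef struct { char *bump, *end; } Arena;

/* Fixed allocation size known at compile time. */
enum { SIZE = 32 };

/* Only a capacity check is needed; no wraparound check, because
 * bump + SIZE cannot wrap when the OS maps the arena far below
 * the top of the address space. */
static void *alloc_fixed(Arena *a) {
    if (a->end - a->bump < SIZE)   /* capacity, not overflow */
        return NULL;
    void *p = a->bump;
    a->bump += SIZE;               /* bump upwards */
    return p;
}
```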
I'm surprised that there was no discussion of memory system performance. There are tons of OS-level and hardware prefetcher optimizations for forward-marching pointer references and zero-page allocation.
You can bump upwards without an overflow check even if the size is variable.
I’ve seen multiple ways to do it. And I’ve lost days of my life to coming up with pointlessly amusing variants that get there in weird ways.
Yeah I’m also amused that they didn’t get into the fact that downward bump is just not what the HW expects you to do. You’d need a real benchmark to see that effect. And, it’s an open question whether this would matter unless your allocation rate was obscene (if it isn’t then most scans of memory will be normal loops and not the allocator).
Because stacks aren’t unbounded. You grow and shrink them. Therefore, regardless of which direction a stack grows in, it shrinks in the opposite direction with the same likelihood.
> fact that downward bump is just not what the HW expects you to do.
What fact is this? And what does it have to do with the stack being bounded? (The bounds of the stack are unknown to microarchitectural state; they’re only stored in OS structures.)
HW prefetchers support access patterns in either direction, and so do load/store instructions (stm(fd)/ldm(fd) on 32-bit ARM; ldp/stp on AArch64).
The fact that stacks are bounded means you both push and pop. Just as often as a function calls another one (moving memory ops to lower addresses), functions return (moving memory ops back to higher addresses). Pushes and pops are balanced, and the reason we know they’re balanced is that the stack is bounded.
So - the direction of stack growth isn’t interesting to prefetching strategy.
Phil style: track `end` and `remaining`. `end` doesn’t change. To allocate `size`, check that it’s ≤ `remaining`, then subtract it from `remaining`. `end` minus the old value of `remaining` is the result. Has two variants - one that burns an extra register but saves an instruction. Empirically, this one is fast enough that I just use it all the time.
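A sketch of that scheme in C (struct and function names are mine, not Phil’s):

```c
#include <stddef.h>

typedef struct {
    char  *end;        /* one past the arena; never changes */
    size_t remaining;  /* bytes still free */
} PhilArena;

static void *phil_alloc(PhilArena *a, size_t size) {
    size_t rem = a->remaining;
    if (size > rem)              /* size <= remaining? */
        return NULL;
    a->remaining = rem - size;   /* the only mutation: a subtract */
    return a->end - rem;         /* end minus the OLD remaining */
}
```

Note that allocation still proceeds in the positive direction: as `remaining` shrinks, `end - remaining` climbs upward.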
Bmalloc style bump: independently subtract from `remaining` and add to `bump`. It’s more instructions, but they’re instruction-level parallel.
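Roughly like this (a sketch in my own naming, not bmalloc’s actual code):

```c
#include <stddef.h>

typedef struct {
    char  *bump;       /* next allocation address */
    size_t remaining;  /* bytes still free */
} BumpArena;

static void *bmalloc_style_alloc(BumpArena *a, size_t size) {
    if (size > a->remaining)
        return NULL;
    char *result = a->bump;
    a->remaining -= size;  /* these two updates have no data  */
    a->bump      += size;  /* dependency on each other, so an */
    return result;         /* OoO core can issue them in parallel */
}
```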
Another friend’s bump style: have bump/end ptrs like usual. Fast path checks if size < end - bump. Then you add to bump like usual.
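Something like this sketch (names are mine; I’ve written the check as `size <= end - bump` so the arena can be filled exactly):

```c
#include <stddef.h>

typedef struct { char *bump, *end; } PtrArena;

static void *ptr_pair_alloc(PtrArena *a, size_t size) {
    /* end >= bump is an arena invariant, so end - bump is
     * non-negative and no overflow check on bump + size is
     * needed, even for variable sizes. */
    if ((size_t)(a->end - a->bump) < size)
        return NULL;
    char *result = a->bump;
    a->bump += size;       /* add to bump like usual */
    return result;
}
```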
Note that instruction counts on these don’t matter so much on modern CPUs. But dependency chain length does matter, a lot. Register usage matters only if you want to inline the allocator (and it’s not always beneficial to do it).