8-byte mallocs are expensive because most mallocs have per-allocation memory overhead to track things. This is exactly why you may want to use a different allocator.
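To illustrate (a hypothetical layout - glibc, jemalloc, etc. each do this differently):

    #include <stddef.h>

    /* Many general-purpose mallocs keep a bookkeeping word adjacent to
       each chunk and round the total up to an alignment quantum: */
    struct chunk_header {
        size_t size_and_flags;   /* hypothetical per-chunk metadata */
    };

    /* So an 8-byte request may really occupy
       [ header | 8-byte payload | alignment padding ]
       i.e. two times or more what was asked for. */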
IOW, your angle is that instead of finding a solution to the problem, you choose a different problem. You don't always have that luxury.
My background on this problem is compilers. Compilers allocate lots of little structures that represent tree nodes, values, tokens, etc. Forcing them all to be a minimum of 32 or 64 bytes in size just so that malloc looks justified for them would be more than a little bizarre. Arena allocation - both per module (for structures that need to persist for the whole compilation) and stack based (for structures that are discarded after e.g. evaluation or codegen) - makes far more sense than contorting the problem so that malloc makes sense.
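To make that concrete, here's a minimal bump-pointer arena sketch in C (names and the fixed capacity are illustrative, not from any particular compiler):

    #include <stdlib.h>
    #include <stddef.h>

    struct arena {
        char  *base;   /* one big block obtained up front */
        size_t used;
        size_t cap;
    };

    int arena_init(struct arena *a, size_t cap)
    {
        a->base = malloc(cap);
        a->used = 0;
        a->cap  = cap;
        return a->base != NULL;
    }

    void *arena_alloc(struct arena *a, size_t n)
    {
        n = (n + 7) & ~(size_t)7;     /* keep everything 8-byte aligned */
        if (a->used + n > a->cap)
            return NULL;              /* a real arena would grow or chain blocks */
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    void arena_free_all(struct arena *a)
    {
        free(a->base);                /* one free() for the whole module */
        a->base = NULL;
    }

Allocation is a pointer bump and teardown is a single free at end of compilation - no per-node bookkeeping at all.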
> 8-byte mallocs are expensive because most mallocs have per-allocation memory overhead to track things. This is exactly why you may want to use a different allocator.
I guess if the objective of the article is to point out the obvious fact that mallocing 8 bytes at a time is a bad idea, then showing some actual numbers from actual malloc implementations is a good idea. However, even small objects in practical problems are usually bigger than 8 bytes, so making the allocation size a bit bigger would give more realistic figures.
Overall I think the article was informative and well written, but a more realistic test case would better point out when to write a custom allocator and which allocator to choose for a particular usage pattern.
The objective of the article wasn't to stress the specific case of the 8-byte allocation pattern; it was about showing that malloc behavior depends on the context. The size of 8 bytes was chosen because it was small enough to allow a large number of allocations to be performed, so that the timings in the results were quite accurate. However, the main goal was to show the difference between what are called the "contended" and the "uncontended" cases: your program may perform properly with a single-threaded workload, but poorly in a multi-threaded environment, not because of explicit locking in your code but because some resource sharing is hidden behind the allocator.
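For anyone who wants to reproduce the effect, a minimal sketch of that kind of benchmark (thread and iteration counts are arbitrary, and this is not the article's actual harness):

    #include <pthread.h>
    #include <stdlib.h>

    #define NTHREADS 4        /* set to 1 for the uncontended case */
    #define ITERS    1000000

    static void * volatile sink;   /* keeps the compiler from eliding the pair */

    /* Each thread hammers the allocator with small alloc/free pairs. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++) {
            void *p = malloc(8);
            sink = p;
            free(p);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Time it once with NTHREADS set to 1 and once with several threads; the gap between the two runs is the contention hidden inside the allocator.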
Also, the lots-of-allocations + lots-of-deallocations pattern was chosen because this is an issue we ran into quite recently: we allocated a huge tree structure progressively, and sometimes we flushed it to disk. The flush was quite efficient, but the deallocation blocked the program for approximately 30s. As a quick fix, we moved the deallocation into the background, but this slowed the tree construction down by approximately 50% because of contention on the allocator. Even if the chunk sizes in the article were not realistic, the results were close enough to reality to be worth publishing.
I provided (in a separate comment) the results of the benchmarks with a 32-byte payload (in that benchmark, all the allocators had the same 12% overhead in terms of memory, but we clearly see the same performance pattern as with the 8-byte payload).
I kinda see it more as "we're using an unrealistic workload to emphasize the per-malloc overhead". A program that was written to minimize and batch allocations wouldn't be as demonstrative.
Contra "exDM69", there's nothing unreasonable or "unrealistic" about allocating 8 bytes at a time; it's in fact extremely convenient to be able to do that. I think it's telling that someone would call that workload unrealistic; it indicates to me that they've never even really considered alternatives to malloc.
It's a little like those C programmers who try to use 256-byte static arrays for all their strings because they don't quite grok malloc/realloc.
Well, personally, if I'm using C, it's typically for a very narrowly defined, performance-sensitive problem. Both times I've done this in the last couple years, I found myself doing a couple big mallocs at the start and then running some tight loops over that memory with no further mallocs -- for use cases with more allocation, I'll just use whatever other language is more convenient, preferably one with a GC.
But if I were writing a bigger application entirely in C/C++, I could totally see lots of small allocations happening at different points (and getting into wackiness with custom pooled allocators and auto_ptr).
That's essentially what I do in much of my C code: implement pool allocators. And that's essentially what the post we're commenting on is talking about.
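For reference, the shape of the thing is tiny - a minimal fixed-size pool sketch (slot size and names are illustrative), with the free list threaded through the unused slots so there's no per-allocation overhead at all:

    #include <stdlib.h>
    #include <stddef.h>

    union slot {
        union slot *next;  /* while free: link in the free list */
        char bytes[8];     /* while allocated: the payload itself */
    };

    struct pool {
        union slot *slots;
        union slot *free_list;
    };

    int pool_init(struct pool *p, size_t nslots)
    {
        if (nslots == 0) return 0;
        p->slots = malloc(nslots * sizeof *p->slots);
        if (!p->slots) return 0;
        for (size_t i = 0; i + 1 < nslots; i++)   /* chain all slots together */
            p->slots[i].next = &p->slots[i + 1];
        p->slots[nslots - 1].next = NULL;
        p->free_list = p->slots;
        return 1;
    }

    void *pool_alloc(struct pool *p)
    {
        union slot *s = p->free_list;             /* pop the head, O(1) */
        if (s) p->free_list = s->next;
        return s;
    }

    void pool_free(struct pool *p, void *q)
    {
        union slot *s = q;                        /* push back on the head */
        s->next = p->free_list;
        p->free_list = s;
    }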
Incidentally, a case for 8-byte allocations that is actually common in real-world C code: linked list nodes on a 32-bit platform (4-byte nextptr, 4-byte dataptr).
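That is:

    /* On a 32-bit target both pointers are 4 bytes: */
    struct node {
        struct node *next;   /* 4 bytes */
        void        *data;   /* 4 bytes */
    };                       /* sizeof(struct node) == 8 */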
Hmm, are you sure that this (linked lists with small elements) is a good example to be telling people about?
Linked lists are bad because they have big overhead (8 bytes for every element in your case, which dominates by a large margin if, for example, you store an int in each node) and they are really bad for CPU caches (they have very little spatial locality), thus slowing down your code.
I much prefer the "growing array" approach.
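By that I mean the usual realloc-doubling scheme (a sketch; names are made up):

    #include <stdlib.h>
    #include <stddef.h>

    struct int_vec {
        int   *data;   /* one contiguous, cache-friendly block */
        size_t len;
        size_t cap;
    };

    int vec_push(struct int_vec *v, int x)
    {
        if (v->len == v->cap) {
            size_t ncap = v->cap ? v->cap * 2 : 16;   /* amortized O(1) growth */
            int *nd = realloc(v->data, ncap * sizeof *nd);
            if (!nd) return 0;
            v->data = nd;
            v->cap  = ncap;
        }
        v->data[v->len++] = x;
        return 1;
    }

Start from a zero-initialized struct (struct int_vec v = {0};) and just push: no per-element pointers, and iteration walks memory linearly, which is exactly what the cache wants.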
Yeah, but they have some unique characteristics related to inserting/deleting from the middle or beginning of the list. They're actually used in a few places in the linux kernel, for performance reasons, in spite of cache locality issues.
That is similar to the signal processing application I worked on a couple years ago. It was a real-time application with fixed sizes at each point in the algorithm, so we knew maximum memory usage at initialization. Thus, we allocated all memory up-front and continuously reused the buffers until we were shut down. Under the operational architecture we were moving to when I left, our application never actually called malloc at all; the control program that ran the signal processor and started us malloced massive segments of memory on each machine and gave them to us, which we then "allocated" to our own buffers.
> 8-byte mallocs are expensive because most mallocs have per-allocation memory overhead to track things
As you say, most mallocs have this overhead. But with a bit of alignment trickery and bit masking you can put the header storage at the start of a page, bringing per-allocation overhead down to just a few bits.
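Something like this (assuming pages are PAGE_SIZE-aligned - e.g. obtained via posix_memalign or mmap - and that no allocation spans a page; the header layout is made up):

    #include <stdint.h>

    #define PAGE_SIZE 4096u          /* assumed page/block size */

    struct page_header {
        uint32_t free_count;         /* whatever per-page metadata you keep */
        uint32_t slot_bitmap[16];    /* one used-bit per 8-byte slot */
    };

    /* Recover the page's header from any object pointer inside it
       by masking off the low bits of the address. */
    struct page_header *header_of(void *obj)
    {
        return (struct page_header *)
            ((uintptr_t)obj & ~(uintptr_t)(PAGE_SIZE - 1));
    }

The per-object cost is then a bit in the page's bitmap rather than a header word next to every allocation.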