The thread test is janky. Most multithreaded allocators optimize for the (common) case that objects are freed by the same thread that creates them. When objects are freed by different threads than the ones that allocated them, typically some sort of slow-path is invoked.
Older allocators with per-thread caches used to behave very badly with cross-thread frees, accumulating tons of freed objects in threads that didn't necessarily allocate a lot of objects. Tcmalloc uses a garbage collection process to move those objects back to the central free list.
The test in the article, where one thread does all the allocations and another does all the frees, basically subverts the thread-caching in tcmalloc, and just tests how quickly the garbage collection process can move freed objects from the free()-thread's cache back to the central heap where they can be reused by the malloc()-thread.
I admit that the test stresses some corner cases (at least, cases that the allocator designers consider corner cases). That said, malloc has no choice but to support that use case.
A use case for such a pattern is message-posting with workers: you queue some messages that are later dequeued and processed by a different thread. This is an increasingly common pattern in modern programs. In that pattern the message is allocated in one thread (let's say the main one) and processed, then deallocated, by another thread.
If your implementation of message allocation is malloc-based, then you will stress the exact same code paths the benchmark is stressing.
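A rough sketch of that pattern (all the names here are made up for illustration; compile with -pthread): the main thread mallocs messages and pushes them onto a shared queue, a worker pops, processes and frees them, so every free() is a cross-thread free:

#include <pthread.h>
#include <stdlib.h>

enum { N_MSG = 1000000 };

struct msg {
    struct msg *next;
    int         payload;
};

static struct msg      *queue_head;
static pthread_mutex_t  queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   queue_cond = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int freed = 0; freed < N_MSG; freed++) {
        pthread_mutex_lock(&queue_lock);
        while (!queue_head)
            pthread_cond_wait(&queue_cond, &queue_lock);
        struct msg *m = queue_head;
        queue_head = m->next;
        pthread_mutex_unlock(&queue_lock);

        /* ... process m ... */
        free(m);    /* freed by a thread that did not allocate it */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, worker, NULL);
    for (int i = 0; i < N_MSG; i++) {
        struct msg *m = malloc(sizeof(*m));    /* allocated by the main thread */
        m->payload = i;
        pthread_mutex_lock(&queue_lock);
        m->next = queue_head;
        queue_head = m;
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
    }
    pthread_join(tid, NULL);
    return 0;
}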
You're not wrong that malloc-based message passing causes that load on malloc, but if performance of the message-passing code is important, you'd want to use a ring buffer anyway - cross-CPU or not, malloc is pretty slow.
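For reference, a minimal sketch of what such a ring buffer might look like, assuming a single producer, a single consumer, C11 atomics, and fixed-size message slots (the names and sizes are made up); the point is that no malloc/free happens per message:

#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define RING_SIZE 1024              /* must be a power of two */

struct ring {
    _Atomic unsigned head;           /* only written by the producer */
    _Atomic unsigned tail;           /* only written by the consumer */
    struct { char payload[32]; } slots[RING_SIZE];
};

/* len must not exceed sizeof slots[0].payload */
static bool ring_push(struct ring *r, const void *data, size_t len)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)    /* full */
        return false;
    memcpy(r->slots[head & (RING_SIZE - 1)].payload, data, len);
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static bool ring_pop(struct ring *r, void *out, size_t len)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)                /* empty */
        return false;
    memcpy(out, r->slots[tail & (RING_SIZE - 1)].payload, len);
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}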
It should be obvious that a lot of 8-byte mallocs will give bad performance and horrible memory use. This article, and in particular the benchmarks in it, would be a lot more informative if the test case were more realistic.
Please add at least 32 or 64 bytes of payload to the linked list structure and re-run the benchmarks. Even that is a very small allocation block, but it is on the lower end of realistic allocation sizes (though still not good practice).
8-byte mallocs are expensive because most mallocs have per-allocation memory overhead to track things. This is exactly why you may want to use a different allocator.
IOW, your angle is that instead of finding a solution to the problem, one should just choose a different problem. You don't always have that luxury.
My background on this problem is compilers. Compilers allocate lots of little structures that represent tree nodes, values, tokens, etc. Forcing them all to be a minimum of 32 or 64 bytes in size, on the basis that this would justify using malloc for them, would be more than a little bizarre. Arena allocation - both per module (for structures that need to persist for the whole compilation) and stack-based (for structures that are discarded after e.g. evaluation or codegen) - makes far more sense than contorting the problem so that malloc makes sense.
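For concreteness, the per-module arena is roughly this shape (a minimal sketch with made-up names; it assumes requests no bigger than one block and only keeps 8-byte alignment):

#include <stdlib.h>
#include <stddef.h>

#define ARENA_BLOCK_SIZE (64 * 1024)

struct arena_block {
    struct arena_block *next;
    size_t              used;
    char                data[ARENA_BLOCK_SIZE];
};

struct arena {
    struct arena_block *blocks;
};

void *arena_alloc(struct arena *a, size_t size)
{
    size = (size + 7) & ~(size_t)7;            /* keep 8-byte alignment */
    struct arena_block *b = a->blocks;

    if (!b || b->used + size > ARENA_BLOCK_SIZE) {
        b = malloc(sizeof(*b));                /* one malloc per 64KB block */
        if (!b)
            return NULL;
        b->next = a->blocks;
        b->used = 0;
        a->blocks = b;
    }
    void *p = b->data + b->used;
    b->used += size;
    return p;
}

void arena_free_all(struct arena *a)           /* free the whole module at once */
{
    struct arena_block *b = a->blocks;

    while (b) {
        struct arena_block *next = b->next;
        free(b);
        b = next;
    }
    a->blocks = NULL;
}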
> 8-byte mallocs are expensive because most mallocs have per-allocation memory overhead to track things. This is exactly why you may want to use a different allocator.
I guess if the objective of the article is to point out the obvious fact that mallocing 8 bytes at a time is a bad idea, then showing some actual numbers from actual malloc implementations is a good idea. However, even small objects in practical problems are usually bigger than 8 bytes, so making the allocation size a bit bigger would give more realistic figures.
Overall I think the article was informative and well written, but a more realistic test case would better point out when to write a custom allocator and which allocator to choose for a particular usage pattern.
The objective of the article wasn't to stress the specific case of the 8-byte allocation pattern; it was about showing that malloc behavior depends on the context. The size of 8 bytes was chosen because it was small enough to allow a large number of allocations to be performed, so that the timings in the results were quite accurate. The main goal, however, was to show the difference between what are called the "contended" and the "uncontended" cases: your program may perform properly with a single-threaded workload, but poorly in a multi-threaded environment, not because of explicit locking in your code but because some resource sharing is hidden behind the allocator.
Also, the lots-of-allocations + lots-of-deallocations pattern was chosen because this is an issue we ran into quite recently: we allocated a huge tree structure progressively, and sometimes we flushed it to disk. The flush was quite efficient, but the deallocation blocked the program for approximately 30 seconds. As a quick fix, we moved the deallocation to the background, but this slowed the tree construction down by approximately 50% because of contention in the allocator. Even if the chunk sizes in the article are not realistic, the results were realistic enough to be worth publishing.
I provided (in a separate comment) the results of the benchmarks with a 32-byte payload (in that benchmark, all the allocators had the same 12% overhead in terms of memory, but we clearly see the same performance pattern as with the 8-byte payload).
I kinda see it more as "we're using an unrealistic workload to emphasize the per-malloc overhead". A program that was written to minimize and batch allocations wouldn't be as demonstrative.
Contra "exDM69", there's nothing unreasonable or "unrealistic" about allocating 8 bytes at a time; it's in fact extremely convenient to be able to do that. I think it's telling that someone would call that workload unrealistic; it indicates to me that they've never even really considered alternatives to malloc.
It's a little like those C programmers who try to use 256 byte static arrays for all their strings because they don't quite grok malloc/realloc.
Well, personally, if I'm using C, it's typically for a very narrowly defined, performance-sensitive problem. Both times I've done this in the last couple years, I found myself doing a couple big mallocs at the start and then running some tight loops over that memory with no further mallocs -- for use cases with more allocation, I'll just use whatever other language is more convenient, preferably one with a GC.
But if I were writing a bigger application entirely in C/C++, I could totally see lots of small allocations happening at different points (and getting into wackiness with custom pooled allocators and auto_ptr).
That's essentially what I do in much of my C code: implement pool allocators. And that's essentially what the post we're commenting on is talking about.
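Something along these lines, as a minimal sketch (the names are made up): one big malloc up front, a free list threaded through the unused slots, and no per-object malloc/free afterwards:

#include <stdlib.h>
#include <stddef.h>

struct pool {
    size_t  obj_size;
    size_t  nobjs;
    char   *storage;        /* one big malloc for all objects */
    void   *free_list;      /* singly-linked list threaded through free slots */
};

int pool_init(struct pool *p, size_t obj_size, size_t nobjs)
{
    /* round the object size up so free-list pointers stay aligned */
    p->obj_size = (obj_size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
    p->nobjs = nobjs;
    p->storage = malloc(p->obj_size * nobjs);
    if (!p->storage)
        return -1;
    p->free_list = NULL;
    for (size_t i = 0; i < nobjs; i++) {       /* thread every slot on the free list */
        void *slot = p->storage + i * p->obj_size;
        *(void **)slot = p->free_list;
        p->free_list = slot;
    }
    return 0;
}

void *pool_alloc(struct pool *p)
{
    void *slot = p->free_list;

    if (slot)
        p->free_list = *(void **)slot;
    return slot;
}

void pool_free(struct pool *p, void *slot)
{
    *(void **)slot = p->free_list;
    p->free_list = slot;
}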
Incidentally, a case for 8-byte allocations that is actually common in real-world C code: 32-bit linked list nodes (4 bytes for the next pointer, 4 bytes for the data pointer).
Hmm, are you sure that this (linked lists with small elements) is a good example which you should tell people about?
Linked lists are bad because they have big overhead (8 bytes for every element in your case - which dominates by a large margin if you store an int in each node for example) and they are really bad for CPU caches (they have very little spatial locality), thus slowing down your code.
I much rather prefer the "growing array" approach.
Yeah, but they have some unique characteristics related to inserting/deleting from the middle or beginning of the list. They're actually used in a few places in the linux kernel, for performance reasons, in spite of cache locality issues.
That is similar to the signal processing application I worked on a couple years ago. It was a real-time application with fixed sizes at each point in the algorithm, so we knew maximum memory usage at initialization. Thus, we allocated all memory up-front and continuously reused the buffers until we were shut down. Under the operational architecture we were moving to when I left, our application never actually called malloc at all; the control program that ran the signal processor and which started us malloced massive segments of memory on each machine and gave it to us, which we then "allocated" to our own buffers.
> 8-byte mallocs are expensive because most mallocs have per-allocation memory overhead to track things
As you say, most mallocs have this overhead. But with a bit of alignment trickery and bit masking you can put the header storage at the start of a page, bringing per-allocation overhead down to just a few bits.
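The scheme being described looks roughly like this (a hypothetical layout, not any particular allocator's): objects of one size class share a page-aligned span, the span's metadata sits at the start of the page, and free() recovers it by masking the object pointer instead of reading a per-object header:

#include <stdint.h>

#define PAGE_SIZE  4096u
#define PAGE_MASK  (~(uintptr_t)(PAGE_SIZE - 1))

struct page_header {
    uint16_t obj_size;      /* size class of every object in this page */
    uint16_t n_free;
    uint32_t free_bitmap;   /* 1 bit of bookkeeping per object */
    /* objects follow, up to the end of the page */
};

static inline struct page_header *header_of(void *obj)
{
    /* round the object pointer down to its page boundary */
    return (struct page_header *)((uintptr_t)obj & PAGE_MASK);
}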
I ran the test with a payload of 32 bytes. Here are the results:
Non-contended test:
* ptmalloc: alloc 39M/s, free 49M/s
* tcmalloc: alloc 42M/s, free 40M/s
* jemalloc: alloc 21M/s, free 21M/s
* t_stack: alloc 110M/s
Contended test:
* ptmalloc: alloc 6.1M/s, free 6.4M/s
* tcmalloc: alloc 25M/s, free 11M/s
* jemalloc: alloc 18M/s, free 6.5M/s
Note that the results are less accurate than those provided in the article, since the number of allocations per batch is smaller (due to the limited amount of RAM available on my desktop).
That said, the article does not claim to benchmark realistic patterns.
The post starts out by explaining that malloc is a generic allocator and is thus less performant. Your comment seems premised on the idea that the author is suggesting a replacement for malloc in all its use cases, which he pointedly is not doing. Instead of him re-running his benchmarks, you should give the post a closer re-read.
I hate to go meta, but that last sentence in your post could have been omitted with no detriment to the post's message and great improvement to the tone.
Very interesting article, especially the t_scope allocator - I never knew you could get GCC to perform that cleanup automagically. One minor grammar point: isn't a lock under contention a contended lock, not a contented lock?
Since the implementation isn't shown, I'd like to know whether these allocators are themselves based on malloc or whether there is some tricky assembly/kernel code going on somewhere.
Yes, I wish that cleanup was a portable C feature! Glad to see that GNU is trying something here, as it would serve as a prototype for standardization. Perhaps in a future version of C...
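For readers who, like the parent, haven't seen it: the GCC/Clang extension in question is __attribute__((cleanup(f))), which calls f(&var) whenever var goes out of scope, on any exit path. A trivial example (free_ptr is just a name picked for illustration):

#include <stdio.h>
#include <stdlib.h>

static void free_ptr(void *p)
{
    free(*(void **)p);      /* receives a pointer to the annotated variable */
}

int main(void)
{
    /* buf is free()d automatically when it goes out of scope */
    char *buf __attribute__((cleanup(free_ptr))) = malloc(128);

    if (buf) {
        snprintf(buf, 128, "freed automatically on scope exit");
        puts(buf);
    }
    return 0;               /* free_ptr(&buf) runs here */
}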
I haven't written much C++ in quite a few years - mostly do Ruby these days, so I'm sure I'm making some terribly embarrassing faux pas or other with the example below. But you don't need more than the C++ functionality to compose your own variations if you want more flexibility.
Better would be to just have the Scope class take a single function to run and establish multiple stack entries for them. This would be my version:
#include <cstdio>

template <typename tFunc>
class ScopeFunc {
  private:
    tFunc mFunc;

  public:
    ScopeFunc(tFunc func) : mFunc(func) {}
    ~ScopeFunc() {
        mFunc();
    }
};

template <typename tFunc>
ScopeFunc<tFunc> on_return(tFunc func)
{
    return ScopeFunc<tFunc>(func);
}

int main() {
    auto first = on_return([]() { std::puts("first"); });
    auto second = on_return([]() { std::puts("second"); });
    std::puts("Hello");
}
Note that this is C++11, using lambdas. Also note that the output will be in reverse order from what you might expect, since destruction is LIFO; but that's probably what you actually want for real deferred behaviour, as opposed to text output.
It also optimizes nicely, which yours may not, because the loop unrolling may be complicated and there's a higher chance of aliasing of the function pointers in the vector.
Good idea, this definitely accomplishes the same functionality, but it's done at runtime, rather than the compiler knowing all the functions at compile time. Perhaps a minor difference for most cases, though...
I would still prefer a C extension... Perhaps I should just use Go these days, though ironically all these memory allocation policies are useless in a GC language.
You underestimate the compiler. This formulation might be tricky, but the version I just posted (adjacent to this post) definitely leaves the compiler knowing exactly what's going to be called when the function exits.
No. The C vs C++ question is one of a simple language vs a comprehensive language. A simple language might be preferred for various reasons. For example, code is easier to review, which is Linus Torvalds' argument. Bindings to other languages are easier to create. The compiler is simpler, which might be important for embedded and high-reliability jobs.
The diff is truncated. The actual error is 0.5 ms on average. By rounding instead of truncating, we can reduce the error to 0.25 ms on average.
Well, that much is obvious. But if you are going to truncate, you should be consistent. Always truncate towards 0, not sometimes towards 0 and sometimes towards infinity.
The claim that returning memory to the kernel is hard is supported by the fact that most allocators use brk/sbrk to resize the data segment of the executing process, at least when they allocate small amounts of memory.
The other claim, that allocators have to lock global data structures, is also not true. Most modern operating systems support thread-local storage, so you don't need locking: you can keep per-thread allocators, and you only have to lock when you want to release memory belonging to a foreign thread (which is bad practice in most cases anyway).
Therefore, this article is great if your horizon ends at the default allocators tcmalloc, ptmalloc and jemalloc, but the reality is much more complex. The fact that such an allocator doesn't exist isn't because it's hard to implement; it's because there is no need for it, since most well-written software will allocate large chunks of memory.
> (...) that most allocators will use brk/sbrk to resize the data segment of the executing process to allocate memory, at least if they shall allocate few memory.
The other commonly used backend for malloc() is mmap() without an underlying file.
Handy both when allocating large chunks of memory and when allocating pools for smaller suballocations. It has the additional benefit of being zeroed out at low cost (or no cost at all -- for example via hardware DMA), and of playing nicely on systems with a constrained / fragmented address space, as the kernel is free to allocate at any address visible from userspace.
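For completeness, a minimal sketch of that backend: an anonymous, private mapping obtained directly from the kernel and later returned with munmap():

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = 1 << 20;   /* 1 MiB pool for suballocation */
    void *pool = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (pool == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* ... hand out suballocations from pool; the pages arrive zero-filled ... */
    munmap(pool, len);      /* returning the memory to the kernel is a single call */
    return 0;
}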
> Most modern operating systems supports thread-local storage and therefore you don't need locking because you can keep much per-threads allocators
The article explicitly points this out, and points out the problem with it: it means wasting memory on per-thread pools, and the more threads you use, the larger the pools need to be if you want to prevent contention, compounding the problem.
> because most well-written software will allocate large chunks of memory.
1. Most software is not well written.
2. Most large, well-written pieces of software that allocate only large chunks of memory have a custom allocator of some sort (or horrible abuses of arrays) embedded somewhere to work around exactly the problems noted in the article. In many cases people end up wasting time writing the same types of specialised allocators over and over.
I've seen plenty of large C and C++ apps that would have benefited greatly from a simple arena allocator, for example... And I have also seen countless implementations of arena allocators and various pool allocators and tons of other variations.
In other words: These things do exist. They're common, to the point where they're often covered in books on C/C++. Especially for C++ where there is specific built in (though weak) support for custom allocators.
Not about heap allocators but I followed the "About" link to this text:
> At Intersec, technology matters…Because it’s the core of our business, we aim to provide our clients with the most innovative and disruptive technological solutions.
> We do not believe in the benefits of reusing and staking external software bricks when developing our products. Our software is built in C language under Linux, with PHP/JavaScript for the web interfaces and it is continuously improved ...
So now I'm wondering whether PHP is actually perceived as being hard-core?
C+PHP+Ajax(+SCGI or FastCGI) is my go-to when I want to create essentially a custom webserver but don't actually want to reinvent an entire http daemon from scratch. The PHP is used to simplify routine annoying tasks, while letting the custom server do the fun stuff.
PHP can be thought of as a "configuration + templating language for your C application", with the C application being both the HTTP server and the host for stock and custom PHP plugins.
Yes, the whole stack can get hardcore if you end up processing large datasets in a short time, as long as you don't force PHP to do what other parts of the stack (SQL, JS) excel at :^)
I think the really hard stuff isn't done via PHP. Seems to me they use PHP as the front end because they want to focus on their really valuable technology - their C code.
PHP is used as a small layer that enables talking to our C code from javascript. We have a custom (Protocol Buffer-like) protocol to manage our RPC, the PHP embeds a native module that implements that protocol and exposes a webservice to which our Javascript code can talk in order to provide some valuable user-experience on top of our C-written technologies.
Nowadays there is so little intelligence in the PHP that we only consider it as a pass-through layer.
In thinking about the t_stack allocator, I think I can see some cases for which you might want to use this instead of the alloca function, but there is not enough information to be sure if I am going down the correct mental path. Can you please explain when/why I should use t_stack instead of alloca?
There are two main issues with alloca. First, you cannot deallocate or reallocate the memory; you just append more data to your frame. As a consequence, it is not suitable for dynamic allocations, while the t_stack is.
The second drawback is that alloca allocates on the stack, so it is limited by the size of the stack (a few megabytes on recent Linux distributions, and the actual remaining stack space depends on the call stack, since each frame consumes some stack and may already have put huge buffers/allocas on it). The t_stack has no hard limit.
Additionally, by being totally separate from the stack, the t_stack provides a flexible alternative to it: you have finer-grained control over allocation/deallocation patterns.
As said in another comment, the drawbacks of alloca are explained in the previous article of the series.
In fact, I was thinking of alloca throughout the whole article, and I really don't see the benefit of implementing a "custom solution" to the detriment of an existing one that works well.
It would be great to compare the results they give against alloca =)
Really interesting and well written, thanks for that. If you wrote some more about heap allocation strategies (best-fit, worst-fit, first-fit, etc.) to round out the discussion I'd love to read that as well, especially if you add varying allocation sizes to your benchmark.