The thread test is janky. Most multithreaded allocators optimize for the (common) case that objects are freed by the same thread that creates them. When objects are freed by different threads than the ones that allocated them, typically some sort of slow-path is invoked.
Older allocators with per-thread caches used to behave very badly with cross-thread frees, accumulating tons of freed objects in threads that didn't necessarily allocate a lot of objects. Tcmalloc uses a garbage collection process to move those objects back to the central free list.
The test in the article, where one thread does all the allocations and another does all the frees, basically subverts the thread-caching in tcmalloc, and just tests how quickly the garbage collection process can move freed objects from the free()-thread's cache back to the central heap where they can be reused by the malloc()-thread.
I admit that the test stresses some corner cases (at least, cases that the allocator designers consider corner cases). That said, malloc has no choice but to support that use case.
A use case for such a pattern is message-posting with workers: you queue some messages that are later dequeued and processed by a different thread. This is an increasingly common pattern in modern programs. In that pattern the message is allocated in one thread (let's say the main one) and processed, then deallocated, by another thread.
If your implementation of message allocation is malloc-based, then you will stress the exact same code paths the benchmark is stressing.
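A rough sketch of that pattern (all the names here are made up for illustration; compile with -pthread): the main thread mallocs messages and pushes them onto a shared queue, a worker pops, processes and frees them, so every free() is a cross-thread free:

#include <pthread.h>
#include <stdlib.h>

enum { N_MSG = 1000000 };

struct msg {
    struct msg *next;
    int         payload;
};

static struct msg      *queue_head;
static pthread_mutex_t  queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   queue_cond = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int freed = 0; freed < N_MSG; freed++) {
        pthread_mutex_lock(&queue_lock);
        while (!queue_head)
            pthread_cond_wait(&queue_cond, &queue_lock);
        struct msg *m = queue_head;
        queue_head = m->next;
        pthread_mutex_unlock(&queue_lock);

        /* ... process m ... */
        free(m);    /* freed by a thread that did not allocate it */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, worker, NULL);
    for (int i = 0; i < N_MSG; i++) {
        struct msg *m = malloc(sizeof(*m));    /* allocated by the main thread */
        m->payload = i;
        pthread_mutex_lock(&queue_lock);
        m->next = queue_head;
        queue_head = m;
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
    }
    pthread_join(tid, NULL);
    return 0;
}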
You're not wrong that malloc-based message passing causes that load on malloc, but if performance of the message-passing code is important, you'd want to use a ring buffer anyway - cross-CPU or not, malloc is pretty slow.
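For reference, a minimal sketch of what such a ring buffer might look like, assuming a single producer, a single consumer, C11 atomics, and fixed-size message slots (the names and sizes are made up); the point is that no malloc/free happens per message:

#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define RING_SIZE 1024              /* must be a power of two */

struct ring {
    _Atomic unsigned head;           /* only written by the producer */
    _Atomic unsigned tail;           /* only written by the consumer */
    struct { char payload[32]; } slots[RING_SIZE];
};

/* len must not exceed sizeof slots[0].payload */
static bool ring_push(struct ring *r, const void *data, size_t len)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)    /* full */
        return false;
    memcpy(r->slots[head & (RING_SIZE - 1)].payload, data, len);
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static bool ring_pop(struct ring *r, void *out, size_t len)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)                /* empty */
        return false;
    memcpy(out, r->slots[tail & (RING_SIZE - 1)].payload, len);
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}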
It should be obvious that a lot of 8-byte mallocs will give bad performance and horrible memory use. This article, and in particular the benchmarks in it, would be a lot more informative if the test case were more realistic.
Please add at least 32 or 64 bytes of payload to the linked list structure and re-run the benchmarks. Even that is a very small allocation block, but it is on the lower end of realistic allocation sizes (though still not good practice).
8-byte mallocs are expensive because most mallocs have per-allocation memory overhead to track things. This is exactly why you may want to use a different allocator.
IOW, your angle is that instead of finding a solution to the problem, one should just choose a different problem. You don't always have that luxury.
My background on this problem is compilers. Compilers allocate lots of little structures that represent tree nodes, values, tokens, etc. Forcing them all to be a minimum of 32 or 64 bytes in size, on the basis that this would justify using malloc for them, would be more than a little bizarre. Arena allocation - both per module (for structures that need to persist for the whole compilation) and stack-based (for structures that are discarded after e.g. evaluation or codegen) - makes far more sense than contorting the problem so that malloc makes sense.
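For concreteness, the per-module arena is roughly this shape (a minimal sketch with made-up names; it assumes requests no bigger than one block and only keeps 8-byte alignment):

#include <stdlib.h>
#include <stddef.h>

#define ARENA_BLOCK_SIZE (64 * 1024)

struct arena_block {
    struct arena_block *next;
    size_t              used;
    char                data[ARENA_BLOCK_SIZE];
};

struct arena {
    struct arena_block *blocks;
};

void *arena_alloc(struct arena *a, size_t size)
{
    size = (size + 7) & ~(size_t)7;            /* keep 8-byte alignment */
    struct arena_block *b = a->blocks;

    if (!b || b->used + size > ARENA_BLOCK_SIZE) {
        b = malloc(sizeof(*b));                /* one malloc per 64KB block */
        if (!b)
            return NULL;
        b->next = a->blocks;
        b->used = 0;
        a->blocks = b;
    }
    void *p = b->data + b->used;
    b->used += size;
    return p;
}

void arena_free_all(struct arena *a)           /* free the whole module at once */
{
    struct arena_block *b = a->blocks;

    while (b) {
        struct arena_block *next = b->next;
        free(b);
        b = next;
    }
    a->blocks = NULL;
}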
> 8-byte mallocs are expensive because most mallocs have per-allocation memory overhead to track things. This is exactly why you may want to use a different allocator.
I guess if the objective of the article is to point out the obvious fact that mallocing 8 bytes at a time is a bad idea, then showing some actual numbers from actual malloc implementations is a good idea. However, even small objects in practical problems are usually bigger than 8 bytes, so making the allocation size a bit bigger would give more realistic figures.
Overall I think the article was informative and well written, but a more realistic test case would better point out when to write a custom allocator and which allocator to choose for a particular usage pattern.
The objective of the article wasn't to stress the specific case of the 8-byte allocation pattern; it was about showing that malloc behavior depends on the context. The size of 8 bytes was chosen because it was small enough to allow a large number of allocations to be performed, so that the timings in the results were quite accurate. The main goal, however, was to show the difference between what are called the "contended" and the "uncontended" cases: your program may perform properly with a single-threaded workload, but poorly in a multi-threaded environment, not because of explicit locking in your code but because some resource sharing is hidden behind the allocator.
Also, the lots-of-allocations + lots-of-deallocations pattern was chosen because this is an issue we ran into quite recently: we allocated a huge tree structure progressively, and sometimes we flushed it to disk. The flush was quite efficient, but the deallocation blocked the program for approximately 30 seconds. As a quick fix, we moved the deallocation to the background, but this slowed the tree construction down by approximately 50% because of contention in the allocator. Even if the chunk sizes in the article are not realistic, the results were realistic enough to be worth publishing.
I provided (in a separate comment) the results of the benchmarks with a 32-byte payload (in that benchmark, all the allocators had the same 12% overhead in terms of memory, but we clearly see the same performance pattern as with the 8-byte payload).
I kinda see it more as "we're using an unrealistic workload to emphasize the per-malloc overhead". A program that was written to minimize and batch allocations wouldn't be as demonstrative.
Contra "exDM69", there's nothing unreasonable or "unrealistic" about allocating 8 bytes at a time; it's in fact extremely convenient to be able to do that. I think it's telling that someone would call that workload unrealistic; it indicates to me that they've never even really considered alternatives to malloc.
It's a little like those C programmers who try to use 256 byte static arrays for all their strings because they don't quite grok malloc/realloc.
Well, personally, if I'm using C, it's typically for a very narrowly defined, performance-sensitive problem. Both times I've done this in the last couple years, I found myself doing a couple big mallocs at the start and then running some tight loops over that memory with no further mallocs -- for use cases with more allocation, I'll just use whatever other language is more convenient, preferably one with a GC.
But if I were writing a bigger application entirely in C/C++, I could totally see lots of small allocations happening at different points (and getting into wackiness with custom pooled allocators and auto_ptr).
That's essentially what I do in much of my C code: implement pool allocators. And that's essentially what the post we're commenting on is talking about.
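Something along these lines, as a minimal sketch (the names are made up): one big malloc up front, a free list threaded through the unused slots, and no per-object malloc/free afterwards:

#include <stdlib.h>
#include <stddef.h>

struct pool {
    size_t  obj_size;
    size_t  nobjs;
    char   *storage;        /* one big malloc for all objects */
    void   *free_list;      /* singly-linked list threaded through free slots */
};

int pool_init(struct pool *p, size_t obj_size, size_t nobjs)
{
    /* round the object size up so free-list pointers stay aligned */
    p->obj_size = (obj_size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
    p->nobjs = nobjs;
    p->storage = malloc(p->obj_size * nobjs);
    if (!p->storage)
        return -1;
    p->free_list = NULL;
    for (size_t i = 0; i < nobjs; i++) {       /* thread every slot on the free list */
        void *slot = p->storage + i * p->obj_size;
        *(void **)slot = p->free_list;
        p->free_list = slot;
    }
    return 0;
}

void *pool_alloc(struct pool *p)
{
    void *slot = p->free_list;

    if (slot)
        p->free_list = *(void **)slot;
    return slot;
}

void pool_free(struct pool *p, void *slot)
{
    *(void **)slot = p->free_list;
    p->free_list = slot;
}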
Incidentally, a case for 8-byte allocations that is actually common in real-world C code: 32-bit linked list nodes (4 bytes for the next pointer, 4 bytes for the data pointer).
Hmm, are you sure that this (linked lists with small elements) is a good example which you should tell people about?
Linked lists are bad because they have big overhead (8 bytes for every element in your case - which dominates by a large margin if you store an int in each node for example) and they are really bad for CPU caches (they have very little spatial locality), thus slowing down your code.
I much rather prefer the "growing array" approach.
Yeah, but they have some unique characteristics related to inserting/deleting from the middle or beginning of the list. They're actually used in a few places in the linux kernel, for performance reasons, in spite of cache locality issues.
That is similar to the signal processing application I worked on a couple years ago. It was a real-time application with fixed sizes at each point in the algorithm, so we knew maximum memory usage at initialization. Thus, we allocated all memory up-front and continuously reused the buffers until we were shut down. Under the operational architecture we were moving to when I left, our application never actually called malloc at all; the control program that ran the signal processor and which started us malloced massive segments of memory on each machine and gave it to us, which we then "allocated" to our own buffers.
> 8-byte mallocs are expensive because most mallocs have per-allocation memory overhead to track things
As you say, most mallocs have this overhead. But with a bit of alignment trickery and bit masking you can put the header storage at the start of a page, bringing per-allocation overhead down to just a few bits.
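The scheme being described looks roughly like this (a hypothetical layout, not any particular allocator's): objects of one size class share a page-aligned span, the span's metadata sits at the start of the page, and free() recovers it by masking the object pointer instead of reading a per-object header:

#include <stdint.h>

#define PAGE_SIZE  4096u
#define PAGE_MASK  (~(uintptr_t)(PAGE_SIZE - 1))

struct page_header {
    uint16_t obj_size;      /* size class of every object in this page */
    uint16_t n_free;
    uint32_t free_bitmap;   /* 1 bit of bookkeeping per object */
    /* objects follow, up to the end of the page */
};

static inline struct page_header *header_of(void *obj)
{
    /* round the object pointer down to its page boundary */
    return (struct page_header *)((uintptr_t)obj & PAGE_MASK);
}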
I ran the test with a payload of 32 bytes. Here are the results:
Non-contended test:
* ptmalloc: alloc 39M/s, free 49M/s
* tcmalloc: alloc 42M/s, free 40M/s
* jemalloc: alloc 21M/s, free 21M/s
* t_stack: alloc 110M/s
Contended test:
* ptmalloc: alloc 6.1M/s, free 6.4M/s
* tcmalloc: alloc 25M/s, free 11M/s
* jemalloc: alloc 18M/s, free 6.5M/s
Note that the results are less accurate than those provided in the article, since the number of allocations per batch is smaller (due to the limited amount of RAM available on my desktop).
That said, the article does not claim to benchmark realistic patterns.
The post starts out by explaining that malloc is a generic allocator and is thus less performant. Your comment seems premised on the idea that the author is suggesting a replacement for malloc in all its use cases, which he pointedly is not doing. Instead of him re-running his benchmarks, you should give the post a closer re-read.
I hate to go meta, but that last sentence in your post could have been omitted with no detriment to the post's message and great improvement to the tone.
Very interesting article, especially the t_scope allocator - I never knew you could get GCC to perform that cleanup automagically. One minor grammar point: isn't a lock under contention a contended lock, not a contented lock?
Since the implementation isn't shown, I'd like to know whether these allocators are themselves based on malloc or whether there is some tricky assembly/kernel code going on somewhere.
Yes, I wish that cleanup was a portable C feature! Glad to see that GNU is trying something here, as it would serve as a prototype for standardization. Perhaps in a future version of C...
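For readers who, like the parent, haven't seen it: the GCC/Clang extension in question is __attribute__((cleanup(f))), which calls f(&var) whenever var goes out of scope, on any exit path. A trivial example (free_ptr is just a name picked for illustration):

#include <stdio.h>
#include <stdlib.h>

static void free_ptr(void *p)
{
    free(*(void **)p);      /* receives a pointer to the annotated variable */
}

int main(void)
{
    /* buf is free()d automatically when it goes out of scope */
    char *buf __attribute__((cleanup(free_ptr))) = malloc(128);

    if (buf) {
        snprintf(buf, 128, "freed automatically on scope exit");
        puts(buf);
    }
    return 0;               /* free_ptr(&buf) runs here */
}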
I haven't written much C++ in quite a few years - mostly do Ruby these days, so I'm sure I'm making some terribly embarrassing faux pas or other with the example below. But you don't need more than the C++ functionality to compose your own variations if you want more flexibility.
Better would be to just have the Scope class take a single function to run and establish multiple stack entries for them. This would be my version:
#include <cstdio>

template <typename tFunc>
class ScopeFunc {
  private:
    tFunc mFunc;

  public:
    ScopeFunc(tFunc func) : mFunc(func) {}
    ~ScopeFunc() {
        mFunc();
    }
};

template <typename tFunc>
ScopeFunc<tFunc> on_return(tFunc func)
{
    return ScopeFunc<tFunc>(func);
}

int main() {
    auto first = on_return([]() { std::puts("first"); });
    auto second = on_return([]() { std::puts("second"); });
    std::puts("Hello");
}
Note that this is C++11, using lambdas. Also note that the output will be in reverse order from what you might expect, since destruction is LIFO; but that's probably what you actually want for real deferred behaviour, as opposed to text output.
It also optimizes nicely, which yours may not, because the loop unrolling may be complicated and there's a higher chance of aliasing of the function pointers in the vector.
Good idea, this definitely accomplishes the same functionality, but it's done at runtime, rather than the compiler knowing all the functions at compile time. Perhaps a minor difference for most cases, though...
I would still prefer a C extension... Perhaps I should just use Go these days, though ironically all these memory allocation policies are useless in a GC language.
You underestimate the compiler. This formulation might be tricky, but the version I just posted (adjacent to this post) definitely leaves the compiler knowing exactly what's going to be called when the function exits.
No. The C vs C++ question is one of a simple language vs a comprehensive language. A simple language might be preferred for various reasons. For example, code is easier to review, which is Linus Torvalds' argument. Bindings to other languages are easier to create. The compiler is simpler, which might be important for embedded and high-reliability jobs.
The diff is truncated. The actual error is 0.5 ms on average. By rounding instead of truncating, we can reduce the error to 0.25 ms on average.
Well, that much is obvious. But if you are going to truncate, you should be consistent. Always truncate towards 0, not sometimes towards 0 and sometimes towards infinity.
The claim that returning memory to the kernel is hard is supported by the fact that most allocators use brk/sbrk to resize the data segment of the executing process, at least when they allocate small amounts of memory.
The other claim, that allocators have to lock global data structures, is also not true. Most modern operating systems support thread-local storage, so you don't need locking: you can keep per-thread allocators, and you only have to lock when you want to release memory belonging to a foreign thread (which is bad practice in most cases anyway).
Therefore, this article is great if your horizon ends at the default allocators tcmalloc, ptmalloc and jemalloc, but the reality is much more complex. The fact that such an allocator doesn't exist isn't because it's hard to implement; it's because there is no need for it, since most well-written software will allocate large chunks of memory.
> (...) that most allocators will use brk/sbrk to resize the data segment of the executing process to allocate memory, at least if they shall allocate few memory.
The other commonly used backend for malloc() is mmap() without an underlying file.
Handy both when allocating large chunks of memory and when allocating pools for smaller suballocations. It has the additional benefit of being zeroed out at low cost (or no cost at all -- for example via hardware DMA), and of playing nicely on systems with a constrained / fragmented address space, as the kernel is free to allocate at any address visible from userspace.
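For completeness, a minimal sketch of that backend: an anonymous, private mapping obtained directly from the kernel and later returned with munmap():

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = 1 << 20;   /* 1 MiB pool for suballocation */
    void *pool = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (pool == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* ... hand out suballocations from pool; the pages arrive zero-filled ... */
    munmap(pool, len);      /* returning the memory to the kernel is a single call */
    return 0;
}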
> Most modern operating systems supports thread-local storage and therefore you don't need locking because you can keep much per-threads allocators
The article explicitly points this out, and points out the problem with it: it means wasting memory on per-thread pools, and the more threads you use, the larger the pools need to be if you want to prevent contention, compounding the problem.
> because most well-written software will allocate large chunks of memory.
1. Most software is not well written.
2. Most large, well-written pieces of software that allocate only large chunks of memory have a custom allocator of some sort (or horrible abuses of arrays) embedded somewhere to work around exactly the problems noted in the article. In many cases people end up wasting time writing the same types of specialised allocators over and over.
I've seen plenty of large C and C++ apps that would have benefited greatly from a simple arena allocator, for example... And I have also seen countless implementations of arena allocators and various pool allocators and tons of other variations.
In other words: These things do exist. They're common, to the point where they're often covered in books on C/C++. Especially for C++ where there is specific built in (though weak) support for custom allocators.
Not about heap allocators but I followed the "About" link to this text:
> At Intersec, technology matters…Because it’s the core of our business, we aim to provide our clients with the most innovative and disruptive technological solutions.
> We do not believe in the benefits of reusing and staking external software bricks when developing our products. Our software is built in C language under Linux, with PHP/JavaScript for the web interfaces and it is continuously improved ...
So now I'm wondering whether PHP is actually perceived as being hard-core?
C+PHP+Ajax(+SCGI or FastCGI) is my go-to when I want to create essentially a custom webserver but don't actually want to reinvent an entire http daemon from scratch. The PHP is used to simplify routine annoying tasks, while letting the custom server do the fun stuff.
PHP can be thought of as a "configuration + templating language for your C application", with the C application being both the HTTP server and the host for stock and custom PHP plugins.
Yes, the whole stack can get hardcore if you end up processing large datasets in a short time, as long as you don't force PHP to do what other parts of the stack (SQL, JS) excel at :^)
I think the really hard stuff isn't done via PHP. Seems to me they use PHP as the front end because they want to focus on their really valuable technology - their C code.
PHP is used as a small layer that enables talking to our C code from javascript. We have a custom (Protocol Buffer-like) protocol to manage our RPC, the PHP embeds a native module that implements that protocol and exposes a webservice to which our Javascript code can talk in order to provide some valuable user-experience on top of our C-written technologies.
Nowadays there is so little intelligence in the PHP that we only consider it as a pass-through layer.
In thinking about the t_stack allocator, I think I can see some cases for which you might want to use this instead of the alloca function, but there is not enough information to be sure if I am going down the correct mental path. Can you please explain when/why I should use t_stack instead of alloca?
There are two main issues with alloca. First, you cannot deallocate or reallocate the memory; you just append more data to your frame. As a consequence, it is not suitable for dynamic allocations, while the t_stack is.
The second drawback is that alloca allocates on the stack, so it is limited by the size of the stack (a few megabytes on recent Linux distributions, and the actual remaining stack space depends on the call stack, since each frame consumes some stack and may already have put huge buffers/allocas on it). The t_stack has no hard limit.
Additionally, by being totally separate from the stack, the t_stack provides a flexible alternative to it: you have finer-grained control over allocation/deallocation patterns.
As said in another comment, the drawbacks of alloca are explained in the previous article of the series.
In fact, I was thinking of alloca throughout the whole article, and I really don't see the benefit of implementing a "custom solution" to the detriment of an existing one that works well.
It would be great to compare the results they give against alloca =)
Really interesting and well written, thanks for that. If you wrote some more about heap allocation strategies (best-fit, worst-fit, first-fit, etc.) to round out the discussion I'd love to read that as well, especially if you add varying allocation sizes to your benchmark.