The article says a lot about CPUs reordering memory instructions - is this actually the main cause of the issues with the shown code?

Speaking of x86 specifically, instructions are aggressively reordered internally, but as far as I understand it, the results are generally committed in order. The ISA has rules about which memory reorderings can occur, and it looks like most reorderings are forbidden. This Stack Overflow answer explains it well: https://stackoverflow.com/a/50310563/7064452

In this queue example I'd say the elephant in the room is the compiler, not the CPU. The code has no language-level synchronisation, so the compiler is free to do whatever it likes - it's not surprising that memory ops get reordered. If you want to be pedantic, the code is UB with no synchronisation present. It's perhaps beside the point to discuss CPU behaviour in this light.
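
As a hypothetical illustration (not the article's code): even with zero CPU reordering, the compiler alone is allowed to reorder or merge plain stores to independent variables, which is enough to break this kind of code.

    // Hypothetical snippet, not from the article: nothing stops the compiler
    // from emitting the store to `ready` before the store to `data`, or from
    // keeping either value in a register.
    int data = 0;
    int ready = 0;

    void produce() {
        data = 42;
        ready = 1;   // a thread polling `ready` has no guarantee it then sees data == 42
    }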



All CPUs commit in order and take exceptions precisely, because most other options are insane, or would drive you to it. However: single-thread commit order != observability order.

Observability order of memory operations --- which are the only operations that matter --- is governed by the memory consistency model of the architecture. x86 has what's generally referred to as strong ordering on memory operations.

On x86, part of it means that stores from the same core cannot be observed out of order from each other, nor can loads.

So assuming the compiler does not move the `tail++` up, or move the assignment out of the if-statement (both of which can be ensured by marking them `volatile`), the code should actually work on x86. The `tail++` change cannot be observed before the write to the queue, and the reading from the queue cannot be observed before the reading of the `tail` and `head` variables.
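
To make that concrete, here is a minimal sketch of the kind of single-producer/single-consumer queue being discussed (my reconstruction, not the article's actual code). It is still a data race as far as the language is concerned, but with `volatile` keeping the compiler from reordering, x86's ordering rules make the program order also the observed order:

    // My reconstruction of the kind of queue being discussed, not the article's code.
    #include <cstddef>

    struct SpscQueue {
        static const size_t N = 256;
        int buf[N];
        volatile size_t head = 0;   // advanced only by the consumer
        volatile size_t tail = 0;   // advanced only by the producer

        bool push(int v) {                       // producer thread only
            if (tail - head == N) return false;  // full
            buf[tail % N] = v;                   // (1) write the element
            tail = tail + 1;                     // (2) publish; must not be observed before (1)
            return true;
        }

        bool pop(int &v) {                       // consumer thread only
            if (tail == head) return false;      // empty
            v = buf[head % N];                   // must not be observed before the loads of tail/head
            head = head + 1;
            return true;
        }
    };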

On RISC-V and Arm, you need more as they have substantially weaker memory consistency. The RISC-V specs have some examples of interesting outcomes you can have. Some of it involves time-travel.

But in the end: yes, the reordering done by the CPU is the issue. The compiler can and does reorder stuff when it thinks that it'll unlock more instruction-level parallelism, but no amount of volatile is going to make that queue universally usable on RISC-V, no matter what the compiler does. Even if it perfectly preserves the single-thread semantics of the code and does not reorder a single instruction, the CPU can still move stuff around in terms of observability. The alternative is that the compiler inserts a barrier/fence after every instruction.
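
For comparison, the conventional way to get that ordering portably is to say it in the language and let the compiler emit whatever barriers Arm or RISC-V need. A sketch using C++11 atomics (mine, not the article's code):

    // Release/acquire version of the same queue: the release store to `tail`
    // guarantees the element write is visible before the new tail is, on any
    // architecture; the acquire loads pair with the releases on the other side.
    #include <atomic>
    #include <cstddef>

    struct SpscQueueAtomic {
        static const size_t N = 256;
        int buf[N];
        std::atomic<size_t> head{0};
        std::atomic<size_t> tail{0};

        bool push(int v) {                                    // producer thread only
            size_t t = tail.load(std::memory_order_relaxed);  // only we write tail
            if (t - head.load(std::memory_order_acquire) == N) return false;  // full
            buf[t % N] = v;                                   // write the element
            tail.store(t + 1, std::memory_order_release);     // publish it
            return true;
        }

        bool pop(int &v) {                                    // consumer thread only
            size_t h = head.load(std::memory_order_relaxed);  // only we write head
            if (tail.load(std::memory_order_acquire) == h) return false;      // empty
            v = buf[h % N];                                   // safe: acquire saw the matching release
            head.store(h + 1, std::memory_order_release);     // lets the producer reuse the slot
            return true;
        }
    };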

There are trade-offs. Poorly written code for x86 can absolutely tank performance because of ordering violations requiring code to be replayed, though that is sometimes a problem in even weaker consistency models as well.


Valid points, although I have another perspective on this bit:

> But in the end: yes, the reordering done by the CPU is the issue

I think from a programmer perspective, the CPU side of things is mostly beside the point (unless you're writing assembly), and this contributes to the misunderstanding and air of mystery surrounding thread safety.

At the end of the day the CPU can do anything, really. I'd argue this doesn't matter because the compiler is generating machine code, not us. What does matter is the contract between us and the compiler / language spec. Without language-level synchronisation the code is not valid C/C++ and we will likely observe unexpected behaviour - whether that comes from CPU reordering or compiler optimisations doesn't matter.
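
A small hypothetical example of what I mean by that contract (mine, not from the article): with a plain flag, the compiler is allowed to read it once and spin forever, because a concurrent write would be a data race and is therefore outside the contract. Making it atomic restores the obligation to actually re-read it:

    #include <atomic>

    bool plain_flag = false;

    void wait_plain() {
        while (!plain_flag) { }   // may legally compile to: if (!plain_flag) for (;;) {}
    }

    std::atomic<bool> ready{false};

    void wait_atomic() {
        while (!ready.load(std::memory_order_acquire)) { }   // must re-read the flag each time
    }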

I think the article somewhat misses the point by presenting the case as if the compiler were not part of the equation. It often seems like people think they know how to do thread safety because they know, e.g., what reorderings the CPU may do. "Just need to add volatile here and we're good!" (probably wrong). In reality they need to understand how the language models concurrency.

We could translate that queue code into another language with a different concurrency model - e.g. Python - and now the behaviour is different despite the CPU doing the same fundamental reorderings.


This is true, but in practice it's pretty common to find that this sort of code seems to work fine on x64, because the compiler doesn't actually reorder things, and then it sometimes blows up on ARM (or PowerPC, though that's less commonly encountered in the wild these days).


The searchable keyword to look for here (re x86) is "Total Store Ordering" (TSO).

(I'm not gonna try to summarize it here because I'd probably get it ever so subtly wrong…)

x86 has a very strong memory model, meaning it is a very poor test platform. Last time I touched atomics I used PowerPC (e500mc & e6500) to test, which was a good thing as it did point me at a problem. Not sure where current ARM falls on this. The uncontested, notorious king of weak memory ordering is DEC Alpha, but those are a bit hard to run these days. If you want to go truly insane, look at DEC Alpha's consume/data-dependency (non-)ordering :)
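
In modern C++ terms the Alpha oddity looks roughly like this (my illustration, not from the thread): even a load whose address depends on a previous load is not ordered after it, which is the situation memory_order_consume was invented for (and which compilers today just promote to acquire):

    #include <atomic>

    struct Node { int payload; };
    std::atomic<Node*> published{nullptr};

    void writer(Node* n) {
        n->payload = 42;
        published.store(n, std::memory_order_release);   // publish the initialised node
    }

    int reader() {
        // On most architectures the address dependency orders the payload load
        // after the pointer load for free; on Alpha it does not, hence consume.
        Node* n = published.load(std::memory_order_consume);
        return n ? n->payload : -1;
    }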


Yes, having worked on one of the out-of-order Intel CPUs, I can tell you that you are correct. Instructions may be "complete", as in their results can be forwarded to later operations, but an instruction isn't "retired" until it is known that it cannot raise an exception or be cancelled because of a branch mispredict, etc. Programmer-visible architectural state as defined in the ISA is not written until instruction retirement. The CPU re-ordering instructions is not going to change the semantics (on x86 and similar architectures... there are some archs that relax that guarantee).

Compilers are notorious for doing dumb things around locks... GNU C for the AVR architecture, for instance, looks at the SEI instruction (SEt global Interrupt flag) and notices that it doesn't modify memory or registers, so it hoists it to the top of functions. Eh... no, SEI; CLI; <code> <critical section> <code> is not what I intended...
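
(For anyone hitting this today, the usual workaround looks roughly like the sketch below - my code, not the poster's: giving the interrupt instructions a "memory" clobber marks them as touching memory, so the compiler can no longer hoist them or move the protected accesses across them.)

    // My sketch of the usual fix, not the poster's code (GCC inline-asm syntax, AVR target).
    #include <stdint.h>

    static inline void irq_disable(void) { __asm__ __volatile__("cli" ::: "memory"); }
    static inline void irq_enable(void)  { __asm__ __volatile__("sei" ::: "memory"); }

    uint16_t shared_counter;    // also updated from an interrupt handler (not shown)

    void bump_counter(void) {
        irq_disable();                          // enter critical section
        shared_counter = shared_counter + 1;    // 16-bit update must not be torn by an interrupt on 8-bit AVR
        irq_enable();                           // leave critical section
    }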

Also... CPUs with data caches can do smart things with architecturally defined locking instructions such as 'test-and-set' or 'compare-and-exchange', such that those instructions are always cache-coherent across CPUs. If you try to roll your own locking code, you had best understand how the cache invalidation mechanism works in your chosen CPU, or you are going to have a bad day.


> CPUs with data caches can do smart things with architecturally defined locking instructions such as 'test-and-set' or 'compare-and-exchange', such that those instructions are always cache-coherent across CPUs. If you try to roll your own locking code, you had best understand how the cache invalidation mechanism works in your chosen CPU, or you are going to have a bad day.

What do you mean? Are you implying that read-modify-writes are treated differently from plain writes by the cache coherency protocol?


I'm saying that an atomic RMW is going to get the cache line in "exclusive" state (in a typical MESI protocol), but that if you are trying to gin up the equivalent with spin locks you need to think through how that plays out, as the reads might only get the line in "shared" state.
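
To make that concrete, here's a sketch (mine, assuming a C++ setting) of the usual "test-and-test-and-set" pattern: the RMW is what pulls the line in exclusive, while the plain loads in the inner loop can spin on a shared copy, so the line only bounces between cores when the lock actually changes hands:

    #include <atomic>

    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            for (;;) {
                // RMW: needs the cache line in Exclusive/Modified state on this core.
                if (!locked.exchange(true, std::memory_order_acquire))
                    return;
                // Plain loads: can be satisfied from a Shared copy, no invalidations.
                while (locked.load(std::memory_order_relaxed)) { /* spin */ }
            }
        }
        void unlock() {
            locked.store(false, std::memory_order_release);
        }
    };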


I still don't see what you're getting at. What is the implication of this for software? The implementation of the cache coherency protocol is largely opaque to software.


The article does mention at the end that a trivial translation of the queue code to x86 will work.

It is broken on other architectures though (aside from the obvious UB because of the races).



