I've seen many queues implemented similarly to this, and it works fine. Every thread that calls fetch-and-add receives a different result, so, wrapping aside, only one thread can have access to each slot. If there is only one consumer and one producer you don't need fetch-and-add at all (that is how FastFlow is implemented). You must check that your slot is populated. One way is comparing against the other index (the producer index, if you're the consumer), which is a bad idea because that cache line is highly contended. The better way is to set a slot to null after you've consumed it; then, if you're the producer, you first check whether the current slot is null before attempting to increment the index.

A race condition exists where the slot can be null before the increment but not after it (e.g. only one slot was free, but several producers saw that and all incremented). In practice all that means is the producer can skip slots, and the consumer must keep incrementing when it sees a null slot rather than assume an empty queue (this can be tempered by checking x slots and then falling back to checking the producer index).

The consumer and producer indexes must be on separate cache lines, and some effort can be made to keep the consumer and producer on separate cache lines in the queue storage itself, e.g. each slot can be a whole cache line, and pop() and push() can be batched to work with 8 elements at a time. In other words OP is a noob and I could write a queue that whoops his hiney (danger: attempt at humor).
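Something like this sketch, maybe. All names here are mine (this isn't FastFlow or any real library), it's single-consumer for brevity, and the consumer's null-skipping logic is left out:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Sketch only: multi-producer ring where a null slot means "free".
// Producers check-then-claim with fetch_add; the consumer nulls a slot
// after taking its value.
template <std::size_t N>
class RingQueue {
    std::atomic<void*> slots_[N] = {};              // null == empty slot
    alignas(64) std::atomic<std::size_t> head_{0};  // producer index
    alignas(64) std::atomic<std::size_t> tail_{0};  // consumer index

public:
    bool push(void* p) {
        std::size_t i = head_.load(std::memory_order_relaxed);
        // Check the slot is free *before* bumping the shared index.
        if (slots_[i % N].load(std::memory_order_acquire) != nullptr)
            return false;                           // looks full; back off
        i = head_.fetch_add(1, std::memory_order_relaxed);
        void* expected = nullptr;
        // The race from the text: the slot was null before the increment,
        // but it may not be free by the time we get here. The CAS catches
        // that; a losing producer returns false and retries with a fresh
        // index, leaving a skipped slot behind.
        return slots_[i % N].compare_exchange_strong(
            expected, p, std::memory_order_release);
    }

    void* pop() {
        std::size_t i = tail_.load(std::memory_order_relaxed);
        void* p = slots_[i % N].exchange(nullptr, std::memory_order_acquire);
        if (p != nullptr)
            tail_.store(i + 1, std::memory_order_relaxed);
        // p == nullptr does NOT always mean empty: a producer may have
        // claimed this index without writing it yet, so a real consumer
        // skips a bounded run of nulls before concluding the queue is empty.
        return p;
    }
};
```

Note the values are pointers precisely so that null can double as the "slot is free" flag without a separate occupancy word.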
Ok, so it is relatively easy to obtain a safe implementation and from there you can use a heuristic to prevent a thread from looping forever trying to find an available slot. Would that sum up the issue in practice?
I'd rephrase that: it's easy to create a safe implementation, but the dangers, and the biggest payoffs, show up when you stop accessing both the producer and consumer counters in the same thread all the time. Cache lines shared between the consumer and producer, and written to by at least one of them, are the hottest points of contention. With careful design you can get that to zero in the common cases.
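Concretely, keeping the two counters off each other's cache lines is just alignment/padding. A minimal illustration, assuming 64-byte lines (C++17's `std::hardware_destructive_interference_size` is the portable spelling of that constant):

```cpp
#include <atomic>
#include <cstddef>

// alignas(64) forces each counter onto its own 64-byte cache line, so a
// producer bumping its index doesn't invalidate the line the consumer's
// index lives on. (Assumes 64-byte lines, which is typical on x86.)
struct Indexes {
    alignas(64) std::atomic<std::size_t> producer{0};
    alignas(64) std::atomic<std::size_t> consumer{0};
};

// Each member starts a fresh line, so the struct spans two full lines.
static_assert(sizeof(Indexes) == 128, "counters would share a cache line");
```

The static_assert is a cheap compile-time guard that the padding actually took effect.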
There are different kinds of lock freedom, but in this case, if you read carefully, that's a normal operating state and does not block consumers. It's the race I mentioned: the consumers have to be written to skip over a limited number of null entries and then check the actual producer index. In most cases that means a little extra iteration. It would be very rare to need to check the producer index, which keeps it fast.
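The consumer policy I'm describing might look like this sketch. `try_pop`, `kSlots`, and `kMaxSkips` are my names, and this glosses over multi-consumer coordination on the tail:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kSlots = 8;  // illustrative capacity
constexpr int kMaxSkips = 4;       // nulls to skip before touching head

// Skip a bounded run of null slots first; only then read the producer
// index, which lives on a heavily contended cache line, to tell "items
// still in flight" from "actually empty".
void* try_pop(std::atomic<void*> (&slots)[kSlots],
              std::atomic<std::size_t>& tail,
              const std::atomic<std::size_t>& head) {
    std::size_t i = tail.load(std::memory_order_relaxed);
    for (int skips = 0; skips < kMaxSkips; ++skips, ++i) {
        void* p = slots[i % kSlots].exchange(nullptr,
                                             std::memory_order_acquire);
        if (p != nullptr) {
            tail.store(i + 1, std::memory_order_relaxed);
            return p;              // found a populated slot
        }
        // null: a producer may have skipped this slot; keep scanning
    }
    // Rare fallback: pay for one read of the hot producer index.
    if (tail.load(std::memory_order_relaxed) ==
        head.load(std::memory_order_acquire))
        return nullptr;            // tail caught up to head: truly empty
    return nullptr;                // producers mid-write: caller retries
}
```

The point is that the expensive branch (reading `head`) is only reached after a run of nulls, which should be rare in steady state.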