Hacker News

I see. So if I may paraphrase, TSX can also help write-heavy workloads by eliminating locks that weren't actually needed, because no actual contention over the same data occurred. This can happen if the lock was too coarse-grained and the writes were actually to different memory, or if the dynamically-executed path didn't actually mutate shared memory.

Both of these are problems you can code around, but doing that would likely require greater effort and complexity.

It's unclear to me whether recovering from mis-speculation is more expensive than acquiring the lock up-front. In other words, is there overhead in the case that you almost always encounter actual contention?

It's also unclear to me how this can be done without instruction set support, as the paper claims. Without having the critical sections explicitly marked, how can this be implemented?



> I see. So if I may paraphrase, TSX can also help write-heavy workloads by eliminating locks that weren't actually needed, because no actual contention over the same data occurred. This can happen if the lock was too coarse-grained and the writes were actually to different memory, or if the dynamically-executed path didn't actually mutate shared memory.

Yes, those two cases are what I'd consider secondary benefits that could be achieved straightforwardly (but not easily, as you note) without TSX. The primary case -- when the lock is fine-grained, and thread A does mutate a value that thread B does eventually read, but B's read doesn't come until A is already out of the critical section -- is the big win for well-designed software, IMO.

> It's unclear to me whether recovering from mis-speculation is more expensive than acquiring the lock up-front. In other words, is there overhead in the case that you almost always encounter actual contention?

On big x86 cores, there shouldn't be much performance overhead. x86 already has mechanisms for holding tons of in-flight memory operations in the core without releasing and/or committing them to the rest of the memory system. TSX simply specifies a new reason for holding them ("we're in the middle of an SLE transaction"). If your transaction is aborted because another core touched the lock, you can flush that buffer of memory ops and roll back to a checkpointed register file almost instantly. The paper describes a hardware predictor that decides whether to initiate a transaction on a given store; in any smart implementation, this predictor would incorporate information like "have I incurred expensive rollbacks due to transaction aborts on this store?" In that way, a smart implementation should tune itself to yield arbitrarily low average- or worst-case overhead over the non-TSX case.

> It's also unclear to me how this can be done without instruction set support, as the paper claims. Without having the critical sections explicitly marked, how can this be implemented?

(No pedantry intended, I'm sure you already read this) It's explained in Section 3.3. They look for this pattern:

  Initially, MEM[x] = A
  STORE MEM[x] = B
  ...
  ...
  STORE MEM[x] = A (original value)

They consider anything between these "silent store pairs" to be a potential critical section, and the predictor can choose to try to elide the two stores and form a transaction, if it wishes. It is always OK to elide these two stores, whether it's a "real" critical section or not, as long as nobody tries to read MEM[x] in the middle (in the memory model they assume, anyway).

You definitely require ISA support if:

1) Your silent stores may have other side effects (I/O?)

2) You have synchronization primitives that don't fit the silent store pattern (ticket lock?)

3) You have a strong business interest in making the behavior of existing software, even at the microarchitectural level, as predictable as possible (Intel?)



