Module 11: "Synchronization"
  Lecture 22: "Scalable Locking Primitives"
 

Store conditional & OOO

  • Execution of SC in an OOO pipeline
    • Rather subtle
    • For now assume that SC issues only when it comes to the head of the ROB, i.e., SC executes non-speculatively
    • It first checks the load_linked bit; if the bit is reset, the SC does not even access the cache (saving cache bandwidth and avoiding unnecessary bus transactions) and simply returns zero in its destination register
    • If the load_linked bit is set, it accesses the cache and issues a bus transaction if needed (BusReadX if the cache line is in the I state, BusUpgr if it is in the S state)
    • It checks the load_linked bit again before writing to the cache (note that the cache line goes to the M state in either case)
    • It can wake up dependent instructions only when the SC graduates (a case where a store initiates a dependence chain)
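The flow above can be sketched as a toy software model (the types and names here are hypothetical; real hardware does all of this in the pipeline and cache controller):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the SC flow described above. */
typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

typedef struct {
    bool ll_bit;          /* load-linked bit set by a prior LL */
    line_state_t state;   /* MSI-style state of the cache line */
    uint32_t data;
} cache_line_t;

/* Returns 1 on success (value written), 0 on failure. */
int sc_at_rob_head(cache_line_t *line, uint32_t value) {
    /* Step 1: if the load_linked bit is already reset, fail without
     * even touching the cache -- no bus transaction is generated. */
    if (!line->ll_bit)
        return 0;

    /* Step 2: obtain write permission (BusReadX if INVALID,
     * BusUpgr if SHARED); the line ends up MODIFIED either way. */
    if (line->state != MODIFIED)
        line->state = MODIFIED;   /* bus transaction happens here */

    /* Step 3: recheck the load_linked bit -- an intervening
     * invalidation (modeled elsewhere) could have cleared it while
     * we waited for the bus. */
    if (!line->ll_bit)
        return 0;

    line->data = value;
    line->ll_bit = false;         /* consume the reservation */
    return 1;
}
```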

Speculative SC?

  • What happens if SC is issued speculatively?
    • The actual store happens only when the SC graduates; issuing it early only starts the write-permission acquisition
    • Suppose two processors are contending for a lock
    • Both execute LL and succeed because nobody is in the critical section
    • Both issue SC speculatively, and for some reason the graduation of SC gets delayed in both
    • So although initially both may get the line one after the other in the M state in their caches, the load_linked bit will have been reset in both by the time SC tries to graduate
    • They go back and start over with LL, and may again issue SC speculatively, leading to a livelock (the probability of this type of livelock increases with more processors)
    • Speculative issue of SC with a hardwired backoff may help
    • It is better to turn off speculation for SC
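The backoff idea can be sketched in software; here a CAS stands in for the LL/SC pair, and the delay cap of 1024 iterations is an arbitrary assumption:

```c
#include <stdatomic.h>

/* Retry with an exponentially growing delay between attempts, so
 * contending processors fall out of lockstep instead of repeatedly
 * invalidating each other's reservations. */
void lock_with_backoff(atomic_int *lock) {
    unsigned delay = 1;
    for (;;) {
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(
                lock, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            return;                                /* acquired */
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                      /* backoff delay */
        if (delay < 1024)
            delay <<= 1;                           /* exponential growth */
    }
}

void unlock(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);
}
```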
  • What about the branch following SC?
    • Can we speculate past that branch?
    • Assume the branch predictor says the branch is not taken, i.e., falls through: we speculatively venture into the critical section
    • We speculatively execute the instructions of the critical section
    • This can be good or bad
    • If the branch prediction was correct, we did great: part of the critical section's latency is hidden
    • If the predictor went wrong, we may have interfered with the execution of the processor that is actually in the critical section: this can cause unnecessary invalidations and extra traffic
    • Any correctness issues?
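For reference, the lock-acquire sequence under discussion, with its two branches marked (a C11 sketch; CAS stands in for the LL/SC pair):

```c
#include <stdatomic.h>

/* The do-while's closing branch is "the branch following SC":
 * predicting it not-taken lets the core fall through and speculate
 * straight into the critical section. */
void acquire(atomic_int *lock) {
    int expected;
    do {
        /* spin on LL until the lock looks free (branch #1) */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            ;
        expected = 0;
        /* SC attempt; failure branches back to retry (branch #2) */
    } while (!atomic_compare_exchange_weak_explicit(
                 lock, &expected, 1,
                 memory_order_acquire, memory_order_relaxed));
    /* fall-through == "SC succeeded": critical section begins here */
}

void release(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);
}
```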

Point-to-point synch.

  • Normally done in software with flags

    P0: A = 1; flag = 1;
    P1: while (!flag); print A;
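Written with C11 atomics, the ordering this idiom relies on becomes explicit (a sketch: producer() would run on P0 and consumer() on P1; on real hardware, plain variables would need fences):

```c
#include <stdatomic.h>

static int A;
static atomic_int flag;

void producer(void) {        /* P0 */
    A = 1;
    /* release store: A's write becomes visible before flag's */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

int consumer(void) {         /* P1 */
    /* acquire loads implement the "while (!flag);" spin */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    return A;                /* guaranteed to observe A == 1 */
}
```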

  • Some old machines supported full/empty bits in memory
    • Each memory location is augmented with a full/empty bit
    • The producer writes the location only if the bit is reset (empty), and the write sets it
    • The consumer reads the location only if the bit is set (full), and the read resets it
    • A lot less flexible: it supports only one-producer/one-consumer sharing (while one-producer/many-consumers is very popular), and all accesses to a memory location become synchronized (unless the compiler flags some accesses as special)
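A minimal software emulation of full/empty semantics (a hypothetical layout; the old machines kept the bit in memory hardware, one per word):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A memory word augmented with a full/empty bit. */
typedef struct {
    atomic_bool full;
    int value;
} fe_word_t;

/* Producer side: write only if the bit is reset (empty). */
bool fe_write(fe_word_t *w, int v) {
    if (atomic_load_explicit(&w->full, memory_order_acquire))
        return false;           /* still full: producer must wait */
    w->value = v;
    atomic_store_explicit(&w->full, true, memory_order_release);
    return true;
}

/* Consumer side: read only if the bit is set, then reset it. */
bool fe_read(fe_word_t *w, int *out) {
    if (!atomic_load_explicit(&w->full, memory_order_acquire))
        return false;           /* empty: consumer must wait */
    *out = w->value;
    atomic_store_explicit(&w->full, false, memory_order_release);
    return true;
}
```

Note that, as the slide says, this only works for one producer and one consumer; with two consumers, both could see the bit set and race on the reset.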
  • Possible optimization for shared memory
    • Allocate the flag and the data structures (if small) guarded by the flag in the same cache line, e.g., flag and A in the example above, so that one miss brings in both
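A sketch of this layout, assuming a 64-byte cache line (line size is machine-dependent):

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Flag and the small guarded datum co-located in one aligned cache
 * line: the consumer's single coherence miss on flag also brings
 * in A, so the subsequent read of A hits in the cache. */
typedef struct {
    alignas(64) atomic_int flag;  /* the synchronization flag */
    int a;                        /* the guarded datum "A" */
} flag_and_data_t;
```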