Module 7: Synchronization
  Lecture 14: Scalable Locks and Barriers
 


Point-to-point Synch.

  • Normally done in software with flags

P0: A = 1; flag = 1;
P1: while (!flag); print A;

  • Some old machines supported full/empty bits in memory
    • Each memory location is augmented with a full/empty bit
    • Producer writes the location only if the bit is reset (empty); the write sets the bit
    • Consumer reads the location only if the bit is set (full); the read resets the bit
    • Much less flexible: supports only one-producer/one-consumer sharing (whereas one-producer/many-consumers is very common); also, all accesses to a memory location become synchronized (unless the compiler flags some accesses as special)
  • Possible optimization for shared memory
    • Allocate the flag and the data it guards (if small) in the same cache line, e.g., flag and A in the example above, so that a single cache-line transfer delivers both

Barrier

  • High-level classification of barriers
    • Hardware and software barriers
  • Will focus on two types of software barriers
    • Centralized barrier: every processor polls a single count
    • Distributed tree barrier: shows much better scalability
  • Performance goals of a barrier implementation
    • Low latency: After all processors have arrived at the barrier, they should be able to leave quickly
    • Low traffic: Minimize bus transactions and contention
    • Scalability: Latency and traffic should scale slowly with the number of processors
    • Low storage: Barrier state should occupy little memory
    • Fairness: No particular processor should always be the last one to exit; exit order could preserve some fixed discipline (e.g., FIFO by arrival order)