Module 7: Synchronization
  Lecture 14: Scalable Locks and Barriers
 


Centralized Barrier

  • How fast is it?
    • Assume that the program is perfectly balanced and hence all processors arrive at the barrier at the same time
    • Latency is proportional to P due to the critical section that updates the arrival counter (assuming the lock algorithm itself exhibits at most O(P) latency)
    • The traffic of the acquire section (the CS) depends on the lock algorithm; after everyone has settled into the waiting loop, the last arriving processor generates a BusRdX when it writes the release flag, and each waiting processor then generates a BusRd to re-read the flag before leaving the barrier: O(P) traffic for the release (see the sketch after this list)
    • Scalability turns out to be low, partly due to the serialized critical section and partly due to the O(P) traffic of the release
    • No fairness in terms of which processor exits the barrier first
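
  The following is a minimal sketch of a sense-reversing centralized barrier using C11 atomics. A single atomic fetch-and-add stands in for the lock-protected counter update described above; all names (central_barrier_t, central_barrier_wait, etc.) are illustrative, not from a particular library:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_int  count;      /* how many processors have arrived */
        atomic_bool release;    /* global sense flag, flipped every episode */
        int         num_procs;
    } central_barrier_t;

    void central_barrier_init(central_barrier_t *b, int num_procs) {
        atomic_init(&b->count, 0);
        atomic_init(&b->release, false);
        b->num_procs = num_procs;
    }

    /* local_sense is per-processor state, initialized to false and
       flipped on every barrier episode */
    void central_barrier_wait(central_barrier_t *b, bool *local_sense) {
        *local_sense = !*local_sense;
        /* the counter update is the serialized "critical section":
           all P arrivals serialize here, so arrival latency is O(P) */
        if (atomic_fetch_add(&b->count, 1) == b->num_procs - 1) {
            /* last arriver: reset the counter for the next episode,
               then write the release flag -- on a bus this one write
               is the BusRdX that invalidates every waiter's copy */
            atomic_store(&b->count, 0);
            atomic_store(&b->release, *local_sense);
        } else {
            /* everyone else spins; each re-read after the invalidation
               is a BusRd, giving the O(P) release traffic noted above */
            while (atomic_load(&b->release) != *local_sense)
                ;   /* spin */
        }
    }

  The sense reversal lets the same barrier object be reused across consecutive barrier episodes without re-initialization.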

Tree Barrier

  • Does not need a lock, only uses flags
    • Arrange the processors logically in a binary tree (higher degree also possible)
    • Two siblings communicate arrival via a simple flag (i.e., one waits on the flag while the other sets it on arrival)
    • One of them moves up the tree to participate in the next level of the barrier
    • Introduces concurrency in the barrier algorithm since independent subtrees can proceed in parallel
    • Takes log2(P) steps to complete the acquire (arrival) phase
    • A fixed processor (the root) starts a downward release pass, waking up other processors, which in turn set further flags down the tree (see the sketch after this list)
    • Shows much better scalability than the centralized barrier on DSM multiprocessors; the advantage on small bus-based systems is not much, since all transactions are serialized on the bus anyway; in fact, the additional log(P) delay may hurt performance on bus-based SMPs
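
  Below is a minimal sketch of a flag-only binary tree barrier in C11 atomics, under these assumptions: P is a power of two, processor ids run from 0 to P-1, and processor 0 acts as the fixed root that starts the release pass. The names (tree_barrier, arrive, wakeup) and array bounds are illustrative:

    #include <stdatomic.h>

    #define MAX_PROCS  64
    #define MAX_LEVELS 6            /* log2(MAX_PROCS) */

    /* arrive[l][i]: set by processor i's sibling at level l on arrival */
    static atomic_int arrive[MAX_LEVELS][MAX_PROCS];
    /* wakeup[i]: set during the downward release pass to wake processor i */
    static atomic_int wakeup[MAX_PROCS];

    void tree_barrier(int pid, int P) {
        int level = 0;

        /* Upward (arrival) pass: at each level one sibling signals and
           drops out; the other waits and climbs. Takes log2(P) steps. */
        while ((1 << level) < P) {
            if (pid & (1 << level)) {
                /* "Right" sibling: tell my partner I have arrived,
                   then wait to be woken by the downward pass */
                atomic_store(&arrive[level][pid & ~(1 << level)], 1);
                while (atomic_load(&wakeup[pid]) == 0)
                    ;                            /* spin on my flag */
                atomic_store(&wakeup[pid], 0);   /* reset for reuse */
                break;
            }
            /* "Left" sibling: wait for my partner, then climb a level */
            while (atomic_load(&arrive[level][pid]) == 0)
                ;                                /* spin */
            atomic_store(&arrive[level][pid], 0);  /* reset for reuse */
            level++;
        }

        /* Downward (release) pass: processor 0 falls out at the top;
           each awakened processor releases the partners it left
           waiting below, so independent subtrees wake up in parallel */
        for (int l = level - 1; l >= 0; l--)
            atomic_store(&wakeup[pid | (1 << l)], 1);
    }

  Each flag is reset by the processor that consumes it, so the same barrier can be reused across episodes; both the arrival and release passes take log2(P) steps, and no processor ever competes for a lock.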