Module 11: "Synchronization"
  Lecture 23: "Barriers and Speculative Synchronization"
 

Centralized barrier

  • How fast is it?
    • Assume that the program is perfectly balanced and hence all processors arrive at the barrier at the same time
    • Latency is proportional to P due to the critical section (assume that the lock algorithm exhibits at most O(P) latency)
    • The traffic of the acquire section (the critical section) depends on the lock algorithm; after everyone has settled into the waiting loop, the last processor to arrive generates a BusRdX during release (the flag write), and each waiting processor subsequently generates a BusRd when it re-reads the flag: O(P)
    • Scalability turns out to be low, partly due to the critical section and partly due to the O(P) traffic of release (a sketch of the algorithm follows this list)
    • No fairness in terms of who exits first
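
  A minimal sketch of the centralized barrier analyzed above, assuming LOCK/UNLOCK primitives, shared count and release variables, and a per-processor local_sense array (none of these names appear in the notes). The sense reversal lets the same release flag be reused safely across consecutive barrier episodes.

      CentralBarrier (pid, P) {
         local_sense[pid] = !local_sense[pid];   /* flip sense per episode */
         LOCK(bar_lock);                         /* arrivals serialize: O(P) */
         count = count + 1;
         if (count == P) {                       /* last processor to arrive */
            count = 0;                           /* reset before releasing */
            UNLOCK(bar_lock);
            release = local_sense[pid];          /* flag write: one BusRdX */
         }
         else {
            UNLOCK(bar_lock);
            while (release != local_sense[pid]); /* re-reads: O(P) BusRd */
         }
      }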

Tree barrier

  • Does not need a lock, only uses flags
    • Arrange the processors logically in a binary tree (higher degree also possible)
    • Two siblings tell each other of arrival via simple flags (i.e. one waits on a flag while the other sets it on arrival)
    • One of them moves up the tree to participate in the next level of the barrier
    • Introduces concurrency in the barrier algorithm since independent subtrees can proceed in parallel
    • Takes log(P) steps to complete the acquire
    • A fixed processor (pid P-1 in the code below) starts a downward release pass, waking up other processors, which in turn set further flags
    • Shows much better scalability compared to centralized barriers in DSM multiprocessors; the advantage in small bus-based systems is modest, since all transactions are serialized on the bus anyway; in fact, the additional log(P) delay may hurt performance in bus-based SMPs

      TreeBarrier (pid, P) {
         unsigned int i, mask;
         /* Acquire: climb one level for each low-order 1 bit of pid,
            waiting for the sibling's arrival flag at each level */
         for (i = 0, mask = 1; (mask & pid) != 0; ++i, mask <<= 1) {
            while (!flag[pid][i]);
            flag[pid][i] = 0;
         }
         /* Everyone except the last processor signals its partner one
            level up and then waits on its own release flag */
         if (pid < (P - 1)) {
            flag[pid + mask][i] = 1;
            while (!flag[pid][MAX - 1]);
            flag[pid][MAX - 1] = 0;
         }
         /* Release: wake up the processors in the subtree below */
         for (mask >>= 1; mask > 0; mask >>= 1) {
            flag[pid - mask][MAX - 1] = 1;
         }
      }
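
  A hypothetical usage sketch (worker and the phase functions are made-up names): each of the P processors calls TreeBarrier with its own pid, and flag must be a shared array initialized to zero, with MAX >= log(P) + 1 entries per processor.

      volatile unsigned char flag[P][MAX];   /* shared, initialized to 0 */

      worker (pid) {
         do_phase_one(pid);     /* work before the barrier */
         TreeBarrier(pid, P);   /* nobody proceeds until all P arrive */
         do_phase_two(pid);     /* work after the barrier */
      }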

  • Convince yourself that this works
  • Take 8 processors and arrange them on leaves of a tree of depth 3
  • You will find that only odd-numbered processors move up the tree during acquire (implemented in the first for loop): a processor keeps climbing as long as the next low-order bit of its pid is 1
  • The even-numbered processors just set the flags (the first statement in the if body): they bail out of the first loop immediately with mask=1
  • The release is initiated by the last processor (pid P-1) in the last for loop; only odd-numbered processors execute this loop (7 wakes up 3, 5, 6; 5 wakes up 4; 3 wakes up 1, 2; 1 wakes up 0)
  • Each processor needs at most log(P) + 1 flags
  • Avoid false sharing: allocate each processor’s flags on a separate chunk of cache lines (see the sketch after this list)
  • With some memory wastage (possibly worth it), allocate each processor’s flags on a separate page and map that page in that processor’s local physical memory
    • Avoids remote misses in DSM multiprocessors
    • Does not matter in bus-based SMPs
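
  One way to realize the cache-line layout suggested above (a sketch; LINE_SIZE is an assumed constant and the alignment attribute is GCC-specific): aligning each processor's flag vector to a cache-line boundary pads it to a full line, so a write to one processor's flags never invalidates another processor's copy.

      #define LINE_SIZE 64   /* assumed cache-line size in bytes */

      struct proc_flags {
         volatile unsigned char f[MAX];   /* MAX = log(P) + 1 flags */
      } __attribute__((aligned(LINE_SIZE)));

      struct proc_flags flag[P];   /* use flag[pid].f[i] inside TreeBarrier */

  For the page-grained variant, each processor's row would instead come from a page-aligned allocation (e.g., posix_memalign with the page size) so that first-touch placement can map it into the owning processor's local memory.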