Module 11: "Synchronization"
  Lecture 23: "Barriers and Speculative Synchronization"
 

Centralized barrier

  • How fast is it?
    • Assume that the program is perfectly balanced and hence all processors arrive at the barrier at the same time
    • Latency is proportional to P due to the critical section (assume that the lock algorithm exhibits at most O(P) latency)
    • The traffic of the acquire section (the critical section) depends on the lock algorithm; after everyone has settled into the waiting loop, the last processor to arrive generates a BusRdX during release (the flag write), and each waiting processor subsequently generates a BusRd when it re-reads the flag: O(P)
    • Scalability turns out to be low, partly due to the critical section and partly due to the O(P) traffic of release (a sketch of the algorithm follows this list)
    • No fairness in terms of who exits first
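
  A minimal sketch of the centralized barrier analyzed above, assuming LOCK/UNLOCK primitives, shared count and release variables, and a per-processor local_sense array (none of these names appear in the notes). The sense reversal lets the same release flag be reused safely across consecutive barrier episodes.

      CentralBarrier (pid, P) {
         local_sense[pid] = !local_sense[pid];   /* flip sense per episode */
         LOCK(bar_lock);                         /* arrivals serialize: O(P) */
         count = count + 1;
         if (count == P) {                       /* last processor to arrive */
            count = 0;                           /* reset before releasing */
            UNLOCK(bar_lock);
            release = local_sense[pid];          /* flag write: one BusRdX */
         }
         else {
            UNLOCK(bar_lock);
            while (release != local_sense[pid]); /* re-reads: O(P) BusRd */
         }
      }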

Tree barrier

  • Does not need a lock, only uses flags
    • Arrange the processors logically in a binary tree (higher degree also possible)
    • Two siblings tell each other of arrival via simple flags (i.e. one waits on a flag while the other sets it on arrival)
    • One of them moves up the tree to participate in the next level of the barrier
    • Introduces concurrency in the barrier algorithm since independent subtrees can proceed in parallel
    • Takes log(P) steps to complete the acquire
    • A fixed processor (pid P-1 in the code below) starts a downward release pass, waking up other processors, which in turn set further flags
    • Shows much better scalability compared to centralized barriers in DSM multiprocessors; the advantage in small bus-based systems is modest, since all transactions are serialized on the bus anyway; in fact, the additional log(P) delay may hurt performance in bus-based SMPs

      TreeBarrier (pid, P) {
         unsigned int i, mask;
         /* Acquire: climb one level for each low-order 1 bit of pid,
            waiting for the sibling's arrival flag at each level */
         for (i = 0, mask = 1; (mask & pid) != 0; ++i, mask <<= 1) {
            while (!flag[pid][i]);
            flag[pid][i] = 0;
         }
         /* Everyone except the last processor signals its partner one
            level up and then waits on its own release flag */
         if (pid < (P - 1)) {
            flag[pid + mask][i] = 1;
            while (!flag[pid][MAX - 1]);
            flag[pid][MAX - 1] = 0;
         }
         /* Release: wake up the processors in the subtree below */
         for (mask >>= 1; mask > 0; mask >>= 1) {
            flag[pid - mask][MAX - 1] = 1;
         }
      }
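
  A hypothetical usage sketch (worker and the phase functions are made-up names): each of the P processors calls TreeBarrier with its own pid, and flag must be a shared array initialized to zero, with MAX >= log(P) + 1 entries per processor.

      volatile unsigned char flag[P][MAX];   /* shared, initialized to 0 */

      worker (pid) {
         do_phase_one(pid);     /* work before the barrier */
         TreeBarrier(pid, P);   /* nobody proceeds until all P arrive */
         do_phase_two(pid);     /* work after the barrier */
      }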

  • Convince yourself that this works
  • Take 8 processors and arrange them on leaves of a tree of depth 3
  • You will find that only odd-numbered processors move up the tree during acquire (implemented in the first for loop): a processor keeps climbing as long as the next low-order bit of its pid is 1
  • The even-numbered processors just set the flags (the first statement in the if body): they bail out of the first loop immediately with mask=1
  • The release is initiated by the last processor (pid P-1) in the last for loop; only odd-numbered processors execute this loop (7 wakes up 3, 5, 6; 5 wakes up 4; 3 wakes up 1, 2; 1 wakes up 0)
  • Each processor needs at most log(P) + 1 flags
  • Avoid false sharing: allocate each processor’s flags on a separate chunk of cache lines (see the sketch after this list)
  • With some memory wastage (possibly worth it), allocate each processor’s flags on a separate page and map that page in that processor’s local physical memory
    • Avoids remote misses in DSM multiprocessors
    • Does not matter in bus-based SMPs
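
  One way to realize the cache-line layout suggested above (a sketch; LINE_SIZE is an assumed constant and the alignment attribute is GCC-specific): aligning each processor's flag vector to a cache-line boundary pads it to a full line, so a write to one processor's flags never invalidates another processor's copy.

      #define LINE_SIZE 64   /* assumed cache-line size in bytes */

      struct proc_flags {
         volatile unsigned char f[MAX];   /* MAX = log(P) + 1 flags */
      } __attribute__((aligned(LINE_SIZE)));

      struct proc_flags flag[P];   /* use flag[pid].f[i] inside TreeBarrier */

  For the page-grained variant, each processor's row would instead come from a page-aligned allocation (e.g., posix_memalign with the page size) so that first-touch placement can map it into the owning processor's local memory.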