Centralized barrier
- How fast is it?
- Assume that the program is perfectly balanced and hence all processors arrive at the barrier at the same time
- Latency is proportional to P due to the critical section (assume that the lock algorithm exhibits at most O(P) latency)
- The traffic of the acquire phase (the critical section) depends on the lock algorithm; once everyone has settled into the waiting loop, the last processor to arrive generates a BusRdX during release (the flag write) and every other processor subsequently generates a BusRd to re-read the flag before it can leave: O(P)
- Scalability turns out to be low, partly due to the serialized critical section and partly due to the O(P) release traffic (a sketch of a sense-reversing centralized barrier follows this list)
- No fairness in terms of who exits first
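The bullets above describe the classic sense-reversing centralized barrier built on top of a lock. Below is a minimal sketch assuming generic LOCK/UNLOCK spin-lock primitives from whatever lock algorithm is in use; the names CentralBarrier, barrier_lock, count, release_flag, local_sense and P_MAX are illustrative and not from these notes.

#define P_MAX 64                              /* illustrative upper bound on the number of processors */

volatile int count = 0;                       /* number of processors that have arrived */
volatile int release_flag = 0;                /* toggled by the last arriver in each barrier episode */
int local_sense[P_MAX];                       /* per-processor sense, all initially 0 */

/* barrier_lock, LOCK and UNLOCK come from the assumed lock algorithm */
void CentralBarrier (int pid, int P) {
  local_sense[pid] = !local_sense[pid];       /* reverse the sense for this episode */
  LOCK(&barrier_lock);                        /* serialized critical section: O(P) latency */
  if (++count == P) {                         /* last processor to arrive */
    count = 0;
    UNLOCK(&barrier_lock);
    release_flag = local_sense[pid];          /* release flag write: one BusRdX */
  } else {
    UNLOCK(&barrier_lock);
    while (release_flag != local_sense[pid]); /* waiters spin and re-read the flag: BusRd each */
  }
}

The single flag write by the last arriver plus the up to P-1 re-reads by the spinning waiters account for the O(P) release traffic noted above.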
Tree barrier
- Does not need a lock, only uses flags
- Arrange the processors logically in a binary tree (higher degree also possible)
- Two siblings tell each other of arrival via simple flags (i.e. one waits on a flag while the other sets it on arrival)
- One of them moves up the tree to participate in the next level of the barrier
- Introduces concurrency in the barrier algorithm since independent subtrees can proceed in parallel
- Takes log(P) steps to complete the acquire
- A fixed processor starts a downward release pass, waking up other processors, which in turn set further flags
- Shows much better scalability than the centralized barrier on DSM multiprocessors; the advantage in small bus-based systems is not much, since all transactions are serialized on the bus anyway; in fact, the additional log(P) delay may hurt performance in bus-based SMPs
TreeBarrier (pid, P) {
  /* flag is a shared (ideally volatile) P x MAX array of flags, all
     initialized to 0, with MAX = log2(P) + 1 */
  unsigned int i, mask;

  /* Acquire: climb one level for every low-order 1 bit of pid, waiting at
     level i for the partner at that level to signal arrival */
  for (i = 0, mask = 1; (mask & pid) != 0; ++i, mask <<= 1) {
    while (!flag[pid][i]);
    flag[pid][i] = 0;
  }
  /* Everyone except the last processor notifies its partner at level i and
     then waits on its own release flag */
  if (pid < (P - 1)) {
    flag[pid + mask][i] = 1;
    while (!flag[pid][MAX - 1]);
    flag[pid][MAX - 1] = 0;
  }
  /* Release: wake up the processors in the subtree below this one */
  for (mask >>= 1; mask > 0; mask >>= 1) {
    flag[pid - mask][MAX - 1] = 1;
  }
}
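A hypothetical driver sketch, assuming POSIX threads, showing how each of P processors would call the TreeBarrier above between computation phases; the flag declaration, NPROCS (playing the role of P), NUM_PHASES and worker are illustrative names, not from these notes.

#include <pthread.h>

#define NPROCS 8                        /* plays the role of P */
#define MAX 4                           /* log2(P) + 1 flags per processor */
#define NUM_PHASES 4

volatile char flag[NPROCS][MAX];        /* shared flags, all initially 0 */

void TreeBarrier (unsigned int pid, unsigned int P);    /* as defined above */

void *worker (void *arg) {
  unsigned int pid = (unsigned int) (unsigned long) arg;
  for (int phase = 0; phase < NUM_PHASES; ++phase) {
    /* ... this processor's work for the current phase ... */
    TreeBarrier(pid, NPROCS);           /* no processor starts the next phase early */
  }
  return NULL;
}

int main (void) {
  pthread_t t[NPROCS];
  for (unsigned long i = 0; i < NPROCS; ++i)
    pthread_create(&t[i], NULL, worker, (void *) i);
  for (int i = 0; i < NPROCS; ++i)
    pthread_join(t[i], NULL);
  return 0;
}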
- Convince yourself that this works
- Take 8 processors and arrange them at the leaves of a tree of depth 3
- You will find that only odd nodes move up at every level during acquire (implemented in the first for loop)
- The even nodes just set the flags (the first statement in the if block): they bail out of the first loop with mask=1
- The release is initiated by the last processor in the last for loop; only odd nodes execute this loop (7 wakes up 3, 5, 6; 5 wakes up 4; 3 wakes up 1, 2; 1 wakes up 0)
- Each processor will need at most log(P) + 1 flags
- Avoid false sharing: allocate each processor’s flags on a separate chunk of cache lines (see the layout sketch after this list)
- With some memory wastage (possibly worth it) allocate each processor’s flags on a separate page and map that page locally in that processor’s physical memory
- Avoids remote misses in DSM multiprocessors
- Does not matter in bus-based SMPs
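A minimal layout sketch for the flags used by TreeBarrier, assuming a 128-byte cache line, P = 8 and a GCC-style aligned attribute (all assumptions, not from these notes); it would replace the simple flag declaration in the driver sketch above while keeping the flag[pid][i] indexing unchanged.

#define NPROCS 8                        /* plays the role of P */
#define MAX 4                           /* log2(P) + 1 flags per processor */
#define CACHE_LINE 128                  /* assumed cache line size in bytes */

/* Each processor's flags get a full cache line to themselves: only the first
   MAX bytes of a row are used and the rest is padding, so spinning on
   flag[pid][i] never false-shares a line with another processor's flags */
volatile char flag[NPROCS][CACHE_LINE] __attribute__((aligned(CACHE_LINE)));

/* On a DSM machine one can go further: place each flag[pid] row on its own
   page and map that page in processor pid's local physical memory, trading
   memory wastage for the elimination of remote misses while spinning */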