Exercise 2

These problems should be tried after module 12 is completed.

1. [30 points] For each of the memory reference streams given below, compare the cost of executing it on a bus-based SMP that supports (a) the MESI protocol without cache-to-cache sharing, and (b) the Dragon protocol. A read by processor N is denoted rN and a write by processor N is denoted wN. Assume that all caches are initially empty, that cache hits take a single cycle, that misses requiring an upgrade or update take 60 cycles, and that misses requiring a whole-block transfer take 90 cycles. Assume that all caches are write-back.

Stream1: r1 w1 r1 w1 r2 w2 r2 w2 r3 w3 r3 w3
Stream2: r1 r2 r3 w1 w2 w3 r1 r2 r3 w3 w1
Stream3: r1 r2 r3 r3 w1 w1 w1 w1 w2 w3

[For each stream for each protocol: 5 points]
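The cycle accounting for the MESI case can be sketched in code. The following is only one possible reading of the cost model, under these assumptions (none of which the exercise states explicitly): every reference in a stream targets the same cache block, a write hit in the shared state is charged the 60-cycle upgrade cost, and a read miss to a block held modified elsewhere is charged a plain 90-cycle block transfer. The names mesi_stream_cost, NPROC and the cost constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum mesi_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };
enum { NPROC = 3, HIT_COST = 1, UPGRADE_COST = 60, BLOCK_COST = 90 };

/* Total cycles for a stream such as "r1 w1 r2", assuming every
 * reference goes to one cache block (an assumption, see lead-in). */
long mesi_stream_cost(const char *stream)
{
    assert(stream != NULL);
    enum mesi_state st[NPROC] = { INVALID, INVALID, INVALID };
    long cycles = 0;

    for (const char *s = stream; *s != '\0'; s++) {
        if (*s != 'r' && *s != 'w')
            continue;                      /* skip blanks and digits */
        int p = s[1] - '1';                /* processor ids are 1-based */
        if (p < 0 || p >= NPROC)
            continue;                      /* ignore malformed tokens */

        if (*s == 'r') {
            if (st[p] != INVALID) {        /* read hit in S, E or M */
                cycles += HIT_COST;
                continue;
            }
            int others = 0;                /* read miss: whole-block fetch */
            for (int q = 0; q < NPROC; q++) {
                if (q != p && st[q] != INVALID) {
                    st[q] = SHARED;        /* M or E holders drop to S */
                    others = 1;
                }
            }
            st[p] = others ? SHARED : EXCLUSIVE;
            cycles += BLOCK_COST;
        } else {
            if (st[p] == INVALID)
                cycles += BLOCK_COST;      /* write miss: read-exclusive */
            else if (st[p] == SHARED)
                cycles += UPGRADE_COST;    /* upgrade, no data transfer */
            else
                cycles += HIT_COST;        /* E or M: silent write hit */
            for (int q = 0; q < NPROC; q++)
                if (q != p)
                    st[q] = INVALID;       /* invalidate other copies */
            st[p] = MODIFIED;
        }
    }
    return cycles;
}
```

As a check on a short stream that is not one of the three above, mesi_stream_cost("r1 w1 r2 w2") charges 90 + 1 + 90 + 60 = 241 cycles under these assumptions. The Dragon case needs a different state machine (updates instead of invalidations) and is not covered by this sketch.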

2. [15 points] (a) As cache miss latency increases, does an update protocol become more or less preferable compared to an invalidation-based protocol? Explain.

(b) In a multi-level cache hierarchy, would you propagate updates all the way to the first-level cache? What are the alternative design choices?

(c) Why is an update-based protocol not a good idea for multiprogramming workloads running on SMPs?

3. [20 points] Assuming all variables to be initialized to zero, enumerate all outcomes possible under sequential consistency for the following code segments.

(a) P1: A=1;
P2: u=A; B=1;
P3: v=B; w=A;

(b) P1: A=1;
P2: u=A; v=B;
P3: B=1;
P4: w=B; x=A;

(c) P1: u=A; A=u+1;
P2: v=A; A=v+1;

(d) P1: fetch-and-inc (A)
P2: fetch-and-inc (A)
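Enumerating sequentially consistent outcomes amounts to trying every interleaving that preserves each processor's program order. The sketch below does this for a small example that is deliberately not one of the segments above: P1 executes A=1; u=B; and P2 executes B=1; v=A; with all variables initially zero. The function names explore and count_sc_outcomes are hypothetical, and the approach (exhaustive recursion over program counters) is just one way to organize the search.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical example program (not from the exercise):
 *   P1: A = 1; u = B;
 *   P2: B = 1; v = A;
 * pc1 and pc2 count how many operations each processor has completed.
 * seen[u][v] is set once some interleaving ends with those values. */
static void explore(int pc1, int pc2, int A, int B, int u, int v,
                    bool seen[2][2])
{
    assert(pc1 >= 0 && pc1 <= 2 && pc2 >= 0 && pc2 <= 2);
    if (pc1 == 2 && pc2 == 2) {
        seen[u][v] = true;
        return;
    }
    if (pc1 == 0)
        explore(1, pc2, 1, B, u, v, seen);   /* P1: A = 1 */
    else if (pc1 == 1)
        explore(2, pc2, A, B, B, v, seen);   /* P1: u = B */

    if (pc2 == 0)
        explore(pc1, 1, A, 1, u, v, seen);   /* P2: B = 1 */
    else if (pc2 == 1)
        explore(pc1, 2, A, B, u, A, seen);   /* P2: v = A */
}

/* Fills seen[][] and returns the number of distinct (u, v) outcomes. */
int count_sc_outcomes(bool seen[2][2])
{
    int count = 0;
    for (int u = 0; u < 2; u++)
        for (int v = 0; v < 2; v++)
            seen[u][v] = false;
    explore(0, 0, 0, 0, 0, 0, seen);
    for (int u = 0; u < 2; u++)
        for (int v = 0; v < 2; v++)
            if (seen[u][v])
                count++;
    return count;
}
```

For this example the search finds three outcomes, (u,v) = (0,1), (1,0) and (1,1); the outcome (0,0) never appears because it would require each read to precede the other processor's write, which contradicts program order. The same pattern extends to more processors by adding one program counter per processor.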

4. [30 points] Consider a quad SMP using a MESI protocol (without cache-to-cache sharing). Each processor tries to acquire a test-and-set lock to gain access to a null critical section. Assume that test-and-set instructions always go on the bus and they take the same time as the normal read transactions. The initial condition is such that processor 1 has the lock and processors 2, 3, 4 are spinning on their caches waiting for the lock to be released. Every processor gets the lock once, unlocks, and then exits the program. Consider the bus transactions related to the lock/unlock operations only.

(a) What is the least number of transactions executed to get from the initial to the final state? [10 points]

(b) What is the worst-case number of transactions? [5 points]

(c) Answer the above two questions if the protocol is changed to Dragon. [15 points]

5. [30 points] Answer question 4 for a test-and-test-and-set lock on a 16-processor SMP. The initial condition is such that the lock is free and no processor has acquired it yet.

6. [10 points] If the lock variable is not allowed to be cached, how will the traffic of a test-and-set lock compare against that of a test-and-test-and-set lock?

7. [15 points] You are given a bus-based shared memory machine. Assume that the processors have a cache block size of 32 bytes and A is an array of integers (four bytes each). You want to parallelize the following loop.

for (i=0; i<17; i++) {
    for (j=0; j<256; j++) {
        A[j] = do_something(A[j]);
    }
}

(a) Under what conditions would it be better to use a dynamically scheduled loop?

(b) Under what conditions would it be better to use a statically scheduled loop?

(c) For a dynamically scheduled inner loop, how many iterations should a processor pick each time?

8. [20 points] The following barrier implementation is wrong. Make as little change as possible to correct it.

struct bar_struct {
    LOCKDEC(lock);
    int count;      // Initialized to zero
    int releasing;  // Initialized to zero
} bar;

void BARRIER (int P) {
    LOCK(bar.lock);
    bar.count++;
    if (bar.count == P) {
        bar.releasing = 1;
        bar.count--;
    }
    else {
        UNLOCK(bar.lock);
        while (!bar.releasing);
        LOCK(bar.lock);
        bar.count--;
        if (bar.count == 0) {
            bar.releasing = 0;
        }
    }
    UNLOCK(bar.lock);
}