1. [30 points] For each of the memory reference streams given below,
compare the cost of executing it on a bus-based SMP that supports
(a) the MESI protocol without cache-to-cache sharing, and (b) the Dragon protocol.
A read from processor N is denoted by rN while a write from processor N is
denoted by wN. Assume that all caches are empty to start with and that cache
hits take a single cycle, misses requiring upgrade or update take 60 cycles,
and misses requiring whole block transfer take 90 cycles. Assume that all
caches are writeback.
Stream1: r1 w1 r1 w1 r2 w2 r2 w2 r3 w3 r3 w3
Stream2: r1 r2 r3 w1 w2 w3 r1 r2 r3 w3 w1
Stream3: r1 r2 r3 r3 w1 w1 w1 w1 w2 w3
[For each stream for each protocol: 5 points]
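The cost arithmetic in Problem 1 can be sketched as a small helper, assuming each reference has already been classified by hand into one of the three categories given in the problem. The names `ref_kind`, `ref_cost`, and `stream_cost` are illustrative and not part of the assignment; deciding which category each reference falls into under MESI or Dragon is the actual exercise and is not done here.

```c
#include <stddef.h>

/* Hypothetical classification of one memory reference (names are assumptions). */
enum ref_kind {
    REF_HIT,               /* cache hit: 1 cycle */
    REF_UPGRADE_OR_UPDATE, /* miss requiring upgrade or update: 60 cycles */
    REF_BLOCK_TRANSFER     /* miss requiring whole block transfer: 90 cycles */
};

/* Cycle cost of a single reference, using the numbers stated in Problem 1. */
static int ref_cost(enum ref_kind k) {
    switch (k) {
    case REF_HIT:               return 1;
    case REF_UPGRADE_OR_UPDATE: return 60;
    case REF_BLOCK_TRANSFER:    return 90;
    }
    return 0; /* unreachable for valid input */
}

/* Total cost of a stream whose references were classified by hand. */
static long stream_cost(const enum ref_kind *refs, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += ref_cost(refs[i]);
    return total;
}
```

For instance, a made-up sequence of one block transfer, one hit, and one upgrade (not one of the streams above) would cost 90 + 1 + 60 = 151 cycles.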
2. [15 points] (a) As cache miss latency increases, does an update protocol
become more or less preferable compared to an invalidation-based protocol?
Explain.
(b) In a multi-level cache hierarchy, would you propagate updates all the way
to the first-level cache? What are the alternative design choices?
(c) Why is an update-based protocol not a good idea for multiprogramming
workloads running on SMPs?
3. [20 points] Assuming all variables to be initialized to zero, enumerate all
outcomes possible under sequential consistency for the following code segments.
(a) P1: A=1;
    P2: u=A; B=1;
    P3: v=B; w=A;
(b) P1: A=1;
    P2: u=A; v=B;
    P3: B=1;
    P4: w=B; x=A;
(c) P1: u=A; A=u+1;
    P2: v=A; A=v+1;
(d) P1: fetch-and-inc (A)
    P2: fetch-and-inc (A)
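As a reminder of what the primitive in part (d) does, here is a minimal sketch of fetch-and-inc written with C11 atomics. The function name `fetch_and_inc` is an assumption for illustration; the point is only that the read of the old value and the increment happen as one indivisible operation, in contrast to the separate load and store written out in part (c).

```c
#include <stdatomic.h>

/* Atomically increments *a and returns the value it held before the increment.
 * Sketch only: the name is hypothetical, atomic_fetch_add is standard C11. */
static int fetch_and_inc(atomic_int *a) {
    return atomic_fetch_add(a, 1);
}
```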
4. [30 points] Consider a quad SMP using a MESI protocol (without
cache-to-cache sharing). Each processor tries to acquire a test-and-set lock
to gain access to a null critical section. Assume that test-and-set
instructions always go on the bus and they take the same time as the normal
read transactions. The initial condition is such that processor 1 has the lock
and processors 2, 3, 4 are spinning on their caches waiting for the lock to be
released. Every processor gets the lock once, unlocks, and then exits the
program. Consider the bus transactions related to the lock/unlock operations
only.
(a) What is the least number of transactions executed to get from the initial
to the final state? [10 points]
(b) What is the worst-case number of transactions? [5 points]
(c) Answer the above two questions if the protocol is changed to Dragon. [15 points]
5. [30 points] Answer the above question for a test-and-test-and-set lock on a
16-processor SMP. The initial condition is such that the lock is released and
no processor has acquired the lock yet.
6. [10 points] If the lock variable is not allowed to be cached, how will the
traffic of a test-and-set lock compare against that of a test-and-test-and-set
lock?
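For reference, the two lock flavors named in Problems 4 through 6 can be sketched with C11 atomics as follows. This is a minimal illustration, not the course's reference implementation: the type `lock_word` and the function names are assumptions, and `atomic_exchange` stands in for a hardware test-and-set instruction.

```c
#include <stdatomic.h>

/* Hypothetical lock type: 0 means free, 1 means held. */
typedef struct { atomic_int held; } lock_word;

/* Test-and-set lock: every spin iteration performs the atomic
 * read-modify-write, whether or not the lock looks free. */
static void tas_acquire(lock_word *l) {
    while (atomic_exchange(&l->held, 1) != 0) {
        /* spin */
    }
}

/* Test-and-test-and-set lock: spin with ordinary loads until the lock
 * looks free, and only then attempt the atomic test-and-set. */
static void ttas_acquire(lock_word *l) {
    for (;;) {
        while (atomic_load(&l->held) != 0) {
            /* spin on plain reads */
        }
        if (atomic_exchange(&l->held, 1) == 0)
            return; /* observed the lock free and won it */
    }
}

/* Unlock is a plain store of 0 for both flavors. */
static void lock_release(lock_word *l) {
    atomic_store(&l->held, 0);
}
```

The only difference between the two acquire routines is the inner read-only loop, which is the part whose bus behavior the problems ask you to analyze.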
7. [15 points] You are given a bus-based shared memory machine. Assume that
the processors have a cache block size of 32 bytes and A is an array of
integers (four bytes each). You want to parallelize the following loop.
for (i=0; i<17; i++) {
    for (j=0; j<256; j++) {
        A[j] = do_something(A[j]);
    }
}
(a) Under what conditions would it be better to use a dynamically scheduled
loop?
(b) Under what conditions would it be better to use a statically scheduled
loop?
(c) For a dynamically scheduled inner loop, how many iterations should a
processor pick each time?
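To make the two scheduling styles in Problem 7 concrete, here is a hedged sketch of how the inner `j` loop might be divided among processors. All names (`static_range`, `grab_chunk`, `next_j`, `N_INNER`) are assumptions for illustration. The chunk size is left as a parameter because choosing it is what part (c) asks, and the shared counter would have to be reset before each pass of the outer loop.

```c
#include <stdatomic.h>

#define N_INNER 256 /* trip count of the inner j loop in the problem */

/* Static scheduling: processor p (0-based) of nprocs gets one fixed,
 * contiguous range [*lo, *hi) decided before the loop runs. */
static void static_range(int p, int nprocs, int *lo, int *hi) {
    int per = (N_INNER + nprocs - 1) / nprocs; /* ceiling division */
    *lo = p * per;
    if (*lo > N_INNER) *lo = N_INNER;
    *hi = *lo + per;
    if (*hi > N_INNER) *hi = N_INNER;
}

/* Dynamic scheduling: processors repeatedly claim the next `chunk`
 * iterations from a shared counter until the iterations run out. */
static atomic_int next_j;

/* Returns 1 and sets [*lo, *hi) if work was claimed, 0 if none is left. */
static int grab_chunk(int chunk, int *lo, int *hi) {
    int start = atomic_fetch_add(&next_j, chunk);
    if (start >= N_INNER)
        return 0;
    *lo = start;
    *hi = (start + chunk > N_INNER) ? N_INNER : start + chunk;
    return 1;
}
```

With static scheduling each processor calls `static_range` once; with dynamic scheduling each processor loops on `grab_chunk` and processes `A[lo..hi-1]` until it returns 0.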
8. [20 points] The following barrier implementation is wrong. Make as little
change as possible to correct it.
struct bar_struct {
    LOCKDEC(lock);
    int count;      // Initialized to zero
    int releasing;  // Initialized to zero
} bar;

void BARRIER (int P) {
    LOCK(bar.lock);
    bar.count++;
    if (bar.count == P) {
        bar.releasing = 1;
        bar.count--;
    }
    else {
        UNLOCK(bar.lock);
        while (!bar.releasing);
        LOCK(bar.lock);
        bar.count--;
        if (bar.count == 0) {
            bar.releasing = 0;
        }
    }
    UNLOCK(bar.lock);
}