1. [5+5] Consider a 512-node system. Each node has 4 GB of main memory. The cache
block size is 128 bytes. What is the total directory memory size for
(a) bitvector, (b) Dir_iB with i = 3?
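The arithmetic for this problem can be set up as in the sketch below. It rests on assumptions not stated in the problem: 4 GB is taken as 2^32 bytes, there is one directory entry per memory block, and only the sharer-tracking bits are counted (state bits such as dirty or broadcast are ignored). The helper name `directory_bits` is hypothetical.

```python
def directory_bits(nodes, mem_per_node_bytes, block_bytes, bits_per_entry):
    """Total directory size in bits, assuming one entry per memory block."""
    blocks_per_node = mem_per_node_bytes // block_bytes
    return nodes * blocks_per_node * bits_per_entry

NODES = 512
MEM_PER_NODE = 4 * 2**30          # assumption: 4 GB means 2^32 bytes
BLOCK = 128

# A pointer must be able to name any of the 512 nodes.
POINTER_BITS = (NODES - 1).bit_length()

# (a) Full bit vector: one presence bit per node in every entry.
bitvector_bits = directory_bits(NODES, MEM_PER_NODE, BLOCK, NODES)

# (b) Dir_iB with i = 3: three node pointers per entry
# (assumption: no extra state or broadcast bit is counted).
dir3b_bits = directory_bits(NODES, MEM_PER_NODE, BLOCK, 3 * POINTER_BITS)

print(bitvector_bits // 8, "bytes for the bit-vector directory")
print(dir3b_bits // 8, "bytes for the Dir_3B directory")
```

If state bits are to be included, they can be added to `bits_per_entry` in both calls.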
2. [5+5+5+10] For a simple two-processor NUMA system, the number of cache misses to three virtual pages X, Y, Z is as follows.
Page X: P0 has 14 misses, P1 has 11 misses. P1 takes the first miss.
Page Y: P0 has zero misses, P1 has 18 misses.
Page Z: P0 has 15 misses, P1 has 9 misses. P0 takes the first miss.
The remote-to-local miss latency ratio is four. Evaluate the aggregate time
spent in misses by the two processors under each of the following policies.
Assume that a local miss takes 400 cycles.
(a) First touch placement.
(b) All three pages on P0.
(c) All three pages on P1.
(d) Best possible application-directed static placement, i.e., a one-time call
to the OS to place the three pages.
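One way to organize the cost model for this problem is sketched below. It assumes that every miss costs either the local or the remote latency (no extra page-fault or migration cost), and that page Y is first touched by P1 because P0 never misses on it. The names `MISSES`, `aggregate_cycles`, and `best_static_placement` are hypothetical.

```python
LOCAL = 400            # cycles per local miss
REMOTE = 4 * LOCAL     # remote-to-local latency ratio is four

# Misses per page as (P0 misses, P1 misses).
MISSES = {"X": (14, 11), "Y": (0, 18), "Z": (15, 9)}

# Home node of each page under first touch
# (assumption: Y is first touched by P1, since P0 has no misses to it).
FIRST_TOUCH = {"X": 1, "Y": 1, "Z": 0}

def aggregate_cycles(placement):
    """Sum the miss cycles of both processors for a page -> home-node map."""
    total = 0
    for page, home in placement.items():
        for proc, count in enumerate(MISSES[page]):
            total += count * (LOCAL if proc == home else REMOTE)
    return total

def best_static_placement():
    """Pages are independent, so each is homed where its own cost is lowest."""
    return {
        page: min((0, 1), key=lambda home, p=page: aggregate_cycles({p: home}))
        for page in MISSES
    }

policies = {
    "first touch": FIRST_TOUCH,
    "all on P0": {page: 0 for page in MISSES},
    "all on P1": {page: 1 for page in MISSES},
    "best static": best_static_placement(),
}
for name, placement in policies.items():
    print(name, placement, aggregate_cycles(placement))
```

Choosing each page's home independently is valid here only because the cost of one page does not depend on where the others are placed.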
3. [30] Suppose you want to transpose a matrix in parallel. The source matrix is A
and the destination matrix is B. Both A and B are decomposed in such a way
that each node gets a chunk of consecutive rows. Application-directed
page placement is used to map the pages belonging to each chunk to the local memory of the respective nodes. Now the transpose can be done in two
ways. The first algorithm, known as the "local read algorithm", allows each
node to transpose the band of rows local to it. So naturally this algorithm
involves a large fraction of remote writes to matrix B. Assume that the band
is wide enough so that there is no false sharing when writing to matrix B.
The second algorithm, known as the "local write algorithm", allows each
node to transpose a band of columns of A such that all the writes to B are
local. Naturally, this algorithm involves a large number of remote reads in
matrix A. Assume that the algorithms are properly tiled so that the cache
utilization is good. In both cases, before doing the transpose, every node
reads and writes to its local segment in A and after doing the transpose every
node reads and writes to its local segment in B. Assuming an
invalidation-based cache coherence protocol, briefly but clearly explain which
algorithm is expected to deliver better performance. How much synchronization
does each algorithm require (in terms of the number of critical sections and
barriers)? Assume that the caches are of infinite capacity and that a remote write is equally expensive in all respects as a remote read because in both
cases the retirement is held up for a sequentially consistent implementation.
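To make the two access patterns concrete, the sketch below counts element-level remote reads and writes for each algorithm. It is only an illustration of who touches what: it ignores caches, coherence traffic, tiling, and the pre- and post-transpose phases, so it does not by itself say which algorithm performs better. It assumes a square n x n matrix with n divisible by the node count; the function name `remote_accesses` is hypothetical.

```python
def remote_accesses(n, nodes, algorithm):
    """Count (remote_reads, remote_writes) at element granularity.

    Rows of both A and B are split into equal bands, one band per node.
    """
    band = n // nodes              # assumption: n is divisible by nodes

    def owner(row):
        return row // band         # node whose local memory holds this row

    remote_reads = remote_writes = 0
    for p in range(nodes):
        for i in range(p * band, (p + 1) * band):
            for j in range(n):
                if algorithm == "local_read":
                    # Node p reads A[i][j] (row i is local) and
                    # writes B[j][i] (row j may belong to another node).
                    if owner(j) != p:
                        remote_writes += 1
                else:
                    # Node p writes B[i][j] (row i is local) and
                    # reads A[j][i] (row j may belong to another node).
                    if owner(j) != p:
                        remote_reads += 1
    return remote_reads, remote_writes

print(remote_accesses(8, 4, "local_read"))
print(remote_accesses(8, 4, "local_write"))
```

The raw counts are symmetric between the two algorithms, so any performance difference has to come from how the coherence protocol treats the lines touched before and after the transpose.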
4. [5+5] If a compiler reorders accesses according to WO and the underlying
processor is SC, what is the consistency model observed by the programmer?
What if the compiler produces SC code, but the processor is RC?
5. [5+5] Consider the following piece of code running on a faulty microprocessor
system which does not preserve any program order other than true data and
control dependence order.
LOCK (L1)
load A
store A
UNLOCK (L1)
load B
store B
LOCK (L1)
load C
store C
UNLOCK (L1)
(a) Insert appropriate WMB and MB instructions to enforce SC. Do not over-insert
anything.
(b) Repeat part (a) to enforce RC.
6. [5+10] Consider implementing a directory-based protocol to keep the private L1
caches of the cores in a chip-multiprocessor (CMP) coherent. Assume that the
CMP has a shared L2 cache. The most natural way of doing it is to attach a
directory to the tag of each L2 cache block. An L1 cache miss gets forwarded
to the L2 cache. The directory entry is looked up in parallel with the L2
cache block. Does this implementation suit an inclusive hierarchy or an exclusive
hierarchy? Explain. If the cache block sizes of the L1 cache and the L2 cache
are different, explain briefly how you will manage the state of the L2 cache
block on receiving a writeback from the L1 cache. Do not assume a per-sector
directory entry in this design.