1. [10 points] Suppose you are given a program that does a fixed amount of
work, and some fraction s of that work must be done sequentially. The
remaining portion of the work is perfectly parallelizable on P processors.
Derive a formula for execution time on P processors and establish an upper
bound on the achievable speedup.
2. [40 points] Suppose you want to transfer n bytes from a source node S to a
destination node D, and there are H links between S and D. There are therefore
H+1 routers on the path (including the ones in S and D).
Suppose W is the node-to-network bandwidth at each router. At S, you need
n/W time to copy the message into the router buffer. Similarly, copying the
message from the buffer of the router in S to the buffer of the next router on
the path takes another n/W time. Under a store-and-forward protocol, the total
time spent on these copy operations is (H+2)n/W (one copy into the router at
S, H copies across the links, and one copy out of the router at D), and the
data ends up in a memory buffer in D. In addition, each router spends R
amount of time determining the exit port. So the total time taken to transfer
n bytes from S to D under a store-and-forward protocol is (H+2)n/W+(H+1)R. On
the other hand, under a cut-through protocol the critical path is
just n/W+(H+1)R. Here we assume the best possible scenario, in which only the
header routing delay at each router and the startup n/W delay at S are
exposed; the rest is pipelined. Now suppose that you are asked to
compare the performance of these two routing protocols on an 8x8 grid. Compute
the maximum, minimum, and average latency to transfer an n-byte message in this
topology for both protocols. Assume the following values: W=3.2 GB/s and
R=10 ns. Compute for n=64 bytes and n=256 bytes. Note that for each protocol
you will have three answers (maximum, minimum, average) for each value of n.
Here GB means 10^9 bytes, not 2^30 bytes.
3. [20 points] Consider a simple computation on an n x n matrix of doubles
(each element is 8 bytes) where each element A[i][j] is modified as follows:
A[i][j] = A[i][j] - (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1])/4.
Suppose you assign one matrix element to one processor (i.e. you have n^2
processors). Compute the total amount of data communication between processors.
4. [30 points] Consider a machine running at 10^8 instructions per second on
some workload with the following mix: 50% ALU instructions, 20% load
instructions, 10% store instructions, and 20% branch instructions. Suppose the
instruction cache miss rate is 1%, the write-back data cache miss rate is 5%,
and the cache line size is 32 bytes. Assume that a store miss requires two
cache line transfers, one to load the newly updated line and one to replace
the dirty line at a later point in time. If the machine provides a 250 MB/s
bus, how many processors can it accommodate at peak bus bandwidth?
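
The two latency formulas stated in Problem 2 can be sketched in Python as
shown here. This is only an illustration of how to evaluate the formulas for a
given hop count; the function names are hypothetical, and the hop count H=3
used in the example is an arbitrary choice, not the minimum, maximum, or
average for the 8x8 grid. Time is kept in nanoseconds, so W=3.2 GB/s (with
GB = 10^9 bytes) becomes 3.2 bytes/ns.

```python
# Illustrative helpers (hypothetical names) for the Problem 2 formulas.
# Units: bytes for n, bytes per nanosecond for W, nanoseconds for R.

def store_and_forward_latency(n, H, W, R):
    """Latency (H+2)n/W + (H+1)R for a path with H links."""
    return (H + 2) * n / W + (H + 1) * R


def cut_through_latency(n, H, W, R):
    """Best-case latency n/W + (H+1)R for a path with H links."""
    return n / W + (H + 1) * R


if __name__ == "__main__":
    W = 3.2   # bytes per ns, i.e. 3.2 GB/s with GB = 10^9 bytes
    R = 10.0  # routing delay per router, in ns
    H = 3     # arbitrary example hop count (assumption for illustration)
    n = 64    # message size in bytes

    sf = store_and_forward_latency(n, H, W, R)
    ct = cut_through_latency(n, H, W, R)
    print(f"store-and-forward: {sf:.1f} ns")  # store-and-forward: 140.0 ns
    print(f"cut-through:       {ct:.1f} ns")  # cut-through:       60.0 ns
```

Determining which values of H occur on the 8x8 grid, and how often, is left
as part of the problem.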