Exercise 1

These problems should be tried after module 08 is completed.

1. [10 points] Suppose you are given a program that does a fixed amount of work, and some fraction s of that work must be done sequentially. The remaining portion of the work is perfectly parallelizable on P processors. Derive a formula for execution time on P processors and establish an upper bound on the achievable speedup.

2. [40 points] Suppose you want to transfer n bytes from a source node S to a destination node D, and there are H links between S and D. Notice that there are therefore H+1 routers on the path (including the ones in S and D). Suppose W is the node-to-network bandwidth at each router. At S you require n/W time to copy the message into the router buffer. Similarly, copying the message from the buffer of the router in S to the buffer of the next router on the path requires another n/W time. Assuming a store-and-forward protocol, the total time spent on these copy operations is (H+2)n/W, and the data ends up in a memory buffer in D. On top of this, each router spends R amount of time determining the exit port. So the total time taken to transfer n bytes from S to D under a store-and-forward protocol is (H+2)n/W + (H+1)R. With a cut-through protocol, on the other hand, the critical path is just n/W + (H+1)R. Here we assume the best possible scenario, in which the header routing delay at each node is exposed and only the startup n/W delay at S is exposed; the rest is pipelined. Now suppose that you are asked to compare the performance of these two routing protocols on an 8x8 grid. Compute the maximum, minimum, and average latency to transfer an n-byte message in this topology for both protocols. Assume the following values: W = 3.2 GB/s and R = 10 ns. Compute for n = 64 and n = 256. Note that for each protocol you will have three answers (maximum, minimum, average) for each value of n. Here GB means 10^9 bytes, not 2^30 bytes.

3. [20 points] Consider a simple computation on an n x n matrix of doubles (each element is 8 bytes) where each element A[i][j] is modified as follows: A[i][j] = A[i][j] - (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1])/4. Suppose you assign one matrix element to one processor (i.e., you have n^2 processors). Compute the total amount of data communication between processors.

4. [30 points] Consider a machine running at 10^8 instructions per second on some workload with the following mix: 50% ALU instructions, 20% load instructions, 10% store instructions, and 20% branch instructions. Suppose the instruction cache miss rate is 1%, the writeback data cache miss rate is 5%, and the cache line size is 32 bytes. Assume that a store miss requires two cache line transfers, one to load the newly updated line and one to replace the dirty line at a later point in time. If the machine provides a 250 MB/s bus, how many processors can it accommodate at peak bus bandwidth?
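
The two latency formulas stated in problem 2 can be written down directly as functions of the hop count H. The following is a minimal Python sketch, not a solution: the function and parameter names are hypothetical, and it only evaluates the formulas for a single illustrative value of H. Working out which values of H occur on the 8x8 grid, and their average, is left as part of the exercise.

```python
# Sketch of the two latency formulas from problem 2.
# All names here are illustrative assumptions; units are bytes and seconds.

def store_and_forward_latency(n_bytes, hops, bandwidth, routing_delay):
    """Store-and-forward: (H+2)*n/W copy time plus (H+1)*R routing time."""
    return (hops + 2) * n_bytes / bandwidth + (hops + 1) * routing_delay


def cut_through_latency(n_bytes, hops, bandwidth, routing_delay):
    """Cut-through (best case): n/W startup time plus (H+1)*R routing time."""
    return n_bytes / bandwidth + (hops + 1) * routing_delay


W = 3.2e9   # bandwidth in bytes per second (GB taken as 10^9 bytes)
R = 10e-9   # per-router routing delay in seconds

if __name__ == "__main__":
    hops = 1  # a single link, chosen only as an example
    for n in (64, 256):
        sf = store_and_forward_latency(n, hops, W, R)
        ct = cut_through_latency(n, hops, W, R)
        print(f"n={n}: store-and-forward {sf * 1e9:.0f} ns, "
              f"cut-through {ct * 1e9:.0f} ns")
```

For a single link (H = 1) and n = 64, n/W is 20 ns, so the formulas give 3(20) + 2(10) = 80 ns for store-and-forward and 20 + 2(10) = 40 ns for cut-through.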