Solutions to Exercise 1

1. [10 points] Suppose you are given a program that does a fixed amount of work, and some fraction s of that work must be done sequentially. The remaining portion of the work is perfectly parallelizable on P processors. Derive a formula for execution time on P processors and establish an upper bound on the achievable speedup.

Solution: Execution time on P processors is T(P) = sT(1) + (1-s)T(1)/P. Speedup = T(1)/T(P) = 1/(s + (1-s)/P). The speedup increases with P and approaches its upper bound as P tends to infinity, so the maximum speedup is 1/s. As expected, the upper bound on achievable speedup is inversely proportional to the sequential fraction.
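The two formulas can be checked numerically with a small sketch. The function names and the sample value s = 0.1 are illustrative assumptions, not part of the exercise.

```python
def exec_time(t1, s, p):
    """Execution time on p processors: T(P) = s*T(1) + (1-s)*T(1)/P."""
    return s * t1 + (1 - s) * t1 / p


def speedup(s, p):
    """Speedup T(1)/T(P) = 1 / (s + (1-s)/P)."""
    return 1.0 / (s + (1.0 - s) / p)


if __name__ == "__main__":
    s = 0.1  # hypothetical sequential fraction, chosen only for illustration
    for p in (1, 2, 8, 64, 1024, 10**6):
        print(f"P={p:>8}  speedup={speedup(s, p):.4f}")
    print("upper bound 1/s =", 1 / s)
```

For any fixed s, the printed speedup grows with P but stays below 1/s.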

2. [40 points] Suppose you want to transfer n bytes from a source node S to a destination node D, and there are H links between S and D. Notice that there are therefore H+1 routers on the path (including the ones in S and D). Suppose W is the node-to-network bandwidth at each router. At S you require n/W time to copy the message into the router buffer. Similarly, to copy the message from the buffer of the router in S to the buffer of the next router on the path, you require another n/W time. Assuming a store-and-forward protocol, the total time spent on these copy operations is (H+2)n/W, and the data ends up in some memory buffer in D. On top of this, at each router we spend R amount of time to figure out the exit port. So the total time taken to transfer n bytes from S to D with a store-and-forward protocol is (H+2)n/W + (H+1)R. On the other hand, if you assume a cut-through protocol, the critical path is just n/W + (H+1)R. Here we assume the best possible scenario, where the header routing delay at each node is exposed and only the startup n/W delay at S is exposed; the rest is pipelined. Now suppose that you are asked to compare the performance of these two routing protocols on an 8x8 grid. Compute the maximum, minimum, and average latency to transfer an n-byte message in this topology for both protocols. Assume the following values: W = 3.2 GB/s and R = 10 ns. Compute for n = 64 and n = 256. Note that for each protocol you will have three answers (maximum, minimum, average) for each value of n. Here GB means 10^9 bytes, not 2^30 bytes.

Solution: The basic problem is to compute the maximum, minimum, and average values of H; the rest is just substituting the parameter values into the two formulas. The maximum value of H is 14 (opposite corners of the grid) while the minimum is 1 (adjacent nodes, assuming S and D are distinct). To compute the average, consider all possible messages, compute H for each, and take the average. Let S = (x0, y0) and D = (x1, y1), so H = |x0-x1| + |y0-y1|. There are 64*63 ordered pairs of distinct nodes, and pairs with S = D contribute zero to the sum, so average H = (sum over all x0, x1, y0, y1 of (|x0-x1| + |y0-y1|))/(64*63), where each of x0, x1, y0, y1 varies from 0 to 7. Each value of |x0-x1| is counted once for each of the 64 choices of (y0, y1), and likewise for |y0-y1|, so this equals (sum over x0, x1 of |x0-x1| + sum over y0, y1 of |y0-y1|)/63. By symmetry this is 2*(sum over x0, x1 of |x0-x1|)/63 = 2*(sum over x0=0 to 7, x1=0 to x0 of (x0-x1) + sum over x0=0 to 7, x1=x0+1 to 7 of (x1-x0))/63 = 2*168/63 = 16/3.
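The substitution step can be carried out with a short sketch that enumerates all ordered pairs of distinct nodes on the 8x8 grid and plugs the resulting hop counts into the two latency formulas from the problem statement. The function and constant names are illustrative; exact rational arithmetic is used so the averages are not affected by floating-point rounding. Because both formulas are linear in H, the average latency equals the latency evaluated at the average H.

```python
from fractions import Fraction
from itertools import product

K = 8                    # grid is K x K
W = Fraction(32, 10)     # bandwidth in bytes per ns (3.2 GB/s, GB = 10^9 bytes)
R = 10                   # routing delay per router in ns

# Hop counts (Manhattan distance) over all ordered pairs of distinct nodes.
nodes = list(product(range(K), repeat=2))
hops = [abs(x0 - x1) + abs(y0 - y1)
        for (x0, y0), (x1, y1) in product(nodes, repeat=2)
        if (x0, y0) != (x1, y1)]

H_MIN, H_MAX = min(hops), max(hops)
H_AVG = Fraction(sum(hops), len(hops))


def store_and_forward(n, h):
    """Latency in ns: (H+2)*n/W + (H+1)*R."""
    return (h + 2) * n / W + (h + 1) * R


def cut_through(n, h):
    """Latency in ns: n/W + (H+1)*R."""
    return n / W + (h + 1) * R


if __name__ == "__main__":
    print("H: max", H_MAX, "min", H_MIN, "avg", H_AVG)
    for n in (64, 256):
        for name, f in (("store-and-forward", store_and_forward),
                        ("cut-through", cut_through)):
            mx, mn, av = (float(f(n, h)) for h in (H_MAX, H_MIN, H_AVG))
            print(f"n={n:3d} {name:18s} max={mx:.1f} min={mn:.1f} avg={av:.1f} ns")
```

Under these formulas the (maximum, minimum, average) latencies in ns come out as: store-and-forward, n=64: 470, 80, 210; store-and-forward, n=256: 1430, 260, 650; cut-through, n=64: 170, 40, about 83.3; cut-through, n=256: 230, 100, about 143.3.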

3. [20 points] Consider a simple computation on an nxn double matrix (each element is 8 bytes) where each element A[i][j] is modified as follows. A[i][j] = A[i][j] - (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1])/4. Suppose you assign one matrix element to one processor (i.e. you have n^2 processors). Compute the total amount of data communication between processors.

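Solution: Each processor requires the values held by its four neighbors, i.e. 4*8 = 32 bytes. So the total amount of data communicated is 32n^2 bytes. (This treats every element as having four neighbors; elements on the matrix boundary have fewer, so if the boundary is not wrapped around, the exact count is 32n(n-1) bytes.)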
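The count can be verified by brute force. The sketch below is illustrative only: the exercise does not say how boundary elements are handled, so the `wraparound` flag is an assumption introduced here. With wraparound every element has four neighbors and the total is exactly 32n^2 bytes; without it, boundary elements receive fewer values and the total is 32n(n-1) bytes.

```python
ELEMENT_BYTES = 8  # one double per matrix element


def bytes_communicated(n, wraparound=False):
    """Total bytes received by all n*n processors from their grid neighbors.

    wraparound is a hypothetical option (not specified in the exercise):
    if True, indices wrap at the edges so every element has four neighbors.
    """
    total = 0
    for i in range(n):
        for j in range(n):
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if wraparound or (0 <= ni < n and 0 <= nj < n):
                    total += ELEMENT_BYTES
    return total


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, bytes_communicated(n, wraparound=True), bytes_communicated(n))
```

For large n the two counts differ only by the lower-order term 32n, so 32n^2 is the dominant term either way.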

4. [30 points] Consider a machine running at 10^8 instructions per second on some workload with the following mix: 50% ALU instructions, 20% load instructions, 10% store instructions, and 20% branch instructions. Suppose the instruction cache miss rate is 1%, the writeback data cache miss rate is 5%, and the cache line size is 32 bytes. Assume that a store miss requires two cache line transfers, one to load the newly updated line and one to replace the dirty line at a later point in time. If the machine provides a 250 MB/s bus, how many processors can it accommodate at peak bus bandwidth?

Solution: Let us compute the bus bandwidth required by one processor. Of the 10^8 instructions fetched per second, 1% miss in the instruction cache, giving 10^6 misses that transfer 32 bytes each. Of the 20*10^6 loads, 5% (10^6) miss in the data cache, transferring 32 bytes each. Of the 10^7 stores, 5% (5*10^5) miss in the data cache, transferring 64 bytes each. Each of the three sources thus contributes 32*10^6 bytes, so the total amount of data transferred per second is 96*10^6 bytes, i.e. 96 MB/s. Since 250/96 is about 2.6, at most two processors can be supported on a 250 MB/s bus.
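The arithmetic can be reproduced with a short sketch. The constant and function names are illustrative, and MB is assumed to mean 10^6 bytes, in line with the GB = 10^9 convention of problem 2. Integer arithmetic is used so the counts are exact.

```python
IPS = 10**8              # instructions per second
LINE_BYTES = 32          # cache line size
BUS_BYTES = 250 * 10**6  # bus bandwidth, assuming MB = 10^6 bytes


def bus_traffic_per_processor():
    """Bytes per second one processor places on the bus."""
    ifetch_misses = IPS // 100           # 1% instruction cache miss rate
    load_misses = (IPS * 20 // 100) * 5 // 100   # 20% loads, 5% miss
    store_misses = (IPS * 10 // 100) * 5 // 100  # 10% stores, 5% miss
    return (ifetch_misses * LINE_BYTES
            + load_misses * LINE_BYTES
            + store_misses * 2 * LINE_BYTES)  # store miss: two line transfers


def max_processors():
    """Largest whole number of processors the bus can sustain."""
    return BUS_BYTES // bus_traffic_per_processor()


if __name__ == "__main__":
    print("bytes/s per processor:", bus_traffic_per_processor())
    print("max processors:", max_processors())
```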