Module 10: "Design of Shared Memory Multiprocessors"
  Lecture 20: "Performance of Coherence Protocols"
 

Protocol optimizations

  • MSI vs. MESI
    • Need to measure bus bandwidth consumption with and without the E state, because the E state saves the S to M BusUpgr transactions that would otherwise be needed
    • Turns out that the E state is not very helpful
    • The main reason is that the E to M transition is rare; normally some other processor also reads the line before the write takes place (if the write happens at all), leaving the line in S rather than E (see the counting sketch after this list)
  • How important is BusUpgr?
    • Again need to look at bus bandwidth consumption with BusUpgr and with BusUpgr replaced by BusRdX
    • Turns out that BusUpgr helps: unlike BusRdX, it does not put the line's data on the bus, since the requester already holds a valid copy in S state
  • Smaller caches demand more bus bandwidth
    • Especially when the primary working set does not fit in cache
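
  To make the E-state argument concrete, here is a minimal counting sketch in C (hypothetical code, not from the lecture): it replays a short trace of reads and writes to a single cache line and counts bus transactions under MSI and under MESI. The trace contents, the 8-processor limit, and the transaction accounting are illustrative assumptions; real evaluations replay full application traces.

    /* Count bus transactions for one cache line under MSI vs. MESI.
     * Invalidation-based protocol with BusRd / BusRdX / BusUpgr as in
     * the lecture; everything else is simplified. */
    #include <stdio.h>

    enum state { I, S, E, M };
    struct access { int proc; char op; };   /* op is 'R' or 'W' */

    static int count_bus(const struct access *trace, int n, int nproc, int mesi)
    {
        enum state st[8] = { I };           /* all caches start invalid */
        int bus = 0;
        for (int i = 0; i < n; i++) {
            int p = trace[i].proc, others = 0;
            for (int q = 0; q < nproc; q++)
                if (q != p && st[q] != I) others = 1;
            if (trace[i].op == 'R') {
                if (st[p] == I) {           /* read miss: BusRd */
                    bus++;
                    for (int q = 0; q < nproc; q++)
                        if (q != p && st[q] != I) st[q] = S;  /* demote E/M */
                    st[p] = (mesi && !others) ? E : S;
                }
            } else {
                if (st[p] == M) continue;                /* write hit in M */
                if (st[p] == E) { st[p] = M; continue; } /* silent E->M: no bus */
                bus++;                      /* BusUpgr from S, BusRdX from I */
                for (int q = 0; q < nproc; q++)
                    if (q != p) st[q] = I;  /* invalidate all other copies */
                st[p] = M;
            }
        }
        return bus;
    }

    int main(void)
    {
        /* The common case from the lecture: another processor reads the
         * line before the write, so E never pays off (MSI = MESI = 3). */
        struct access shared_first[] = { {0,'R'}, {1,'R'}, {0,'W'} };
        /* Private read-then-write: the one pattern where E saves a
         * BusUpgr (MSI = 2, MESI = 1). */
        struct access private_rw[]  = { {0,'R'}, {0,'W'} };

        printf("read-shared then write: MSI=%d MESI=%d\n",
               count_bus(shared_first, 3, 2, 0), count_bus(shared_first, 3, 2, 1));
        printf("private read then write: MSI=%d MESI=%d\n",
               count_bus(private_rw, 2, 2, 0), count_bus(private_rw, 2, 2, 1));
        return 0;
    }

  Both protocols issue the same three transactions on the read-shared trace; only the private read-then-write trace lets MESI skip a BusUpgr, which matches the observation above.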

Cache size

  • With increasing problem size normally working set size also increases
    • More pressure on cache
  • With increasing number of processors working set per processor goes down
    • Less pressure on cache
    • This effect sometimes leads to superlinear speedup, i.e., on P processors the speedup exceeds P (a worked example follows this list)
  • Important to design the parallel program so that the critical working sets fit in cache
    • Otherwise bus bandwidth requirement may increase dramatically
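
  A back-of-envelope model makes the superlinear effect concrete. All numbers below are illustrative assumptions, not measurements: the model charges a fixed miss penalty per access and lets the miss rate drop once the per-processor working set fits in cache.

    % Time per operation at miss rate m, hit time t_c, miss penalty t_m:
    \[ T(m) = t_c + m\,t_m \qquad \mathrm{Speedup}(P) = P \cdot \frac{T(m_1)}{T(m_P)} \]
    % Assume the working set misses in one cache (m_1 = 0.05) but fits once
    % split across P caches (m_P = 0.01).  With t_c = 1 ns and t_m = 100 ns:
    %   T(m_1) = 1 + 0.05 * 100 = 6 ns,   T(m_P) = 1 + 0.01 * 100 = 2 ns,
    % so Speedup(P) = P * 6/2 = 3P, i.e., more than P.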

Cache line size

  • Uniprocessors have three C misses: cold, capacity, conflict
  • Multiprocessors add two new types
    • True sharing miss: inherent in the algorithm, e.g., P0 writes x and P1 later reads x, so P1 suffers a true sharing miss on that read
    • False sharing miss: artifactual miss due to the cache line size, e.g., P0 writes x and P1 reads y, where x and y happen to fall in the same cache line (see the C sketch after this list)
  • True and false sharing misses together form the communication (or coherence) misses in multiprocessors, making it four C misses in total
  • Technology is pushing for large cache line sizes, but…
  • Increasing cache line size helps reduce
    • Cold misses if there is spatial locality
    • True sharing misses if the algorithm is properly structured to exploit spatial locality
  • Increasing cache line size
    • Reduces the number of sets in a fixed-sized cache and may lead to more conflict misses
    • May increase the volume of false sharing
    • May increase miss penalty depending on the bus transfer algorithm (need to transfer more data per miss)
    • May fetch unnecessary data and waste bandwidth
  • Note that true sharing misses will exist even with an infinite cache
  • Impact of cache line size on true sharing heavily depends on application characteristics
    • Blocked matrix computations tend to have good spatial locality with shared data because they access data in small blocks, thereby exploiting temporal as well as spatial locality
    • Nearest neighbor computations tend to have little spatial locality when accessing the left and right border elements of a partition, since a column of elements is not contiguous in row-major storage
  • The exact proportion of various types of misses in an application normally changes with cache size, problem size and the number of processors
    • With a small cache, capacity misses may dominate everything else
    • With a large cache, true sharing misses may account for most of the traffic
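
  The false sharing case is easy to reproduce. The sketch below is hypothetical code, not from the lecture: two threads each increment their own counter, but in the default layout both counters fall in one cache line, so every write by one thread invalidates the other's cached copy. Padding each counter to a full line (64 bytes is an assumption; check your machine) removes the artifactual misses.

    /* False sharing demo: compile with  cc -O2 -pthread fs.c  and again
     * with  -DPADDED  to give each counter its own cache line, then
     * time the two runs. */
    #include <pthread.h>
    #include <stdio.h>

    #define LINE  64
    #define ITERS 100000000L

    #ifdef PADDED
    struct padded { _Alignas(LINE) volatile long v;
                    char pad[LINE - sizeof(long)]; };
    static struct padded counters[2];   /* one cache line per counter */
    #define VAL(i) (counters[i].v)
    #else
    static volatile long counters[2];   /* both counters share one line */
    #define VAL(i) (counters[i])
    #endif

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            VAL(id)++;   /* logically private data, physically shared line */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", (long)VAL(0), (long)VAL(1));
        return 0;
    }

  On a typical multicore the padded version runs noticeably faster even though neither thread ever touches the other's counter; the difference is purely coherence traffic on the shared line.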

Impact on bus traffic

  • When the cache line size is increased, it may seem that we bring in more useful data at a time and get better spatial locality and reuse
    • This should reduce bus traffic per unit of computation
  • However, bus traffic normally increases monotonically with cache line size
    • Unless there is enough spatial and temporal locality to exploit, bus traffic will increase
    • For most applications the bus bandwidth requirement attains its minimum at a line size larger than the smallest one; at very small line sizes the fixed per-transaction overhead of communication becomes too high relative to the data moved (a simple model follows this list)
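
  A simple model captures why the minimum sits at an intermediate line size. The notation below (h for fixed per-transaction overhead, B for line size, M(B) for miss rate) is mine, not the lecture's.

    % Each bus transaction moves h bytes of fixed overhead (address,
    % command) plus one line of B bytes, so traffic per unit of
    % computation is
    \[ \mathrm{Traffic}(B) = M(B)\,(h + B) \]
    % While spatial locality holds, doubling B roughly halves M(B): the
    % data term M(B)B stays flat and the overhead term M(B)h shrinks, so
    % traffic falls.  Once locality is exhausted, M(B) stops falling and
    % traffic grows linearly in B.  The minimum lies between the two
    % regimes.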