Protocol optimizations
- MSI vs. MESI
- Need to measure bus bandwidth consumption with and without the E state, because the E state eliminates the BusUpgr transaction that MSI would otherwise issue for the S to M transition
- Turns out that the E state is not very helpful
- The main reason is the E to M transition is rare; normally some other processor also reads the line before the write takes place (if at all)
- How important is BusUpgr?
- Again need to look at bus bandwidth consumption with BusUpgr and with BusUpgr replaced by BusRdX
- Turns out that BusUpgr helps
- Smaller caches demand more bus bandwidth
- Especially when the primary working set does not fit in cache
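The MSI vs. MESI comparison above can be sketched with a toy per-line model that counts bus transactions. This is an illustrative sketch, not a full protocol: one cache line, two processors, no evictions, and the state names follow the notes (BusRd, BusRdX, BusUpgr).

```python
# Toy per-line coherence model counting bus transactions, to show why
# MESI's E state only helps on a private read followed by a write.
# Sketch assumptions: one line, two processors, no evictions.

def bus_transactions(trace, protocol):
    """trace: list of (processor_id, 'r' or 'w'); returns bus transactions."""
    state = {0: 'I', 1: 'I'}          # per-processor state of the single line
    bus = []
    for p, op in trace:
        other = 1 - p
        if op == 'r':
            if state[p] == 'I':
                bus.append('BusRd')
                if protocol == 'MESI' and state[other] == 'I':
                    state[p] = 'E'    # exclusive clean: no other copy exists
                else:
                    state[p] = 'S'
                    if state[other] in ('M', 'E'):
                        state[other] = 'S'   # holder downgrades on snoop
        else:  # write
            if state[p] in ('M', 'E'):
                state[p] = 'M'        # E to M transition is silent in MESI
            elif state[p] == 'S':
                bus.append('BusUpgr') # invalidate the other copy
                state[p], state[other] = 'M', 'I'
            else:
                bus.append('BusRdX')
                state[p], state[other] = 'M', 'I'
    return bus

private = [(0, 'r'), (0, 'w')]              # no intervening reader
shared  = [(0, 'r'), (1, 'r'), (0, 'w')]    # P1 reads before P0 writes

print(bus_transactions(private, 'MSI'))     # ['BusRd', 'BusUpgr']
print(bus_transactions(private, 'MESI'))    # ['BusRd'] -- E saved the BusUpgr
print(bus_transactions(shared, 'MESI'))     # ['BusRd', 'BusRd', 'BusUpgr']
```

The last trace shows why the E state is rarely helpful in practice: as soon as another processor reads the line before the write, the line is in S and MESI issues the same BusUpgr as MSI.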
Cache size
- With increasing problem size normally working set size also increases
- With increasing number of processors working set per processor goes down
- Less pressure on cache
- This effect sometimes leads to superlinear speedup, i.e., on P processors the speedup exceeds P
- Important to design the parallel program so that the critical working sets fit in cache
- Otherwise bus bandwidth requirement may increase dramatically
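The superlinear speedup effect can be seen in a toy timing model: once the per-processor working set fits in cache, the per-access cost drops, so the speedup jumps past P. All the constants below (cache size, hit/miss costs, all-or-nothing miss behavior) are illustrative assumptions.

```python
# Toy model of superlinear speedup from per-processor working sets
# shrinking below the cache size. Constants are assumptions, not data.

def exec_time(work, procs, cache, t_hit=1, t_miss=10):
    per_proc = work / procs                        # working set per processor
    cost = t_hit if per_proc <= cache else t_miss  # crude all-or-nothing model
    return per_proc * cost

W, C = 1 << 20, 1 << 18      # 1M-element working set, 256K-element cache
t1 = exec_time(W, 1, C)
for p in (2, 4, 8):
    print(p, t1 / exec_time(W, p, C))
# Speedup is 2 at P = 2 (still misses), but jumps to 40 at P = 4:
# the per-processor working set now fits in cache.
```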
Cache line size
- Uniprocessors have the three C's of misses: cold (compulsory), capacity, and conflict
- Multiprocessors add two new types
- True sharing miss: inherent in the algorithm e.g., P0 writes x and P1 uses x, so P1 will suffer from a true sharing miss when it reads x
- False sharing miss: artifactual miss due to cache line size e.g. P0 writes x and P1 reads y, but x and y belong to the same cache line
- True and false sharing misses together form the communication or coherence misses in multiprocessors, making it the four C's of misses
- Technology is pushing for large cache line sizes, but…
- Increasing cache line size helps reduce
- Cold misses if there is spatial locality
- True sharing misses if the algorithm is properly structured to exploit spatial locality
- Increasing cache line size
- Reduces the number of sets in a fixed-sized cache and may lead to more conflict misses
- May increase the volume of false sharing
- May increase miss penalty depending on the bus transfer algorithm (need to transfer more data per miss)
- May fetch unnecessary data and waste bandwidth
- Note that true sharing misses will exist even with an infinite cache
- Impact of cache line size on true sharing heavily depends on application characteristics
- Blocked matrix computations tend to have good spatial locality with shared data because they access data in small blocks thereby exploiting temporal as well as spatial locality
- Nearest neighbor computations tend to have little spatial locality when accessing left and right border elements
- The exact proportion of various types of misses in an application normally changes with cache size, problem size and the number of processors
- With small cache, capacity miss may dominate everything else
- With large cache, true sharing misses may cause the major traffic
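The false sharing effect described above can be demonstrated with a small line-granularity miss counter. This is a hedged sketch: two processors, invalidation-based coherence, infinite caches, and illustrative byte addresses; two processors ping-pong writes to two different words.

```python
# Sketch: count misses at cache line granularity to expose false sharing.
# Assumptions: two processors, invalidation on write, infinite caches.

def misses(trace, line_size):
    """trace: list of (proc, byte_addr, 'r' or 'w')."""
    valid = {}   # (proc, line) -> True if that processor holds a valid copy
    count = 0
    for p, addr, op in trace:
        line = addr // line_size
        if not valid.get((p, line), False):
            count += 1                  # cold or coherence miss
            valid[(p, line)] = True
        if op == 'w':
            valid[(1 - p, line)] = False  # write invalidates the other copy
    return count

# P0 repeatedly writes the word at byte 0, P1 the word at byte 4
trace = [(0, 0, 'w'), (1, 4, 'w')] * 4

print(misses(trace, line_size=8))   # words share one line: all 8 accesses miss
print(misses(trace, line_size=4))   # words in separate lines: 2 cold misses
```

With 8-byte lines the two words falsely share a line and every access misses as the line ping-pongs between the caches; with 4-byte lines only the two cold misses remain, even though the algorithm's sharing pattern is unchanged.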
Impact on bus traffic
- When cache line size is increased it may seem that we bring in more data together and have better spatial locality and reuse
- Should reduce bus traffic per unit computation
- However, bus traffic normally increases monotonically with cache line size
- Unless we have enough spatial and temporal locality to exploit, bus traffic will increase
- For most applications the bus bandwidth requirement attains its minimum at a line size larger than the smallest possible size; at very small line sizes the fixed per-transaction overhead (address, command, arbitration) is paid too often relative to the data transferred
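The minimum can be seen in a simple traffic model. Assume, purely for illustration, that the application has spatial locality only over the first L contiguous bytes of a line, and that each bus transaction carries a fixed overhead H on top of the line data.

```python
# Toy model of bus traffic vs. cache line size. Assumptions: spatial
# locality extends over L bytes, each transaction adds H bytes of
# overhead (address + command). All constants are illustrative.

N = 1 << 20   # bytes the program actually touches
L = 64        # extent of spatial locality, in bytes
H = 8         # fixed overhead per bus transaction

def traffic(line_size):
    useful = min(line_size, L)           # bytes actually used per line
    num_misses = N // useful
    return num_misses * (line_size + H)  # total bytes moved on the bus

for b in (4, 8, 16, 32, 64, 128, 256):
    print(b, traffic(b))
# Traffic falls as the line size grows toward L (overhead amortized over
# more useful data), then rises past L (unused data is fetched): the
# minimum sits at an intermediate line size.
```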