Module 10: "Design of Shared Memory Multiprocessors"
  Lecture 20: "Performance of Coherence Protocols"
 

Protocol optimizations

  • MSI vs. MESI
    • Need to measure bus bandwidth consumption with and without the E state, because the E state saves the S to M BusUpgr transactions that would otherwise be needed
    • Turns out that the E state is not very helpful
    • The main reason is that the E to M transition is rare; normally some other processor also reads the line before the write takes place (if the write happens at all), leaving the line in S rather than E (see the counting sketch after this list)
  • How important is BusUpgr?
    • Again need to look at bus bandwidth consumption with BusUpgr and with BusUpgr replaced by BusRdX
    • Turns out that BusUpgr helps: unlike BusRdX, it does not put the line's data on the bus, since the requester already holds a valid copy in S state
  • Smaller caches demand more bus bandwidth
    • Especially when the primary working set does not fit in cache
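
  To make the E-state argument concrete, here is a minimal counting sketch in C (hypothetical code, not from the lecture): it replays a short trace of reads and writes to a single cache line and counts bus transactions under MSI and under MESI. The trace contents, the 8-processor limit, and the transaction accounting are illustrative assumptions; real evaluations replay full application traces.

    /* Count bus transactions for one cache line under MSI vs. MESI.
     * Invalidation-based protocol with BusRd / BusRdX / BusUpgr as in
     * the lecture; everything else is simplified. */
    #include <stdio.h>

    enum state { I, S, E, M };
    struct access { int proc; char op; };   /* op is 'R' or 'W' */

    static int count_bus(const struct access *trace, int n, int nproc, int mesi)
    {
        enum state st[8] = { I };           /* all caches start invalid */
        int bus = 0;
        for (int i = 0; i < n; i++) {
            int p = trace[i].proc, others = 0;
            for (int q = 0; q < nproc; q++)
                if (q != p && st[q] != I) others = 1;
            if (trace[i].op == 'R') {
                if (st[p] == I) {           /* read miss: BusRd */
                    bus++;
                    for (int q = 0; q < nproc; q++)
                        if (q != p && st[q] != I) st[q] = S;  /* demote E/M */
                    st[p] = (mesi && !others) ? E : S;
                }
            } else {
                if (st[p] == M) continue;                /* write hit in M */
                if (st[p] == E) { st[p] = M; continue; } /* silent E->M: no bus */
                bus++;                      /* BusUpgr from S, BusRdX from I */
                for (int q = 0; q < nproc; q++)
                    if (q != p) st[q] = I;  /* invalidate all other copies */
                st[p] = M;
            }
        }
        return bus;
    }

    int main(void)
    {
        /* The common case from the lecture: another processor reads the
         * line before the write, so E never pays off (MSI = MESI = 3). */
        struct access shared_first[] = { {0,'R'}, {1,'R'}, {0,'W'} };
        /* Private read-then-write: the one pattern where E saves a
         * BusUpgr (MSI = 2, MESI = 1). */
        struct access private_rw[]  = { {0,'R'}, {0,'W'} };

        printf("read-shared then write: MSI=%d MESI=%d\n",
               count_bus(shared_first, 3, 2, 0), count_bus(shared_first, 3, 2, 1));
        printf("private read then write: MSI=%d MESI=%d\n",
               count_bus(private_rw, 2, 2, 0), count_bus(private_rw, 2, 2, 1));
        return 0;
    }

  Both protocols issue the same three transactions on the read-shared trace; only the private read-then-write trace lets MESI skip a BusUpgr, which matches the observation above.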

Cache size

  • With increasing problem size normally working set size also increases
    • More pressure on cache
  • With increasing number of processors working set per processor goes down
    • Less pressure on cache
    • This effect sometimes leads to superlinear speedup, i.e., on P processors the speedup exceeds P (a worked example follows this list)
  • Important to design the parallel program so that the critical working sets fit in cache
    • Otherwise bus bandwidth requirement may increase dramatically
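
  A back-of-envelope model makes the superlinear effect concrete. All numbers below are illustrative assumptions, not measurements: the model charges a fixed miss penalty per access and lets the miss rate drop once the per-processor working set fits in cache.

    % Time per operation at miss rate m, hit time t_c, miss penalty t_m:
    \[ T(m) = t_c + m\,t_m \qquad \mathrm{Speedup}(P) = P \cdot \frac{T(m_1)}{T(m_P)} \]
    % Assume the working set misses in one cache (m_1 = 0.05) but fits once
    % split across P caches (m_P = 0.01).  With t_c = 1 ns and t_m = 100 ns:
    %   T(m_1) = 1 + 0.05 * 100 = 6 ns,   T(m_P) = 1 + 0.01 * 100 = 2 ns,
    % so Speedup(P) = P * 6/2 = 3P, i.e., more than P.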

Cache line size

  • Uniprocessors have three C misses: cold, capacity, conflict
  • Multiprocessors add two new types
    • True sharing miss: inherent in the algorithm, e.g., P0 writes x and P1 later reads x, so P1 suffers a true sharing miss on that read
    • False sharing miss: artifactual miss due to the cache line size, e.g., P0 writes x and P1 reads y, where x and y happen to fall in the same cache line (see the C sketch after this list)
  • True and false sharing misses together form the communication (or coherence) misses in multiprocessors, making it four C misses in total
  • Technology is pushing for large cache line sizes, but…
  • Increasing cache line size helps reduce
    • Cold misses if there is spatial locality
    • True sharing misses if the algorithm is properly structured to exploit spatial locality
  • Increasing cache line size
    • Reduces the number of sets in a fixed-sized cache and may lead to more conflict misses
    • May increase the volume of false sharing
    • May increase miss penalty depending on the bus transfer algorithm (need to transfer more data per miss)
    • May fetch unnecessary data and waste bandwidth
  • Note that true sharing misses will exist even with an infinite cache
  • Impact of cache line size on true sharing heavily depends on application characteristics
    • Blocked matrix computations tend to have good spatial locality with shared data because they access data in small blocks, thereby exploiting temporal as well as spatial locality
    • Nearest neighbor computations tend to have little spatial locality when accessing the left and right border elements of a partition, since a column of elements is not contiguous in row-major storage
  • The exact proportion of various types of misses in an application normally changes with cache size, problem size and the number of processors
    • With a small cache, capacity misses may dominate everything else
    • With a large cache, true sharing misses may account for most of the traffic
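
  The false sharing case is easy to reproduce. The sketch below is hypothetical code, not from the lecture: two threads each increment their own counter, but in the default layout both counters fall in one cache line, so every write by one thread invalidates the other's cached copy. Padding each counter to a full line (64 bytes is an assumption; check your machine) removes the artifactual misses.

    /* False sharing demo: compile with  cc -O2 -pthread fs.c  and again
     * with  -DPADDED  to give each counter its own cache line, then
     * time the two runs. */
    #include <pthread.h>
    #include <stdio.h>

    #define LINE  64
    #define ITERS 100000000L

    #ifdef PADDED
    struct padded { _Alignas(LINE) volatile long v;
                    char pad[LINE - sizeof(long)]; };
    static struct padded counters[2];   /* one cache line per counter */
    #define VAL(i) (counters[i].v)
    #else
    static volatile long counters[2];   /* both counters share one line */
    #define VAL(i) (counters[i])
    #endif

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            VAL(id)++;   /* logically private data, physically shared line */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", (long)VAL(0), (long)VAL(1));
        return 0;
    }

  On a typical multicore the padded version runs noticeably faster even though neither thread ever touches the other's counter; the difference is purely coherence traffic on the shared line.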

Impact on bus traffic

  • When the cache line size is increased, it may seem that we bring in more useful data at a time and get better spatial locality and reuse
    • This should reduce bus traffic per unit of computation
  • However, bus traffic normally increases monotonically with cache line size
    • Unless there is enough spatial and temporal locality to exploit, bus traffic will increase
    • For most applications the bus bandwidth requirement attains its minimum at a line size larger than the smallest one; at very small line sizes the fixed per-transaction overhead of communication becomes too high relative to the data moved (a simple model follows this list)
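
  A simple model captures why the minimum sits at an intermediate line size. The notation below (h for fixed per-transaction overhead, B for line size, M(B) for miss rate) is mine, not the lecture's.

    % Each bus transaction moves h bytes of fixed overhead (address,
    % command) plus one line of B bytes, so traffic per unit of
    % computation is
    \[ \mathrm{Traffic}(B) = M(B)\,(h + B) \]
    % While spatial locality holds, doubling B roughly halves M(B): the
    % data term M(B)B stays flat and the overhead term M(B)h shrinks, so
    % traffic falls.  Once locality is exhausted, M(B) stops falling and
    % traffic grows linearly in B.  The minimum lies between the two
    % regimes.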