Module 10: "Design of Shared Memory Multiprocessors"
  Lecture 20: "Performance of Coherence Protocols"
 

Large cache line

  • Large cache lines are intended to amortize the DRAM access and bus transfer latency over a large number of data points
  • But false sharing becomes a problem
  • Hardware solutions
    • Coherence at subblock level: divide the cache line into smaller blocks and maintain coherence for each of them; subblock invalidation on a write reduces chances of coherence misses even in the presence of false sharing
    • Delay invalidations: send invalidations only after the writer has completed several writes; but this directly impacts the write propagation model and hence leads to consistency models weaker than SC
    • Use update-based protocols instead of invalidation-based: probably not a good idea
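The effect of subblock coherence on false sharing can be illustrated with a small simulation. This is a sketch, not lecture material: the trace, the block sizes, and the `coherence_misses` helper are all illustrative. Two processors repeatedly write disjoint words of the same 64-byte line; with whole-line invalidation each write invalidates the other processor's copy, while with 16-byte subblocks the writes land in different subblocks and no coherence misses occur.

```python
def coherence_misses(refs, grain):
    """Count coherence misses for an access trace under a given coherence grain.

    refs  : list of (cpu, address, is_write) tuples
    grain : coherence unit in bytes (whole line vs. subblock)
    """
    holders = {}   # block id -> set of CPUs holding a valid copy
    seen = set()   # (block, cpu) pairs already touched, to skip cold misses
    misses = 0
    for cpu, addr, is_write in refs:
        blk = addr // grain
        h = holders.setdefault(blk, set())
        if cpu not in h and (blk, cpu) in seen:
            misses += 1            # had the block before, lost it to an invalidation
        h.add(cpu)
        seen.add((blk, cpu))
        if is_write:
            holders[blk] = {cpu}   # writer invalidates every other copy
    return misses

# False sharing: CPU 0 writes byte 0, CPU 1 writes byte 32 of the same line
refs = []
for _ in range(10):
    refs.append((0, 0, True))
    refs.append((1, 32, True))

line_misses = coherence_misses(refs, 64)   # coherence at whole-line granularity
sub_misses = coherence_misses(refs, 16)    # coherence at 16-byte subblock granularity
```

Here `line_misses` is 18 (every access after the first round misses) while `sub_misses` is 0: the two words fall into different subblocks, so the false sharing disappears.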

Performance of update protocol

  • Already discussed main trade-offs
  • Consider running a sequential program on an SMP with update protocol
    • If the kernel decides to migrate the process to a different processor, subsequent updates will keep going to caches that are never read again: the “pack-rat” phenomenon
  • Possible designs that combine update and invalidation-based protocols
    • For each page, decide which protocol to run and make the choice part of the address translation (i.e. hold it in the TLB entry)
    • Alternatively, dynamically detect for each cache line which protocol performs better
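The page-granularity option can be sketched as a lookup table keyed by virtual page number, as it would be cached alongside the translation in a TLB entry. Everything here is illustrative: the table name, the page size, and the marked page are assumptions, not part of the lecture.

```python
PAGE = 4096  # assumed page size in bytes

# Hypothetical per-page protocol table; in hardware this bit would travel
# with the translation and be cached in the TLB entry.
protocol_for_page = {}   # virtual page number -> "update" or "invalidate"

def choose_protocol(addr, default="invalidate"):
    """Return the coherence protocol to use for an access to addr."""
    return protocol_for_page.get(addr // PAGE, default)

# Example: mark one page (say, a producer-consumer buffer) as update-based
protocol_for_page[0x10] = "update"
```

Accesses to the marked page then run under the update protocol, and everything else falls back to invalidation.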

Hybrid inval+update

  • One possible hybrid protocol
    • Keep a counter per cache line and make Dragon update protocol the default
    • Every time the local processor accesses a cache line set its counter to some pre-defined threshold k
    • On each received update decrease the counter by one
    • When the counter reaches zero, the line is invalidated locally, in the hope that the writer will eventually switch from the Sm state to the M state once no sharers are left
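The counter mechanism above can be sketched as follows. This is a minimal model of one cache's view of a single line; the threshold value `K = 4` and the class and method names are illustrative assumptions.

```python
K = 4  # hypothetical threshold k

class HybridLine:
    """One cache's view of a line under the counter-based hybrid protocol."""
    def __init__(self):
        self.valid = False
        self.counter = 0

    def local_access(self):
        # Any local access refreshes the counter to the threshold k
        self.valid = True
        self.counter = K

    def remote_update(self):
        # Each update received from another writer decrements the counter;
        # at zero the line is invalidated locally, so the writer eventually
        # finds no sharers and can move from Sm to M
        if self.valid:
            self.counter -= 1
            if self.counter == 0:
                self.valid = False

line = HybridLine()
line.local_access()          # counter = K, line valid
for _ in range(K):
    line.remote_update()     # K updates with no local access in between
```

After `K` consecutive received updates with no intervening local access, the line drops out of the sharing set on its own, approximating invalidation behavior for lines this cache no longer uses.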

Update-based protocol

  • Update-based protocols tend to increase capacity misses slightly
    • Cache lines stay longer in the cache compared to an invalidation-based protocol; why? Updates keep the lines alive instead of invalidating them, so they continue to occupy capacity even after the local processor is done with them
  • Update-based protocols can significantly reduce coherence misses
    • True sharing misses definitely go down
    • False sharing misses also decrease due to absence of invalidations
  • But update-based protocols significantly increase bus bandwidth demand
    • This increases bus contention and delays other transactions
    • Possible to delay the updates by merging a number of them in a buffer
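The last point can be sketched as a small merge buffer that coalesces pending word updates to the same line before issuing a single bus update. This is an illustrative model only: the class name, the capacity, and the 64-byte line size are assumptions.

```python
LINE = 64  # assumed cache line size in bytes

class UpdateMergeBuffer:
    """Coalesce pending word updates to the same line into one bus update."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pending = {}     # line address -> {byte offset: value}
        self.bus_updates = 0  # update transactions actually put on the bus

    def write(self, addr, value):
        # Merge this word into any pending update for the same line
        line, off = addr - addr % LINE, addr % LINE
        self.pending.setdefault(line, {})[off] = value
        if len(self.pending) > self.capacity:
            self.flush()

    def flush(self):
        # One bus update per line, however many words were merged into it
        self.bus_updates += len(self.pending)
        self.pending.clear()

buf = UpdateMergeBuffer()
for off in range(0, 64, 8):   # eight word writes, all to the same line
    buf.write(off, off)
buf.flush()
```

Without merging, the eight writes would each generate a bus update; with the buffer they collapse into one, which is exactly how delaying updates relieves bus bandwidth pressure.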

Shared cache

  • Advantages
    • If there is only one level of cache, no coherence protocol is needed
    • Very fine-grained sharing resulting in fast cross-thread communication
    • No false sharing
    • Smaller total cache capacity requirement: the threads’ working sets overlap
    • One processor’s fetched cache line can be used by others: prefetch effect
  • Disadvantages
    • High cache bandwidth requirement and port contention
    • Destructive interference and conflict misses
  • Will revisit this when discussing chip multiprocessing and hyper-threading