Module 10: "Design of Shared Memory Multiprocessors"
  Lecture 18: "Introduction to Cache Coherence"
 

Hierarchical design

  • Possible to combine bus-based SMP and DSM to build hierarchical shared memory
    • Sun Wildfire connects four large SMPs (28 processors each) over a scalable interconnect to form a 112-processor multiprocessor
    • IBM POWER4 has two processors on-chip with private L1 caches, but shared L2 and L3 caches (this is called a chip multiprocessor); connect these chips over a network to form scalable multiprocessors
  • Next few lectures will focus on bus-based SMPs only

Cache Coherence

  • Intuitive memory model
    • For sequential programs we expect a memory location to return the latest value written to that location
    • For concurrent programs with multiple threads or processes running on a single processor we expect the same model to hold, because all threads or processes see the same cache hierarchy (they effectively share the same L1 cache)
    • For multiprocessors there is a danger of using a stale value: in an SMP or DSM the caches are not shared, and processors are allowed to replicate data independently in each cache; hardware must ensure that cached values are coherent across the system and that they satisfy the programmer’s intuitive memory model

Example

  • Assume a write-through cache, i.e., every store updates the value in the cache as well as in memory
    • P0: reads x from memory, puts it in its cache, and gets the value 5
    • P1: reads x from memory, puts it in its cache, and gets the value 5
    • P1: writes x=7, updates its cached value and memory value
    • P0: reads x from its cache and gets the value 5
    • P2: reads x from memory, puts it in its cache, and gets the value 7 (now the system is completely incoherent)
    • P2: writes x=10, updates its cached value and memory value
  • Consider the same example with a writeback cache, i.e., values are written back to memory only when the cache line is evicted
    • P0 has a cached value 5, P1 has 7, P2 has 10, memory has 5 (since caches are not write through)
    • The state of the line in P1 and P2 is M (modified/dirty) while the line in P0 is clean
    • Eviction of the line from P1 and P2 will issue writebacks while eviction of the line from P0 will not issue a writeback (clean lines do not need writeback)
    • Suppose P2 evicts the line first, and then P1
    • Final memory value is 7: we lost the store x=10 from P2 (both traces are replayed in the sketch after this example)
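
The two traces above can be replayed with a small C sketch. This is only an illustration of the problem, not any real cache or protocol; the three one-line private caches and the helper names (wt_load, wt_store, wb_evict) are assumptions made for this example.

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy model: one memory word x and three private one-line caches.
     * Names and structure are illustrative only. */
    struct line { bool valid; bool dirty; int data; };

    static int memory_x = 5;              /* initial value of x          */
    static struct line cache[3];          /* caches of P0, P1, P2        */

    /* ---- write-through behavior ---- */
    static int wt_load(int p) {
        if (!cache[p].valid) {            /* miss: fetch from memory     */
            cache[p].valid = true;
            cache[p].data  = memory_x;
        }
        return cache[p].data;             /* a hit may return stale data */
    }
    static void wt_store(int p, int v) {  /* update cache and memory,    */
        cache[p].valid = true;            /* but NOT the other caches    */
        cache[p].data  = v;
        memory_x       = v;
    }

    /* ---- writeback eviction ---- */
    static void wb_evict(int p) {
        if (cache[p].valid && cache[p].dirty)
            memory_x = cache[p].data;     /* only dirty (M) lines write back */
        cache[p].valid = false;
    }

    int main(void) {
        /* write-through trace from the example */
        printf("P0 reads %d\n", wt_load(0));          /* 5                */
        printf("P1 reads %d\n", wt_load(1));          /* 5                */
        wt_store(1, 7);                               /* memory = 7       */
        printf("P0 reads %d (stale)\n", wt_load(0));  /* still 5          */
        printf("P2 reads %d\n", wt_load(2));          /* 7                */
        wt_store(2, 10);                              /* memory = 10      */

        /* writeback trace: P0 holds clean 5, P1 dirty 7, P2 dirty 10,
         * memory still holds 5 */
        memory_x = 5;
        cache[0] = (struct line){ true, false, 5  };
        cache[1] = (struct line){ true, true,  7  };
        cache[2] = (struct line){ true, true,  10 };
        wb_evict(2);                                  /* memory = 10      */
        wb_evict(1);                                  /* memory = 7       */
        printf("final memory value = %d\n", memory_x);/* store of 10 lost */
        return 0;
    }
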

What went wrong?

  • For a write-through cache
    • The memory value may be correct if the writes are correctly ordered
    • But the system allowed a store to proceed while another processor already held a cached copy
    • Lesson learned: must invalidate all cached copies before allowing a store to proceed
  • For a writeback cache
    • Problem is even more complicated: stores are no longer visible to memory immediately
    • Writeback order is important
    • Lesson learned: do not allow more than one copy of a cache line in M state (see the sketch after this list)
  • Need to formalize the intuitive memory model
    • In sequential programs the order of read/write is defined by the program order; the notion of “last write” is well-defined
    • For multiprocessors, how do you define the “last write to a memory location” in the presence of independent caches?
    • Within a processor program order still defines it, but how do you order reads and writes across processors?
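
The two lessons above can be sketched in C as a toy invalidation scheme: a store first invalidates every other cached copy, so at most one cache ever holds the line in M state, and a load that misses forces the dirty owner (if any) to write back first. This is only a sketch under simplifying assumptions (one memory word, three processors, no bus, only I/S/M states); the names load, store, and st are illustrative and this is not a full protocol specification.

    #include <stdio.h>

    enum state { I, S, M };                 /* Invalid, Shared, Modified  */

    static int memory_x = 5;                /* the single memory word x   */
    static enum state st[3] = { I, I, I };  /* line state in P0, P1, P2   */
    static int        val[3];               /* cached data in P0, P1, P2  */

    static int load(int p) {
        if (st[p] == I) {                   /* miss                       */
            for (int q = 0; q < 3; q++)
                if (st[q] == M) {           /* dirty owner supplies data  */
                    memory_x = val[q];
                    st[q] = S;              /* owner downgrades to S      */
                }
            val[p] = memory_x;
            st[p]  = S;
        }
        return val[p];
    }

    static void store(int p, int v) {
        for (int q = 0; q < 3; q++)         /* invalidate all other copies */
            if (q != p && st[q] != I) {
                if (st[q] == M)
                    memory_x = val[q];      /* flush the single dirty owner */
                st[q] = I;
            }
        st[p]  = M;                         /* now the only copy, in M     */
        val[p] = v;
    }

    int main(void) {
        printf("P0 reads %d\n", load(0));   /* 5                           */
        printf("P1 reads %d\n", load(1));   /* 5                           */
        store(1, 7);                        /* P0's copy is invalidated    */
        printf("P0 reads %d\n", load(0));   /* 7, not the stale 5          */
        store(2, 10);                       /* P1 invalidated; P2 is the   */
                                            /* single M copy               */
        printf("P0 reads %d\n", load(0));   /* 10                          */
        return 0;
    }

With these two rules the write-through trace no longer returns the stale value 5 to P0, and the writeback trace cannot lose the store x=10, because only one dirty owner of the line exists at any time.
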