Module 9: "Introduction to Shared Memory Multiprocessors"
  Lecture 16: "Multiprocessor Organizations and Cache Coherence"
 

Shared vs. private in CMPs

  • Shared caches in CMPs are often very large
    • They are banked to avoid worst-case wire delay
    • The banks are usually distributed across the floor of the chip on an interconnect
  • In shared caches, getting a block from a remote bank takes time proportional to the physical distance between the requester and the bank
    • Non-uniform cache architecture (NUCA)
  • The same holds for private caches when the data resides in a remote cache
  • A shared cache may have a higher average hit latency than a private cache
    • Most hits in a private cache are expected to be local
  • Shared caches are likely to have fewer misses than private caches
    • Private caches waste capacity because the same block may be replicated in several of them
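
The distance-dependent hit latency of a banked shared cache can be illustrated with a toy model. This is only a sketch under stated assumptions: the 4x4 mesh, the bank access time, the per-hop latency, and the function names are all hypothetical and do not come from the lecture.

```python
# Toy NUCA latency model. All parameters are illustrative assumptions,
# not numbers from the lecture.
MESH_DIM = 4       # assumed 4x4 grid of tiles, one cache bank per tile
BANK_ACCESS = 10   # assumed cycles to access a bank once it is reached
HOP_LATENCY = 2    # assumed cycles per interconnect hop


def hops(src, dst):
    """Manhattan distance between two (row, col) tiles on the mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def hit_latency(requester, bank):
    """Hit latency: the request travels to the bank and the data travels back."""
    return BANK_ACCESS + 2 * HOP_LATENCY * hops(requester, bank)


def average_shared_hit_latency(requester):
    """Average hit latency if blocks are spread evenly over all banks."""
    banks = [(r, c) for r in range(MESH_DIM) for c in range(MESH_DIM)]
    return sum(hit_latency(requester, b) for b in banks) / len(banks)


if __name__ == "__main__":
    corner = (0, 0)
    print(hit_latency(corner, corner))         # hit in the local bank
    print(hit_latency(corner, (3, 3)))         # hit in the farthest bank
    print(average_shared_hit_latency(corner))  # average over all banks
```

Under these assumed numbers, a hit in the local bank is much cheaper than the average over all banks, which is why a private organization with mostly local hits can have a lower average hit latency even though it tends to miss more often.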

Cache coherence

  • Not unique to multiprocessors
    • Even uniprocessor computers need to worry about cache coherence
    • For sequential programs we expect a memory location to return the latest value written
    • For concurrent programs running as multiple threads or processes on a single processor we expect the same model to hold, because all threads see the same cache hierarchy (equivalent to sharing an L1 cache)
    • For multiprocessors there remains a danger of using a stale value: hardware must ensure that cached values are coherent across the system and that they satisfy programmers’ intuitive memory model

Cache coherence: Example

  • Assume each processor has a write-through cache
    • P0: reads x from memory, puts it in its cache, and gets the value 5
    • P1: reads x from memory, puts it in its cache, and gets the value 5
    • P1: writes x=7, updates its cached value and memory value
    • P0: reads x from its cache and gets the value 5
    • P2: reads x from memory, puts it in its cache, and gets the value 7 (now the system is completely incoherent)
    • P2: writes x=10, updates its cached value and memory value
  • Consider the same example with writeback caches
    • P0 has a cached value 5, P1 has 7, P2 has 10, memory has 5 (since the caches are not write-through, neither store has reached memory, and P2 read 5 rather than 7)
    • The state of the line in P1 and P2 is M (modified) while the line in P0 is clean
    • Eviction of the line from P1 and P2 will issue writebacks while eviction of the line from P0 will not issue a writeback (clean lines do not need writeback)
    • Suppose P2 evicts the line first, and then P1
    • Final memory value is 7: we lost the store x=10 from P2
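
Both versions of the example can be replayed with a small simulation. This is a minimal sketch, assuming three private caches with no coherence protocol at all; the `Cache` class and `run_example` function are hypothetical names introduced only for illustration.

```python
class Cache:
    """A private cache with no coherence protocol (illustration only)."""

    def __init__(self, memory, write_through):
        self.memory = memory            # shared main memory: address -> value
        self.write_through = write_through
        self.lines = {}                 # address -> cached value
        self.dirty = set()              # addresses in state M (modified)

    def read(self, addr):
        if addr not in self.lines:      # miss: fill the line from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]         # hit: may return a stale value

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.memory[addr] = value   # memory is updated on every store
        else:
            self.dirty.add(addr)        # memory is updated only on eviction

    def evict(self, addr):
        if addr in self.dirty:          # only modified lines are written back
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)


def run_example(write_through):
    """Replay the P0/P1/P2 sequence and return memory, caches, values read."""
    memory = {"x": 5}
    p0, p1, p2 = (Cache(memory, write_through) for _ in range(3))
    reads = []
    reads.append(p0.read("x"))   # P0 reads x
    reads.append(p1.read("x"))   # P1 reads x
    p1.write("x", 7)             # P1 writes x=7
    reads.append(p0.read("x"))   # P0 reads x from its cache
    reads.append(p2.read("x"))   # P2 reads x
    p2.write("x", 10)            # P2 writes x=10
    return memory, (p0, p1, p2), reads


if __name__ == "__main__":
    memory, (p0, p1, p2), reads = run_example(write_through=False)
    p2.evict("x")                # P2 evicts first, writing back 10
    p1.evict("x")                # P1 evicts next, overwriting it with 7
    print(reads, memory["x"])
```

With `write_through=True` the sequence of values read is 5, 5, 5, 7 and memory ends at 10 while P0 still caches 5. With `write_through=False`, evicting P2's line and then P1's line leaves 7 in memory, reproducing the lost store.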