Module 14: "Directory-based Cache Coherence"
  Lecture 29: "Basics of Directory"
 

What is needed?

  • On every memory operation
    • Find the state of the cache line (normally present in cache)
    • Tell others about the operation if needed (achieved by broadcasting in bus-based or small-scale distributed systems)
    • Other processors must take appropriate actions
    • Also need a mechanism to resolve races between concurrent conflicting accesses (i.e., at least one of them is a write): this essentially requires some central point of control on a per-cache-line basis
    • An atomic bus provides an easy way of serializing such accesses
    • A split-transaction bus with distributed request tables works only because every request table sees every transaction
  • Need to have a table that gets accessed/updated on every cache line access
    • This is the directory
    • Every cache line has a separate directory entry
    • The directory entry stores the state of the line, who the current owner is (if any), the sharers (if any), etc. (a sketch of such an entry follows this list)
    • On a miss, the directory entry must be located, and appropriate coherence action must be taken
    • A popular architecture is to have a two-level hierarchy: each node is an SMP kept coherent internally via a snoopy or directory protocol, while the nodes are kept coherent with each other by a scalable directory protocol (Convex Exemplar: directory-directory; Sequent, Data General, HAL, DASH: snoopy-directory)
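
As a minimal sketch (in C, with all names hypothetical), a full-bit-vector directory entry for a machine of up to 64 nodes might look like the following, together with the bookkeeping the home node would do on a read miss; message sends and the actual data transfer are abstracted away.

    #include <stdint.h>

    /* Hypothetical per-line directory entry for up to 64 nodes.  A real
     * machine packs these bits into memory alongside the data. */
    typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

    typedef struct {
        dir_state_t state;    /* coherence state of the memory line      */
        uint8_t     owner;    /* valid only when state == DIR_MODIFIED   */
        uint64_t    sharers;  /* bit i set => node i holds a shared copy */
    } dir_entry_t;

    /* Directory bookkeeping at the home node on a read miss from node
     * 'requester'; only the state update is shown, not the messages. */
    void dir_read_miss(dir_entry_t *e, int requester)
    {
        switch (e->state) {
        case DIR_UNCACHED:                 /* memory holds the only copy   */
        case DIR_SHARED:                   /* some nodes hold clean copies */
            /* reply with data from memory and record the new sharer */
            e->sharers |= (1ULL << requester);
            e->state = DIR_SHARED;
            break;
        case DIR_MODIFIED:
            /* fetch the dirty line from the owner (or forward the request),
             * then downgrade: owner and requester both become sharers */
            e->sharers = (1ULL << e->owner) | (1ULL << requester);
            e->state = DIR_SHARED;
            break;
        }
    }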

Advantages of MP nodes

  • Amortization of node fixed cost over multiple processors; can use commodity SMPs
  • Much communication may be contained within a node, i.e., less “remote” communication
  • Request combining can be done by some extra hardware in the memory controller (a sketch follows this list)
  • Possible to share caches e.g., chip multiprocessor nodes (IBM POWER4 and POWER5) or hyper-threaded nodes (Intel Xeon MP)
  • Exact benefit depends on sharing pattern
    • Widely shared data or nearest-neighbor communication (if properly mapped) may do well
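
As a rough illustration of the request-combining idea (not any particular machine's controller; all names and sizes are made up), the memory controller of an MP node can merge outstanding misses from its local processors to the same line into a single outgoing remote request:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_OUTSTANDING 8   /* assumed depth of the combining table */

    /* Hypothetical combining table: at most one outgoing remote request
     * per cache line, with a bit vector of local processors waiting on it. */
    typedef struct {
        bool     valid;
        uint64_t line_addr;   /* cache-line address of the request     */
        uint8_t  waiters;     /* bit i set => local CPU i is waiting   */
    } combine_entry_t;

    static combine_entry_t table[MAX_OUTSTANDING];

    /* Returns true if a new remote request must be sent, false if the
     * miss was combined with one that is already outstanding. */
    bool issue_remote_read(uint64_t line_addr, int local_cpu)
    {
        for (int i = 0; i < MAX_OUTSTANDING; i++) {
            if (table[i].valid && table[i].line_addr == line_addr) {
                table[i].waiters |= (uint8_t)(1u << local_cpu);  /* combine */
                return false;
            }
        }
        for (int i = 0; i < MAX_OUTSTANDING; i++) {
            if (!table[i].valid) {
                table[i].valid     = true;
                table[i].line_addr = line_addr;
                table[i].waiters   = (uint8_t)(1u << local_cpu);
                return true;                 /* first miss: send one request */
            }
        }
        return true;  /* table full: fall back to sending anyway */
    }

When the reply arrives, the controller would deliver the line to every processor recorded in 'waiters', so only one request and one reply cross the network.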

Disadvantages

  • Snoopy bus delays all accesses
    • The local snoop must complete first
    • Only then can a request be sent to the remote home
    • The same delay may be incurred at the remote home as well, depending on the coherence scheme
    • This led the SGI Origin 2000 to use dual-processor nodes, but with coherence managed entirely by the directory (no intra-node snooping)
  • Bandwidth at critical points is shared by all processors
    • System bus, memory controller, DRAM, router
    • Bad communication patterns can actually result in execution time larger than with uniprocessor nodes, even though the average “hop” count may be smaller; e.g., compare two 16P systems, one with four 4-way nodes and one with sixteen uniprocessor nodes (a toy latency model follows)
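
A back-of-the-envelope model of the last point (purely illustrative; every latency and queueing number below is invented) shows how the four-node system can lose despite needing fewer network hops: its four processors share one memory controller, so a bursty all-to-all pattern queues up at that controller and the queueing delay outweighs the saved hop latency.

    #include <stdio.h>

    /* Toy remote-miss latency model: hops * per-hop latency plus queueing
     * at the shared memory controller of the home node. */
    int main(void)
    {
        const double hop_ns     = 50.0;   /* assumed per-hop network latency  */
        const double service_ns = 100.0;  /* assumed controller service time  */

        /* System A: 16 uniprocessor nodes, assume ~3 hops on average and
         * roughly one request in service per controller. */
        double lat_a = 3.0 * hop_ns + 1.0 * service_ns;

        /* System B: 4 nodes of 4 processors, assume ~2 hops on average but
         * ~4 requests queued at each controller under all-to-all traffic. */
        double lat_b = 2.0 * hop_ns + 4.0 * service_ns;

        printf("16 x 1P nodes: ~%.0f ns per remote miss\n", lat_a);  /* ~250 ns */
        printf(" 4 x 4P nodes: ~%.0f ns per remote miss\n", lat_b);  /* ~500 ns */
        return 0;
    }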