Module 14: "Directory-based Cache Coherence"
  Lecture 31: "Managing Directory Overhead"
 

Starvation

  • NACKs can cause starvation
    • Build a FIFO list of waiters, either in home memory (Chaudhuri and Heinrich, 2004) or as a distributed linked list (IEEE Scalable Coherent Interface)
      • Former imposes large occupancy on home, yet offers better performance by read combining
      • Latter is an extremely complex protocol with a large number of transient states and 29 stable states, but does distribute the occupancy across the system
    • Origin 2000 devotes extra bits in the directory to raise priority of requests NACKed too many times (above a threshold)
    • Use delay between retries
    • Use Alpha GS320 protocol (will discuss later)
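
Two of the remedies above, the delay between retries and the NACK-count threshold, can be illustrated with a toy model. This is only a sketch under stated assumptions: `backoff_delay`, `DirectoryEntry`, the `reserved_for` field, and the numeric defaults are all hypothetical names and values chosen for illustration, and the code does not reproduce the actual Origin 2000 mechanism.

```python
import random

def backoff_delay(nack_count, base=16, cap=1024):
    """Cycles to wait before the retry that follows NACK number `nack_count`.

    Assumption: exponential backoff with random jitter; the lecture only
    says "use delay between retries" and does not fix a policy.
    """
    window = min(cap, base * 2 ** (nack_count - 1))
    return random.randint(window // 2, window)

class DirectoryEntry:
    """Toy home-side entry with a hypothetical priority reservation."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        self.reserved_for = None  # requester promoted to high priority

    def handle(self, requester, nack_count, busy):
        """`nack_count` is how many NACKs this request has already received."""
        # While a starving requester holds the reservation, everyone else
        # is NACKed even if the line is free.
        if self.reserved_for is not None and requester != self.reserved_for:
            return "NACK"
        if busy:
            # Promote a request that has been NACKed too many times.
            if nack_count >= self.threshold:
                self.reserved_for = requester
            return "NACK"
        self.reserved_for = None
        return "ACK"
```

In this model a requester that reaches the threshold is guaranteed to be the next one served once the line stops being busy, which bounds its waiting time at the cost of extra state in the directory entry.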

Overflow schemes

  • How to make the directory size independent of the number of processors?
    • Basic idea is to use a bit vector scheme as long as the total number of sharers does not exceed the directory entry width
    • When the number of sharers overflows, the hardware resorts to an “overflow scheme”
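      • Dir_i B (broadcast): i sharer bits, broadcast invalidation on overflow
      • Dir_i NB (no broadcast): on overflow, pick one existing sharer and invalidate it to make room for the new one
      • Dir_i CV (coarse vector): assign one bit to each group of P/i nodes, where P is the number of nodes; on a write, broadcast invalidations to every group whose bit is set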
        • May generate useless invalidations
  • Dir_i DP (Stanford FLASH)
    • DP stands for dynamic pointer
    • Allocate directory entries from a free list pool maintained in memory
    • Need replacement hints
    • Still may run into reclamation mode if free list pool is not sized properly at boot time
      • How do you size it?
    • If replacement hints are not supported, assume k sharers on average per cache block (k=8 is found to be good)
    • Reclamation algorithms?
      • Pick a random cache line and invalidate it
  • Dir_i SW (MIT Alewife)
    • Trap to software on overflow
    • Software maintains the information about overflowed sharers
    • MIT Alewife has a directory entry with five pointers and a local bit (i.e. overflow threshold is five or six)
      • Remote read before overflow takes 40 cycles and after overflow takes 425 cycles
      • Five invalidations take 84 cycles while six invalidations take 707 cycles
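
The difference between the broadcast, no-broadcast, and coarse-vector overflow schemes is easiest to see on a single directory entry. The sketch below is a toy model, not any machine's implementation: the class `LimitedSharerEntry`, its method names, and the choice of evicting the oldest sharer in the no-broadcast case are assumptions made for illustration.

```python
class LimitedSharerEntry:
    """Toy directory entry tracking at most i sharers exactly.

    policy: 'B' (broadcast), 'NB' (no broadcast), or 'CV' (coarse vector).
    """

    def __init__(self, num_nodes, i, policy):
        assert policy in ("B", "NB", "CV")
        self.P, self.i, self.policy = num_nodes, i, policy
        self.group_size = -(-num_nodes // i)  # ceil(P / i) nodes per CV bit
        self.sharers = []                     # exact sharers, at most i
        self.overflowed = False
        self.groups = set()                   # CV only: groups holding a sharer

    def add_sharer(self, node):
        """Record a read by `node`; return the nodes invalidated right away."""
        if self.overflowed:
            if self.policy == "CV":
                self.groups.add(node // self.group_size)
            return []                         # 'B' keeps no further state
        if node in self.sharers:
            return []
        if len(self.sharers) < self.i:
            self.sharers.append(node)
            return []
        # Overflow: all i slots are in use.
        if self.policy == "NB":
            victim = self.sharers.pop(0)      # assumption: evict the oldest
            self.sharers.append(node)
            return [victim]
        self.overflowed = True
        if self.policy == "CV":
            self.groups = {n // self.group_size for n in self.sharers + [node]}
        self.sharers = []
        return []

    def write_targets(self, writer):
        """Nodes that receive an invalidation when `writer` writes the block."""
        if not self.overflowed:
            targets = set(self.sharers)
        elif self.policy == "B":
            targets = set(range(self.P))
        else:  # CV: every node in a group whose bit is set
            targets = {n for n in range(self.P)
                       if n // self.group_size in self.groups}
        targets.discard(writer)
        return sorted(targets)
```

For example, with 16 nodes, i = 4, and readers 0, 1, 2, 3, 9 followed by a write from node 0, the broadcast policy invalidates all 15 other nodes, while the coarse-vector policy invalidates nodes 1, 2, 3, 8, 9, 10, 11. Nodes 8, 10, and 11 never held the block, which is the "useless invalidations" cost noted above.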

Sparse directory

  • How to reduce the height of the directory?
    • Observation: total number of cache blocks in all processors is far less than total number of memory blocks
      • Assume a 32 MB L3 cache and 4 GB memory: less than 1% of directory entries are active at any point in time
    • Idea is to organize directory as a highly associative cache
    • On a directory entry “eviction”, send invalidations to all sharers, or retrieve the line if it is dirty
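
The arithmetic behind the "less than 1%" observation, together with the eviction behaviour, can be sketched as follows. This is a minimal illustration under assumptions: the class `SparseDirectory`, the LRU replacement policy, and the modulo set index are hypothetical choices, and retrieval of dirty lines is not modelled.

```python
from collections import OrderedDict

# Fraction of memory blocks that can be cached at once; the block size
# cancels out, so the byte ratio is enough.
cache_bytes = 32 * 2**20    # 32 MB L3 cache
memory_bytes = 4 * 2**30    # 4 GB memory
active_fraction = cache_bytes / memory_bytes  # 1/128, about 0.78%

class SparseDirectory:
    """Toy set-associative directory holding entries for cached blocks only."""

    def __init__(self, num_sets, ways):
        self.num_sets, self.ways = num_sets, ways
        # One OrderedDict per set: block -> set of sharer nodes, LRU first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def add_sharer(self, block, node):
        """Record that `node` caches `block`.

        Returns (evicted_block, sharers_to_invalidate) if a directory
        entry had to be evicted to make room, otherwise None.
        """
        entries = self.sets[block % self.num_sets]
        evicted = None
        if block in entries:
            entries.move_to_end(block)        # mark as most recently used
        else:
            if len(entries) == self.ways:
                victim, sharers = entries.popitem(last=False)  # LRU entry
                evicted = (victim, sorted(sharers))
            entries[block] = set()
        entries[block].add(node)
        return evicted
```

When `add_sharer` returns an evicted entry, the caller would send invalidations to every listed sharer, since the directory can no longer track that block. High associativity matters because each such eviction removes useful lines from processor caches.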