Module 14: "Directory-based Cache Coherence"
  Lecture 29: "Basics of Directory"
 

Sharing pattern

  • Problem is with the writes
    • Frequently written cache lines exhibit a small number of sharers, so writes generate only a few invalidations
    • Widely shared data are written infrequently; so large number of invalidations, but rare
    • Synchronization variables are notorious: heavily contended locks are widely shared and written in quick succession, generating a burst of invalidations; they require special solutions such as queue locks or tree barriers
    • What about interventions? These are very problematic because an intervention cannot be sent before the directory is looked up, and any speculative memory lookup launched in parallel would be useless since the owner's cached copy, not memory, holds the valid data
    • For scientific applications interventions are few due to the mostly one-producer, many-consumer sharing pattern; for database workloads they take the lion's share due to the migratory sharing pattern, and they tend to increase with larger caches
  • Optimizing interventions related to migratory sharing has been a major focus of high-end scalable servers
    • AlphaServer GS320 employs a few optimizations to quickly resolve races related to migratory hand-off (more later)
    • Some academic research has looked at destination or owner prediction to speculatively send interventions even before consulting the directory (Martin and Hill 2003, Acacio et al. 2002)
  • In general, directory provides far better utilization of bandwidth for scalable MPs compared to broadcast

Directory organization

  • How to find source of directory information
    • Centralized: just access the directory (bandwidth limited)
    • Distributed: a flat scheme distributes the directory with the memory, so every cache line has a home node where its memory and directory entry reside
    • A hierarchical scheme organizes the processors as the leaves of a logical tree (need not be binary); an internal node stores the directory entries for the memory lines local to its children. A directory entry essentially tells which of the node's child subtrees are caching the line and whether some subtree outside its children is also caching it. Finding the directory entry of a cache line involves a traversal up the tree until the entry is found; inclusion is maintained between the level-k and level-(k+1) directory nodes, with the root at the highest level, so in the worst case the traversal goes all the way to the root
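The upward traversal described above can be sketched as follows. This is a toy illustration; the `DirNode` structure and the `lookup` function are invented for this sketch, not taken from any real design:

```python
# Hypothetical sketch of a hierarchical directory lookup. Each internal
# node keeps directory entries only for lines cached somewhere below it;
# a miss walks up toward the root until an entry is found.

class DirNode:
    def __init__(self, parent=None):
        self.parent = parent
        self.entries = {}   # line address -> set of child subtrees caching it

def lookup(leaf, line):
    """Walk up from the requesting leaf; worst case reaches the root."""
    node, hops = leaf.parent, 1
    while node is not None:
        if line in node.entries:
            return node, hops
        node, hops = node.parent, hops + 1
    return None, hops   # line is not cached anywhere in the machine
```

The hop count makes the latency problem visible: every level adds a network traversal, which is one reason hierarchical schemes fell out of favor.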
  • Format of a directory entry
    • Varies a lot: no specific rule
    • Memory-based scheme: the directory entry is co-located with the memory line in the home node; various organizations can be used, the most popular being a simple bit vector (with a 128-byte line, the storage overhead for 64 nodes is 6.35%, for 256 nodes 25%, for 1024 nodes 100%); clearly this does not scale with P (more later)
    • Cache-based scheme: organize the directory as a distributed linked list in which the sharer nodes form a chain; the cache tag is extended to hold a node number, and the home node knows only the id of the first sharer. On a read miss the requester adds itself at the head of the list (involves the home node and the first sharer); on a write miss the list is traversed and each sharer invalidated (an essentially serialized chain of messages). Advantages: contention is distributed, the home node does not become a hot-spot, and the storage overhead is fixed; but the protocol is very complex (IEEE SCI standard)
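The memory-based bit-vector entry can be illustrated with a minimal sketch. The class and function names are invented for illustration; a real entry also tracks line state (clean/dirty, owner) and must handle races:

```python
# Toy memory-based bit-vector directory entry for a num_nodes-node
# machine: one presence bit per node.

class BitVectorEntry:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = 0            # bit i set => node i caches the line

    def read_miss(self, node):
        self.sharers |= 1 << node   # record the new sharer

    def write_miss(self, writer):
        # All current sharers except the writer must be invalidated;
        # these invalidations can be sent in parallel (or multicast).
        targets = [i for i in range(self.num_nodes)
                   if (self.sharers >> i) & 1 and i != writer]
        self.sharers = 1 << writer  # writer becomes the sole holder
        return targets

def presence_overhead(num_nodes, line_bytes=128):
    # Presence bits only (real entries add a few state bits, which is
    # presumably why the figures quoted above are slightly higher).
    return num_nodes / (line_bytes * 8)
```

Note how the presence-bit overhead grows linearly with the node count, which is the scalability problem the schemes below try to address.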
  • A lot of research has been done to reduce the directory storage overhead
    • The trade-off is between preciseness of information and performance
    • The usual trick is to keep a superset of the sharing information, e.g., group every two sharers into a cluster and keep one bit per cluster; this may lead to one useless invalidation per cluster
    • We will explore this in detail later
    • Memory-based bitvector scheme is very popular: invalidations can be overlapped or multicast
    • Cache-based schemes incur serialized message chain for invalidation
    • Hierarchical schemes are not used much due to the high latency and volume of messages (traversals go up and down the tree); also, the root may become a bandwidth bottleneck
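The coarse-vector superset trick mentioned above (one bit per cluster of sharers) can be sketched as follows. Names are invented for illustration; a real design would pick the cluster size to balance storage against the useless invalidations:

```python
# Toy coarse-vector directory entry: with cluster size C, one bit
# covers C nodes, shrinking the entry from P to P/C bits at the cost
# of up to C-1 useless invalidations per marked cluster.

class CoarseVectorEntry:
    def __init__(self, num_nodes, cluster_size):
        self.cluster_size = cluster_size
        self.num_clusters = num_nodes // cluster_size
        self.bits = 0               # bit c set => some node in cluster c shares

    def read_miss(self, node):
        self.bits |= 1 << (node // self.cluster_size)

    def invalidation_targets(self, writer):
        # A superset of the true sharers: every node in every marked
        # cluster must be invalidated, whether or not it holds the line.
        targets = []
        for c in range(self.num_clusters):
            if (self.bits >> c) & 1:
                targets.extend(range(c * self.cluster_size,
                                     (c + 1) * self.cluster_size))
        return [t for t in targets if t != writer]
```

With cluster size 2 and a single sharer, the write miss still invalidates both nodes of that cluster: exactly the one-useless-invalidation-per-cluster cost described above, traded for half the storage.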