Module 14: "Directory-based Cache Coherence"
  Lecture 29: "Basics of Directory"
 

Sharing pattern

  • Problem is with the writes
    • Frequently written cache lines exhibit a small number of sharers, so writes generate only a few invalidations
    • Widely shared data are written infrequently; so large number of invalidations, but rare
    • Synchronization variables are notorious: heavily contended locks are widely shared and written in quick succession, generating a burst of invalidations; they require special solutions such as queue locks or tree barriers
    • What about interventions? These are very problematic because an intervention cannot be sent before the directory is looked up, and any speculative memory lookup launched in parallel would be useless since the owner's cached copy, not memory, holds the valid data
    • For scientific applications interventions are few due to the mostly one-producer, many-consumer sharing pattern; for database workloads they take the lion's share due to the migratory sharing pattern, and they tend to increase with larger caches
  • Optimizing interventions related to migratory sharing has been a major focus of high-end scalable servers
    • AlphaServer GS320 employs a few optimizations to quickly resolve races related to migratory hand-off (more later)
    • Some academic research has looked at destination or owner prediction to speculatively send interventions even before consulting the directory (Martin and Hill 2003, Acacio et al. 2002)
  • In general, directory provides far better utilization of bandwidth for scalable MPs compared to broadcast

Directory organization

  • How to find source of directory information
    • Centralized: just access the directory (bandwidth limited)
    • Distributed: a flat scheme distributes the directory with the memory, so every cache line has a home node where its memory and directory entry reside
    • A hierarchical scheme organizes the processors as the leaves of a logical tree (need not be binary); an internal node stores the directory entries for the memory lines local to its children. A directory entry essentially tells which of the node's child subtrees are caching the line and whether some subtree outside its children is also caching it. Finding the directory entry of a cache line involves a traversal up the tree until the entry is found; inclusion is maintained between the level-k and level-(k+1) directory nodes, with the root at the highest level, so in the worst case the traversal goes all the way to the root
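The upward traversal described above can be sketched as follows. This is a toy illustration; the `DirNode` structure and the `lookup` function are invented for this sketch, not taken from any real design:

```python
# Hypothetical sketch of a hierarchical directory lookup. Each internal
# node keeps directory entries only for lines cached somewhere below it;
# a miss walks up toward the root until an entry is found.

class DirNode:
    def __init__(self, parent=None):
        self.parent = parent
        self.entries = {}   # line address -> set of child subtrees caching it

def lookup(leaf, line):
    """Walk up from the requesting leaf; worst case reaches the root."""
    node, hops = leaf.parent, 1
    while node is not None:
        if line in node.entries:
            return node, hops
        node, hops = node.parent, hops + 1
    return None, hops   # line is not cached anywhere in the machine
```

The hop count makes the latency problem visible: every level adds a network traversal, which is one reason hierarchical schemes fell out of favor.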
  • Format of a directory entry
    • Varies a lot: no specific rule
    • Memory-based scheme: the directory entry is co-located with the memory line in the home node; various organizations can be used, the most popular being a simple bit vector (with a 128-byte line, the storage overhead for 64 nodes is 6.35%, for 256 nodes 25%, for 1024 nodes 100%); clearly this does not scale with P (more later)
    • Cache-based scheme: organize the directory as a distributed linked list in which the sharer nodes form a chain; the cache tag is extended to hold a node number, and the home node knows only the id of the first sharer. On a read miss the requester adds itself at the head of the list (involves the home node and the first sharer); on a write miss the list is traversed and each sharer invalidated (an essentially serialized chain of messages). Advantages: contention is distributed, the home node does not become a hot-spot, and the storage overhead is fixed; but the protocol is very complex (IEEE SCI standard)
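The memory-based bit-vector entry can be illustrated with a minimal sketch. The class and function names are invented for illustration; a real entry also tracks line state (clean/dirty, owner) and must handle races:

```python
# Toy memory-based bit-vector directory entry for a num_nodes-node
# machine: one presence bit per node.

class BitVectorEntry:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = 0            # bit i set => node i caches the line

    def read_miss(self, node):
        self.sharers |= 1 << node   # record the new sharer

    def write_miss(self, writer):
        # All current sharers except the writer must be invalidated;
        # these invalidations can be sent in parallel (or multicast).
        targets = [i for i in range(self.num_nodes)
                   if (self.sharers >> i) & 1 and i != writer]
        self.sharers = 1 << writer  # writer becomes the sole holder
        return targets

def presence_overhead(num_nodes, line_bytes=128):
    # Presence bits only (real entries add a few state bits, which is
    # presumably why the figures quoted above are slightly higher).
    return num_nodes / (line_bytes * 8)
```

Note how the presence-bit overhead grows linearly with the node count, which is the scalability problem the schemes below try to address.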
  • A lot of research has been done to reduce the directory storage overhead
    • The trade-off is between preciseness of information and performance
    • The usual trick is to keep a superset of the sharing information, e.g., group every two sharers into a cluster and keep one bit per cluster; this may lead to one useless invalidation per cluster
    • We will explore this in detail later
    • Memory-based bitvector scheme is very popular: invalidations can be overlapped or multicast
    • Cache-based schemes incur serialized message chain for invalidation
    • Hierarchical schemes are not used much due to the high latency and volume of messages (traversals go up and down the tree); also, the root may become a bandwidth bottleneck
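The coarse-vector superset trick mentioned above (one bit per cluster of sharers) can be sketched as follows. Names are invented for illustration; a real design would pick the cluster size to balance storage against the useless invalidations:

```python
# Toy coarse-vector directory entry: with cluster size C, one bit
# covers C nodes, shrinking the entry from P to P/C bits at the cost
# of up to C-1 useless invalidations per marked cluster.

class CoarseVectorEntry:
    def __init__(self, num_nodes, cluster_size):
        self.cluster_size = cluster_size
        self.num_clusters = num_nodes // cluster_size
        self.bits = 0               # bit c set => some node in cluster c shares

    def read_miss(self, node):
        self.bits |= 1 << (node // self.cluster_size)

    def invalidation_targets(self, writer):
        # A superset of the true sharers: every node in every marked
        # cluster must be invalidated, whether or not it holds the line.
        targets = []
        for c in range(self.num_clusters):
            if (self.bits >> c) & 1:
                targets.extend(range(c * self.cluster_size,
                                     (c + 1) * self.cluster_size))
        return [t for t in targets if t != writer]
```

With cluster size 2 and a single sharer, the write miss still invalidates both nodes of that cluster: exactly the one-useless-invalidation-per-cluster cost described above, traded for half the storage.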