Sharing pattern
- The problem is with the writes
- Frequently written cache lines exhibit a small number of sharers, so each write causes only a few invalidations
- Widely shared data are written infrequently, so invalidations are numerous but rare
- Synchronization variables are notorious: a heavily contended lock is widely shared and written in quick succession, generating bursts of invalidations; these require special solutions such as queue locks or tree barriers (a queue-style lock sketch follows this list)
- What about interventions? These are very problematic: an intervention cannot be sent before the directory lookup completes, and any speculative memory lookup is useless because the owner's copy, not memory's, is the valid one
- For scientific applications interventions are few, since sharing mostly follows a one-producer many-consumer pattern; for database workloads they take the lion's share due to the migratory pattern and tend to increase with bigger caches (a small example of migratory sharing follows this list)
- Optimizing interventions related to migratory sharing has been a major focus of high-end scalable servers
- The AlphaServer GS320 employs a few optimizations to quickly resolve races related to migratory hand-off (more later)
- Some academic research looked at destination or owner prediction to speculatively send interventions even before consulting the directory (Martin and Hill 2003; Acacio et al. 2002)
- In general, a directory provides far better bandwidth utilization for scalable MPs compared to broadcast
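
One queue-style lock is the array-based (Anderson) lock; a minimal C11 sketch follows. The names (qlock_t, MAXP) and the bound on concurrent contenders are my own illustrative assumptions, not from the notes. Each waiter spins locally on its own cache line, so a release invalidates only the successor's copy instead of broadcasting to all spinners.

    #include <stdatomic.h>

    #define MAXP 64   /* assumed upper bound on concurrent contenders */

    /* Array-based queue lock (Anderson-style): each waiter spins on its
     * own cache-line-aligned flag, so a release touches (and hence
     * invalidates) only the successor's line, avoiding the invalidation
     * burst of a plain test-and-set lock. */
    typedef struct {
        _Atomic unsigned next_ticket;
        struct { _Alignas(64) _Atomic int go; } slot[MAXP];
    } qlock_t;

    void qlock_init(qlock_t *l) {
        atomic_init(&l->next_ticket, 0);
        for (int i = 0; i < MAXP; i++) atomic_init(&l->slot[i].go, 0);
        atomic_store(&l->slot[0].go, 1);       /* first ticket may enter */
    }

    unsigned qlock_acquire(qlock_t *l) {
        unsigned my = atomic_fetch_add(&l->next_ticket, 1) % MAXP;
        while (!atomic_load_explicit(&l->slot[my].go, memory_order_acquire))
            ;                                   /* spin on a private line */
        atomic_store_explicit(&l->slot[my].go, 0, memory_order_relaxed);
        return my;                              /* caller passes to release */
    }

    void qlock_release(qlock_t *l, unsigned my) {
        atomic_store_explicit(&l->slot[(my + 1) % MAXP].go, 1,
                              memory_order_release);   /* FIFO hand-off */
    }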
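
And a tiny pthread sketch of the migratory pattern mentioned above (entirely illustrative): balance is read and then written by one thread after another, so its cache line migrates in modified state from owner to owner, and every miss needs an intervention at the previous owner because memory's copy is stale.

    #include <pthread.h>

    long balance;                       /* migratory datum */
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread performs a read-modify-write under the lock; the line
     * holding 'balance' migrates from the last writer's cache to this
     * one, requiring a cache-to-cache intervention on every miss. */
    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&m);
            balance++;                  /* read, then write: migratory */
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        return (int)(balance != 400000);
    }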
Directory organization
- How to find the source of directory information
- Centralized: just access the directory (bandwidth limited)
- Distributed: a flat scheme distributes the directory with memory; every cache line has a home node where its memory and its directory entry reside
- Hierarchical scheme: organizes the processors as the leaves of a logical tree (need not be binary), with each internal node storing the directory entries for the memory lines local to its children; an entry essentially tells which of the node's child subtrees are caching the line and whether some subtree outside its children is also caching it; finding the directory entry of a cache line involves traversing the tree upward until the entry is found, and since inclusion is maintained between directory nodes at levels k and k+1 (the root being at the highest level), the worst case goes all the way to the root (a lookup sketch appears at the end of this section)
- Format of a directory entry
- Varies a lot: no specific rule
- Memory-based scheme: the directory entry is co-located with the memory line in the home node; various organizations can be used, the most popular being a simple bit vector of presence bits (with a 128-byte line, i.e., 1024 bits, the storage overhead is about 6.35% for 64 nodes, 25% for 256 nodes, and 100% for 1024 nodes); clearly this does not scale with P (more later; a sketch of such an entry appears at the end of this section)
- Cache-based scheme: organize the directory as a distributed linked list in which the sharer nodes form a chain; each cache tag is extended to hold a node number, and the home node knows only the id of the first sharer. On a read miss the requester adds itself at the head of the list (involving the home node and the first sharer); on a write miss the list is traversed and each sharer invalidated in turn, an essentially serialized chain of messages. Advantages: contention is distributed, the home node does not become a hot-spot, and the storage overhead per line is fixed; but the protocol is very complex (standardized as IEEE SCI; a sketch appears at the end of this section)
- A lot of research has been done to reduce directory storage overhead
- The trade-off is between precision of information and performance
- The usual trick is to keep a superset of the sharing information, e.g., group every two sharers into a cluster and keep one bit per cluster; this may lead to one useless invalidation per cluster (see the coarse-vector sketch at the end of this section)
- We will explore this in detail later
- The memory-based bit-vector scheme is very popular: invalidations can be overlapped or multicast
- Cache-based schemes incur a serialized message chain for invalidation
- Hierarchical schemes are not used much due to high latency and volume of messages (up and down the tree); also, the root may become a bandwidth bottleneck
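
The sketches below flesh out several of the organizations above; all identifiers, constants, and helpers are illustrative assumptions, not taken from any real machine. First, the hierarchical lookup: climb from the requester's leaf toward the root until some directory node has an entry for the line (worst case, the root).

    #include <stdio.h>
    #include <stddef.h>

    /* Toy hierarchical directory node: real nodes would track many lines
     * and which child subtrees (plus any outside subtree) cache each. */
    typedef struct dirnode {
        struct dirnode *parent;
        unsigned long   line;            /* toy: one tracked line per node */
        int             valid;
    } dirnode_t;

    /* Walk upward until an entry is found; inclusion between levels k
     * and k+1 guarantees the first hit on the way up is sufficient. */
    dirnode_t *find_directory(dirnode_t *leaf, unsigned long line) {
        for (dirnode_t *n = leaf; n != NULL; n = n->parent)
            if (n->valid && n->line == line)
                return n;
        return NULL;                     /* even the root has no entry */
    }

    int main(void) {
        dirnode_t root = { NULL, 0x80, 1 };
        dirnode_t mid  = { &root, 0, 0 };
        dirnode_t leaf = { &mid,  0, 0 };
        printf("entry found at %s\n",
               find_directory(&leaf, 0x80) == &root ? "the root (worst case)"
                                                    : "a lower level");
        return 0;
    }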
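
Next, a memory-based bit-vector entry together with the storage arithmetic; my reading of the 6.35% figure is 64 presence bits plus one state bit over a 1024-bit line (65/1024), which is also consistent with the rounded 25% and 100% figures for 256 and 1024 nodes.

    #include <stdint.h>
    #include <stdio.h>

    #define NODES      64               /* P, illustrative */
    #define LINE_BYTES 128              /* line size from the notes */

    /* One entry per memory line at the home: a presence bit per node
     * plus a dirty bit; overhead grows linearly with P. */
    typedef struct {
        uint64_t presence;              /* bit i set => node i caches line */
        uint8_t  dirty;                 /* one node holds it exclusively */
    } dir_entry_t;

    static void send_inval(int node) {  /* stub for a network message */
        printf("inval -> node %d\n", node);
    }

    /* On a write, invalidations go to every set presence bit; they are
     * independent, so hardware can overlap or multicast them. */
    void invalidate_sharers(dir_entry_t *e, int writer) {
        uint64_t v = e->presence & ~(1ULL << writer);
        for (int n = 0; v; n++, v >>= 1)
            if (v & 1) send_inval(n);
        e->presence = 1ULL << writer;
        e->dirty = 1;
    }

    int main(void) {
        printf("overhead = %.2f%%\n",   /* prints 6.35 for P = 64 */
               100.0 * (NODES + 1) / (LINE_BYTES * 8));
        dir_entry_t e = { (1ULL << 3) | (1ULL << 9), 0 };
        invalidate_sharers(&e, 9);      /* node 9 writes: invalidate node 3 */
        return 0;
    }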
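
Then the cache-based (SCI-style) chain; the real SCI protocol handles far more races and states, so this shows only the list structure and the serialized invalidation walk.

    #include <stdio.h>

    #define NIL (-1)

    /* Home knows only the head sharer; each sharer's extended cache tag
     * stores the id of the next sharer, forming a distributed list. */
    typedef struct { int head; } home_entry_t;
    typedef struct { int next; } cache_tag_t;

    cache_tag_t tag[64];                    /* tag[i]: node i's copy */

    /* Read miss: requester links in at the head; only the home and the
     * old head are involved, so contention is spread across sharers. */
    void read_miss(home_entry_t *h, int requester) {
        tag[requester].next = h->head;
        h->head = requester;
    }

    /* Write miss: walk the chain, invalidating sharer after sharer --
     * an inherently serialized chain of messages. */
    void write_miss(home_entry_t *h, int writer) {
        for (int n = h->head; n != NIL; ) {
            int next = tag[n].next;
            if (n != writer) printf("inval -> node %d\n", n);
            tag[n].next = NIL;
            n = next;
        }
        h->head = writer;
    }

    int main(void) {
        home_entry_t h = { NIL };
        read_miss(&h, 3); read_miss(&h, 7); read_miss(&h, 5);
        write_miss(&h, 5);                  /* invalidates 7, then 3 */
        return 0;
    }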
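
Finally, the coarse (clustered) vector from the "superset of information" bullet: with two sharers per cluster bit, an invalidation must go to both nodes of every marked cluster, hence up to one useless invalidation per cluster.

    #include <stdint.h>
    #include <stdio.h>

    #define NODES        64
    #define CLUSTER_SIZE 2                  /* sharers per cluster bit */

    /* NODES / CLUSTER_SIZE bits instead of NODES: half the storage, at
     * the cost of remembering only which cluster a sharer lives in. */
    typedef struct { uint32_t cbits; } coarse_entry_t;

    void add_sharer(coarse_entry_t *e, int node) {
        e->cbits |= 1u << (node / CLUSTER_SIZE);
    }

    /* Every node of every marked cluster is invalidated, including nodes
     * that never cached the line: the price of the imprecise superset. */
    void invalidate(coarse_entry_t *e) {
        for (int c = 0; c < NODES / CLUSTER_SIZE; c++)
            if (e->cbits & (1u << c))
                for (int i = 0; i < CLUSTER_SIZE; i++)
                    printf("inval -> node %d\n", c * CLUSTER_SIZE + i);
        e->cbits = 0;
    }

    int main(void) {
        coarse_entry_t e = { 0 };
        add_sharer(&e, 10);                 /* only node 10 is a sharer... */
        invalidate(&e);                     /* ...but 10 and 11 get invals */
        return 0;
    }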