Starvation
- NACKs can cause starvation: a requester that keeps retrying may be NACKed every time while other requesters get served
- Build a FIFO list of waiters, either in home memory (Chaudhuri and Heinrich, 2004) or as a distributed linked list (IEEE Scalable Coherent Interface)
- The former imposes large occupancy on the home, yet offers better performance through read combining
- The latter is an extremely complex protocol with a large number of transient states and 29 stable states, but it does distribute the occupancy across the system
- Origin 2000 devotes extra bits in the directory to raise priority of requests NACKed too many times (above a threshold)
- Insert a delay between retries
- Use Alpha GS320 protocol (will discuss later)
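As a rough illustration of two of the remedies above, the following sketch models a home-side FIFO list of waiters and a NACK-count priority check. It is a toy model, not the Chaudhuri and Heinrich or Origin 2000 design: the names `HomeLine`, `is_high_priority`, and the default threshold of 3 are assumptions made only for this example.

```python
from collections import deque


class HomeLine:
    """Home-side state for one cache line (hypothetical toy model)."""

    def __init__(self):
        self.busy = False        # True while a transaction on this line is in flight
        self.waiters = deque()   # FIFO list of requesters waiting for the line

    def request(self, node):
        """Return True if the request is served now, False if it is queued."""
        if self.busy:
            # Instead of NACKing, remember the requester in arrival order.
            self.waiters.append(node)
            return False
        self.busy = True
        return True

    def complete(self):
        """Finish the current transaction and return the next node to serve, if any."""
        if self.waiters:
            # The line stays busy: it is handed directly to the oldest waiter.
            return self.waiters.popleft()
        self.busy = False
        return None


def is_high_priority(nack_count, threshold=3):
    """Priority escalation: a request NACKed more than `threshold` times is promoted.

    The threshold value is an assumption for illustration only.
    """
    return nack_count > threshold
```

Because waiters are served strictly in arrival order, no requester can be overtaken indefinitely, which is the property that plain NACK-and-retry lacks.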
Overflow schemes
- How to make the directory size independent of the number of processors?
- Basic idea is to use a bit vector scheme as long as the total number of sharers does not exceed the directory entry width
- When the number of sharers overflows, the hardware resorts to an “overflow scheme”
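- Dir_iB: i sharer bits, broadcast invalidation on overflow
- Dir_iNB: no broadcast; on overflow pick one existing sharer and invalidate it to make room for the new one
- Dir_iCV: coarse vector; on overflow assign one bit to a group of nodes of size P/i and send invalidations to every node of each marked group on a write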
- May generate useless invalidations
- Dir_iDP (Stanford FLASH)
- DP stands for dynamic pointer
- Allocate directory entries from a free list pool maintained in memory
- Need replacement hints
- Still may run into reclamation mode if free list pool is not sized properly at boot time
- If replacement hints are not supported, assume k sharers on average per cache block (k=8 is found to be good)
- Reclamation algorithms?
- Pick a random cache line and invalidate it
- Dir_iSW (MIT Alewife)
- Trap to software on overflow
- Software maintains the information about the overflowed sharers
- MIT Alewife has a directory entry with five pointers and a local bit (i.e., the overflow threshold is five or six sharers)
- Remote read before overflow takes 40 cycles and after overflow takes 425 cycles
- Five invalidations take 84 cycles while six invalidations take 707 cycles
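The B, NB, and CV overflow policies described above can be compared with a small sketch. This is an illustrative model only: the class name `LimitedPointerEntry`, the choice of the oldest pointer as the NB victim, and the assumption that i divides P evenly are not taken from any real machine.

```python
class LimitedPointerEntry:
    """Directory entry with i exact sharer pointers and an overflow policy."""

    def __init__(self, i, num_nodes, policy):
        assert policy in ("B", "NB", "CV")
        self.i = i
        self.num_nodes = num_nodes
        self.policy = policy
        self.group_size = num_nodes // i  # P/i nodes per coarse bit (assumes i divides P)
        self.pointers = []                # exact sharer ids while not overflowed
        self.overflowed = False
        self.coarse = set()               # CV only: group ids whose bit is set

    def add_sharer(self, node):
        """Record a new sharer; return the nodes that must be invalidated right now."""
        if self.overflowed:
            if self.policy == "CV":
                self.coarse.add(node // self.group_size)
            return []
        if node in self.pointers:
            return []
        if len(self.pointers) < self.i:
            self.pointers.append(node)
            return []
        if self.policy == "NB":
            # No broadcast: evict one sharer (here the oldest, an assumption).
            victim = self.pointers.pop(0)
            self.pointers.append(node)
            return [victim]
        self.overflowed = True
        if self.policy == "CV":
            self.coarse = {n // self.group_size for n in self.pointers + [node]}
        self.pointers = []
        return []

    def invalidation_targets(self):
        """Nodes that receive an invalidation on a write (writer not excluded here)."""
        if not self.overflowed:
            return sorted(self.pointers)
        if self.policy == "B":
            return list(range(self.num_nodes))
        return [n for n in range(self.num_nodes)
                if n // self.group_size in self.coarse]


def demo(policy):
    """Add five sharers to a 4-pointer entry in a 16-node system."""
    entry = LimitedPointerEntry(i=4, num_nodes=16, policy=policy)
    evicted = []
    for node in (0, 1, 2, 3, 9):
        evicted += entry.add_sharer(node)
    return evicted, entry.invalidation_targets()
```

With five sharers and four pointers, `demo("B")` invalidates all 16 nodes on a write, `demo("NB")` invalidates node 0 early and keeps four exact sharers, and `demo("CV")` targets the eight nodes of groups 0 and 2, three of which (8, 10, 11) never held the line. Those three are the useless invalidations mentioned above.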
Sparse directory
- How to reduce the height of the directory (i.e., the number of directory entries)?
- Observation: total number of cache blocks in all processors is far less than total number of memory blocks
- Assume a 32 MB L3 cache and 4 GB memory: less than 1% of directory entries are active at any point in time
- Idea is to organize the directory as a highly associative cache
- On a directory entry “eviction”, send invalidations to all sharers, or retrieve the line if it is dirty
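The sizing argument and the eviction rule above can be sketched as follows. This is a minimal model under stated assumptions: the class name `SparseDirectory`, the LRU replacement policy, and the modulo set index are choices made for illustration, and the dirty-line retrieval path is omitted.

```python
from collections import OrderedDict

# Sizing from the example above: 32 MB of cache against 4 GB of memory.
# With equal block sizes this is also the fraction of directory entries in use.
ACTIVE_FRACTION = (32 * 2**20) / (4 * 2**30)   # 0.0078125, i.e. about 0.78%


class SparseDirectory:
    """Directory organized as a set-associative cache of entries (toy model)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set: block address -> set of sharer node ids,
        # kept in least-recently-used-first order.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def add_sharer(self, block, node):
        """Record `node` as a sharer of `block`.

        Returns (victim_block, victim_sharers) if an entry had to be evicted,
        else None. The caller must invalidate every victim sharer.
        """
        entries = self.sets[block % self.num_sets]
        evicted = None
        if block in entries:
            entries.move_to_end(block)           # mark as most recently used
        else:
            if len(entries) >= self.ways:
                victim, sharers = entries.popitem(last=False)  # evict LRU entry
                evicted = (victim, sorted(sharers))
            entries[block] = set()
        entries[block].add(node)
        return evicted
```

For example, with `SparseDirectory(num_sets=2, ways=2)`, blocks 0, 2, and 4 all map to set 0, so tracking a third block there forces an eviction, and the sharers of the evicted entry must be invalidated even though their caches had room for the line. High associativity keeps such conflict evictions rare.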