Directory overhead
- Quadratic in the number of processors for a full bitvector directory
- Assume P processors, each with M amount of local memory (i.e. total shared memory size is M*P)
- Let the coherence granularity (cache block size) be B
- Number of cache blocks per node = M/B = number of directory entries per node
- Size of one directory entry = P + O(1)
- Total size of directory memory across all processors = (M/B)(P + O(1))·P bits = O(P²)
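The quadratic growth above can be checked numerically. The sketch below uses illustrative parameter values (not from the notes) and a hypothetical `s` state bits per entry for the O(1) term:

```python
# Sketch: directory storage overhead for a full-bitvector directory.
# P processors, M bytes of local memory per node, B-byte cache blocks,
# s state bits per entry (the O(1) term). Values are illustrative.
def directory_bits(P, M, B, s=2):
    entries_per_node = M // B          # one directory entry per cache block
    entry_bits = P + s                 # P-bit sharer vector + O(1) state bits
    return entries_per_node * entry_bits * P  # summed over all P nodes

# Doubling P roughly quadruples total directory storage, confirming O(P^2):
small = directory_bits(P=64,  M=2**30, B=64)
large = directory_bits(P=128, M=2**30, B=64)
print(large / small)  # ≈ 3.94, close to the 4x predicted by pure O(P^2)
```

The ratio is slightly below 4 only because the constant state bits grow linearly, not quadratically, with P.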
Path of a read miss
- Assume that the line is not shared by anyone
- Load issues from load queue (for data) or fetcher accesses icache; looks up TLB and gets PA
- Misses in L1, L2, L3,… caches
- Launches address and request type on system bus
- The request gets queued in memory controller and registered in OTT or TTT (Outstanding Transaction Table or Transactions in Transit Table)
- Memory controller eventually schedules the request
- Decodes home node from upper few bits of address
- Local home: access directory and data memory (how?)
- Remote home: request gets queued in network interface
- From NI onward
- Eventually the request gets forwarded to the router and through the network to the home
- At the home the request gets queued in the NI and waits to be scheduled by the home memory controller
- After it is scheduled, the home memory controller looks up the directory and data memory
- Reply returns through the same path
- Total time (using a LogP-style cost model with memory latency m)
- Local home: max(k_ho, m)
- Remote home: k_ro + g_{h+a} + Nℓ + g_{h+a} + max(k_ho, m) + g_{h+a+d} + Nℓ + g_{h+a+d} + k_ro
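The two latency expressions can be evaluated directly. In the sketch below all cycle counts are hypothetical: k_ro and k_ho are the requester- and home-side controller occupancies, g_ha and g_had are the gaps for a header+address message and a header+address+data message, Nℓ is the network traversal time, and m is the memory latency:

```python
# Sketch of the LogP-style read-miss latency model. All parameter values
# below are illustrative assumptions, not measurements from the notes.
def local_miss(k_ho, m):
    # Directory and data memory are accessed concurrently at the local home.
    return max(k_ho, m)

def remote_miss(k_ro, k_ho, m, g_ha, g_had, N, l):
    return (k_ro + g_ha + N * l + g_ha        # request: requester -> home
            + max(k_ho, m)                    # directory + memory lookup at home
            + g_had + N * l + g_had + k_ro)   # reply (carries data): home -> requester

# With illustrative cycle counts, the remote path dwarfs the local one:
print(local_miss(k_ho=20, m=100))                            # 100
print(remote_miss(20, 20, 100, g_ha=4, g_had=12, N=8, l=5))  # 252
```

Note that the reply gaps (g_had) exceed the request gaps (g_ha) because the reply carries the cache block in addition to the header and address.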
Correctness issues
- Serialization to a location
- Schedule order at home
- Use NACKs (extra traffic and livelock risk) or smarter techniques (back-off, NACK-free protocols)
- Flow control deadlock
- Avoid buffer dependence cycles
- Avoid network queue dependence cycles
- Virtual networks multiplexed on physical networks
- Coherence protocol dictates the virtual network usage
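One common way the protocol dictates virtual network usage is to assign each message class to its own virtual network so the queue dependence relation is acyclic. The sketch below uses illustrative class names (not from a specific machine):

```python
# Sketch: message classes mapped to virtual networks so the queue
# dependence graph stays acyclic. Class names are hypothetical.
VIRTUAL_NETWORK = {
    "request":      0,  # e.g., read/upgrade requests toward the home
    "intervention": 1,  # requests forwarded from the home to an owner
    "reply":        2,  # data/ack replies; must always be able to drain
}

def can_wait_on(sender_class, blocked_class):
    # A message may only block on a strictly higher-numbered virtual
    # network, so no cycle of full queues can form: replies always drain.
    return VIRTUAL_NETWORK[sender_class] < VIRTUAL_NETWORK[blocked_class]

print(can_wait_on("request", "reply"))   # True: a request may wait on replies
print(can_wait_on("reply", "request"))   # False: a reply never waits on requests
```

Because a full request queue can never block the reply network, the flow-control deadlock cycle described above cannot close.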