Directory overhead
- Quadratic in the number of processors for a full bitvector directory
- Assume P processors, each with M amount of local memory (i.e. total shared memory size is M*P)
- Let the coherence granularity (cache block size) be B
- Number of cache blocks per node = M/B = number of directory entries per node
- Size of one directory entry = P + O(1)
- Total size of directory memory across all processors = (M/B)(P + O(1))·P bits = O(P²)
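The quadratic growth above can be checked numerically. The sketch below uses illustrative parameter values (not from the notes) and a hypothetical `s` state bits per entry for the O(1) term:

```python
# Sketch: directory storage overhead for a full-bitvector directory.
# P processors, M bytes of local memory per node, B-byte cache blocks,
# s state bits per entry (the O(1) term). Values are illustrative.
def directory_bits(P, M, B, s=2):
    entries_per_node = M // B          # one directory entry per cache block
    entry_bits = P + s                 # P-bit sharer vector + O(1) state bits
    return entries_per_node * entry_bits * P  # summed over all P nodes

# Doubling P roughly quadruples total directory storage, confirming O(P^2):
small = directory_bits(P=64,  M=2**30, B=64)
large = directory_bits(P=128, M=2**30, B=64)
print(large / small)  # ≈ 3.94, close to the 4x predicted by pure O(P^2)
```

The ratio is slightly below 4 only because the constant state bits grow linearly, not quadratically, with P.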
Path of a read miss
- Assume that the line is not shared by anyone
- Load issues from load queue (for data) or fetcher accesses icache; looks up TLB and gets PA
- Misses in L1, L2, L3,… caches
- Launches address and request type on system bus
- The request gets queued in memory controller and registered in OTT or TTT (Outstanding Transaction Table or Transactions in Transit Table)
- Memory controller eventually schedules the request
- Decodes home node from upper few bits of address
- Local home: access directory and data memory (how?)
- Remote home: request gets queued in network interface
- From NI onward
- Eventually the request gets forwarded to the router and through the network to the home
- At the home the request gets queued in the NI and waits to be scheduled by the home memory controller
- After it is scheduled, the home memory controller looks up the directory and data memory
- Reply returns through the same path
- Total time (using a LogP-style cost model with memory latency m)
- Local home: max(k_ho, m)
- Remote home: k_ro + g_{h+a} + Nℓ + g_{h+a} + max(k_ho, m) + g_{h+a+d} + Nℓ + g_{h+a+d} + k_ro
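The two latency expressions can be evaluated directly. In the sketch below all cycle counts are hypothetical: k_ro and k_ho are the requester- and home-side controller occupancies, g_ha and g_had are the gaps for a header+address message and a header+address+data message, Nℓ is the network traversal time, and m is the memory latency:

```python
# Sketch of the LogP-style read-miss latency model. All parameter values
# below are illustrative assumptions, not measurements from the notes.
def local_miss(k_ho, m):
    # Directory and data memory are accessed concurrently at the local home.
    return max(k_ho, m)

def remote_miss(k_ro, k_ho, m, g_ha, g_had, N, l):
    return (k_ro + g_ha + N * l + g_ha        # request: requester -> home
            + max(k_ho, m)                    # directory + memory lookup at home
            + g_had + N * l + g_had + k_ro)   # reply (carries data): home -> requester

# With illustrative cycle counts, the remote path dwarfs the local one:
print(local_miss(k_ho=20, m=100))                            # 100
print(remote_miss(20, 20, 100, g_ha=4, g_had=12, N=8, l=5))  # 252
```

Note that the reply gaps (g_had) exceed the request gaps (g_ha) because the reply carries the cache block in addition to the header and address.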
Correctness issues
- Serialization to a location
- Schedule order at home
- Use NACKs (extra traffic and livelock risk) or smarter techniques (back-off, NACK-free protocols)
- Flow control deadlock
- Avoid buffer dependence cycles
- Avoid network queue dependence cycles
- Virtual networks multiplexed on physical networks
- Coherence protocol dictates the virtual network usage
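One common way the protocol dictates virtual network usage is to assign each message class to its own virtual network so the queue dependence relation is acyclic. The sketch below uses illustrative class names (not from a specific machine):

```python
# Sketch: message classes mapped to virtual networks so the queue
# dependence graph stays acyclic. Class names are hypothetical.
VIRTUAL_NETWORK = {
    "request":      0,  # e.g., read/upgrade requests toward the home
    "intervention": 1,  # requests forwarded from the home to an owner
    "reply":        2,  # data/ack replies; must always be able to drain
}

def can_wait_on(sender_class, blocked_class):
    # A message may only block on a strictly higher-numbered virtual
    # network, so no cycle of full queues can form: replies always drain.
    return VIRTUAL_NETWORK[sender_class] < VIRTUAL_NETWORK[blocked_class]

print(can_wait_on("request", "reply"))   # True: a request may wait on replies
print(can_wait_on("reply", "request"))   # False: a reply never waits on requests
```

Because a full request queue can never block the reply network, the flow-control deadlock cycle described above cannot close.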