Module 10: "Design of Shared Memory Multiprocessors"
  Lecture 20: "Performance of Coherence Protocols"
 

Large cache line

  • Large cache lines are intended to amortize the DRAM access and bus transfer latency over a large number of data points
  • But false sharing becomes a problem
  • Hardware solutions
    • Coherence at subblock level: divide the cache line into smaller blocks and maintain coherence for each of them; subblock invalidation on a write reduces chances of coherence misses even in the presence of false sharing
    • Delay invalidations: send invalidations only after the writer has completed several writes; but this directly impacts the write propagation model and hence leads to consistency models weaker than SC
    • Use update-based protocols instead of invalidation-based: probably not a good idea
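The effect of subblock coherence on false sharing can be illustrated with a small simulation. This is a sketch, not lecture material: the trace, the block sizes, and the `coherence_misses` helper are all illustrative. Two processors repeatedly write disjoint words of the same 64-byte line; with whole-line invalidation each write invalidates the other processor's copy, while with 16-byte subblocks the writes land in different subblocks and no coherence misses occur.

```python
def coherence_misses(refs, grain):
    """Count coherence misses for an access trace under a given coherence grain.

    refs  : list of (cpu, address, is_write) tuples
    grain : coherence unit in bytes (whole line vs. subblock)
    """
    holders = {}   # block id -> set of CPUs holding a valid copy
    seen = set()   # (block, cpu) pairs already touched, to skip cold misses
    misses = 0
    for cpu, addr, is_write in refs:
        blk = addr // grain
        h = holders.setdefault(blk, set())
        if cpu not in h and (blk, cpu) in seen:
            misses += 1            # had the block before, lost it to an invalidation
        h.add(cpu)
        seen.add((blk, cpu))
        if is_write:
            holders[blk] = {cpu}   # writer invalidates every other copy
    return misses

# False sharing: CPU 0 writes byte 0, CPU 1 writes byte 32 of the same line
refs = []
for _ in range(10):
    refs.append((0, 0, True))
    refs.append((1, 32, True))

line_misses = coherence_misses(refs, 64)   # coherence at whole-line granularity
sub_misses = coherence_misses(refs, 16)    # coherence at 16-byte subblock granularity
```

Here `line_misses` is 18 (every access after the first round misses) while `sub_misses` is 0: the two words fall into different subblocks, so the false sharing disappears.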

Performance of update protocol

  • Already discussed main trade-offs
  • Consider running a sequential program on an SMP with update protocol
    • If the kernel decides to migrate the process to a different processor, subsequent updates will keep going to caches that are never read again: the “pack-rat” phenomenon
  • Possible designs that combine update and invalidation-based protocols
    • For each page, decide which protocol to run and make the choice part of the address translation (i.e. hold it in the TLB entry)
    • Alternatively, dynamically detect for each cache line which protocol performs better
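The page-granularity option can be sketched as a lookup table keyed by virtual page number, as it would be cached alongside the translation in a TLB entry. Everything here is illustrative: the table name, the page size, and the marked page are assumptions, not part of the lecture.

```python
PAGE = 4096  # assumed page size in bytes

# Hypothetical per-page protocol table; in hardware this bit would travel
# with the translation and be cached in the TLB entry.
protocol_for_page = {}   # virtual page number -> "update" or "invalidate"

def choose_protocol(addr, default="invalidate"):
    """Return the coherence protocol to use for an access to addr."""
    return protocol_for_page.get(addr // PAGE, default)

# Example: mark one page (say, a producer-consumer buffer) as update-based
protocol_for_page[0x10] = "update"
```

Accesses to the marked page then run under the update protocol, and everything else falls back to invalidation.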

Hybrid inval+update

  • One possible hybrid protocol
    • Keep a counter per cache line and make Dragon update protocol the default
    • Every time the local processor accesses a cache line set its counter to some pre-defined threshold k
    • On each received update decrease the counter by one
    • When the counter reaches zero, the line is invalidated locally, in the hope that the writer will eventually switch from the Sm state to the M state once no sharers are left
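The counter mechanism above can be sketched as follows. This is a minimal model of one cache's view of a single line; the threshold value `K = 4` and the class and method names are illustrative assumptions.

```python
K = 4  # hypothetical threshold k

class HybridLine:
    """One cache's view of a line under the counter-based hybrid protocol."""
    def __init__(self):
        self.valid = False
        self.counter = 0

    def local_access(self):
        # Any local access refreshes the counter to the threshold k
        self.valid = True
        self.counter = K

    def remote_update(self):
        # Each update received from another writer decrements the counter;
        # at zero the line is invalidated locally, so the writer eventually
        # finds no sharers and can move from Sm to M
        if self.valid:
            self.counter -= 1
            if self.counter == 0:
                self.valid = False

line = HybridLine()
line.local_access()          # counter = K, line valid
for _ in range(K):
    line.remote_update()     # K updates with no local access in between
```

After `K` consecutive received updates with no intervening local access, the line drops out of the sharing set on its own, approximating invalidation behavior for lines this cache no longer uses.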

Update-based protocol

  • Update-based protocols tend to increase capacity misses slightly
    • Cache lines stay longer in the cache compared to an invalidation-based protocol; why? Updates keep the lines alive instead of invalidating them, so they continue to occupy capacity even after the local processor is done with them
  • Update-based protocols can significantly reduce coherence misses
    • True sharing misses definitely go down
    • False sharing misses also decrease due to absence of invalidations
  • But update-based protocols significantly increase bus bandwidth demand
    • This increases bus contention and delays other transactions
    • Possible to delay the updates by merging a number of them in a buffer
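The last point can be sketched as a small merge buffer that coalesces pending word updates to the same line before issuing a single bus update. This is an illustrative model only: the class name, the capacity, and the 64-byte line size are assumptions.

```python
LINE = 64  # assumed cache line size in bytes

class UpdateMergeBuffer:
    """Coalesce pending word updates to the same line into one bus update."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pending = {}     # line address -> {byte offset: value}
        self.bus_updates = 0  # update transactions actually put on the bus

    def write(self, addr, value):
        # Merge this word into any pending update for the same line
        line, off = addr - addr % LINE, addr % LINE
        self.pending.setdefault(line, {})[off] = value
        if len(self.pending) > self.capacity:
            self.flush()

    def flush(self):
        # One bus update per line, however many words were merged into it
        self.bus_updates += len(self.pending)
        self.pending.clear()

buf = UpdateMergeBuffer()
for off in range(0, 64, 8):   # eight word writes, all to the same line
    buf.write(off, off)
buf.flush()
```

Without merging, the eight writes would each generate a bus update; with the buffer they collapse into one, which is exactly how delaying updates relieves bus bandwidth pressure.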

Shared cache

  • Advantages
    • If there is only one level of cache, no coherence protocol is needed
    • Very fine-grained sharing resulting in fast cross-thread communication
    • No false sharing
    • Smaller total cache capacity requirement: the threads’ working sets overlap
    • One processor’s fetched cache line can be used by others: prefetch effect
  • Disadvantages
    • High cache bandwidth requirement and port contention
    • Destructive interference and conflict misses
  • Will revisit this when discussing chip multiprocessing and hyper-threading