Large cache line
- Large cache lines are intended to amortize the DRAM access and bus transfer latency over a large number of data points
- But false sharing becomes a problem
- Hardware solutions
- Coherence at subblock level: divide the cache line into smaller blocks and maintain coherence state for each of them; invalidating only the written subblock reduces the chances of coherence misses even in the presence of false sharing
- Delay invalidations: send invalidations only after the writer has completed several writes; but this directly changes the write propagation model and hence leads to consistency models weaker than sequential consistency (SC)
- Use update-based protocols instead of invalidation-based: probably not a good idea
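The false sharing problem above can be demonstrated directly. The sketch below, assuming a 64-byte cache line and POSIX threads, packs two logically private counters either into the same line or into separate lines; the struct and function names are illustrative, not from the original notes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define ITERS 5000000UL

/* Two per-thread counters packed into one cache line: each thread writes
 * only its own counter, yet every write forces a coherence action on the
 * other core's copy of the whole line -- false sharing. */
struct unpadded { volatile uint64_t a, b; };

/* Padding out to a 64-byte line (an assumed, typical line size) puts the
 * counters in different lines and eliminates the false sharing. */
struct padded {
    volatile uint64_t a;
    char pad[64 - sizeof(uint64_t)];
    volatile uint64_t b;
};

static void *bump(void *p) {
    volatile uint64_t *c = p;
    for (uint64_t i = 0; i < ITERS; i++) (*c)++;
    return NULL;
}

/* Run two threads, one incrementing each counter. */
void run_pair(volatile uint64_t *a, volatile uint64_t *b) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, (void *)a);
    pthread_create(&t2, NULL, bump, (void *)b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}
```

Timing `run_pair` on the unpadded vs. the padded struct (compile with `-pthread`) typically shows the unpadded version running several times slower on a multicore machine, even though the final counter values are identical.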
Performance of update protocol
- Already discussed main trade-offs
- Consider running a sequential program on an SMP with update protocol
- If the kernel decides to migrate the process to a different processor, subsequent updates will go to the old processor’s cache, whose copies are never read again: the “pack-rat” phenomenon
- Possible designs that combine update and invalidation-based protocols
- For each page, decide which protocol to run and make the choice part of the translation (i.e., hold it in the TLB)
- Otherwise, dynamically detect for each cache line which protocol works best
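The per-page option above can be pictured as one extra bit in the translation. The following sketch of a TLB entry carrying a protocol flag is purely illustrative; the type and field names are assumptions, not an actual hardware definition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-page protocol choice, stored in the translation. */
typedef enum { PROTO_INVALIDATE = 0, PROTO_UPDATE = 1 } proto_t;

/* Illustrative TLB entry extended with a one-bit protocol field. */
typedef struct {
    uintptr_t vpn;        /* virtual page number */
    uintptr_t pfn;        /* physical frame number */
    unsigned  valid : 1;
    unsigned  proto : 1;  /* which coherence protocol to run for this page */
} tlb_entry_t;

/* On a write hit, the cache controller consults the translation to decide
 * whether to put an invalidation or an update on the bus. */
proto_t protocol_for(const tlb_entry_t *e) {
    return e->proto ? PROTO_UPDATE : PROTO_INVALIDATE;
}
```

Because the bit travels with the translation, the choice is made per page by software (e.g., the OS or a compiler hint) and costs the hardware essentially nothing on the access path.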
Hybrid inval+update
- One possible hybrid protocol
- Keep a counter per cache line and make Dragon update protocol the default
- Every time the local processor accesses a cache line, set its counter to a pre-defined threshold k
- On each received update, decrease the counter by one
- When the counter reaches zero, the line is locally invalidated; the hope is that once no sharers are left, the writer will transition from the Sm state to the M state and stop broadcasting updates
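The counter mechanism above is simple enough to state as code. This is a minimal sketch of the per-line bookkeeping, assuming a threshold of k = 4; the names and the choice to invalidate on the k-th update (rather than after it) are ours.

```c
#include <assert.h>

#define K 4  /* assumed threshold; a real design would tune this */

/* Per-cache-line bookkeeping for the hybrid protocol sketched above. */
typedef struct { int counter; int valid; } hline_t;

/* Local access: reset the counter, keeping the line in update mode. */
void on_local_access(hline_t *l) { l->valid = 1; l->counter = K; }

/* Remote update received: decrement the counter; when it hits zero,
 * invalidate locally so the writer can eventually move from Sm to M.
 * Returns 1 if the update was applied, 0 if the line is (now) invalid. */
int on_remote_update(hline_t *l) {
    if (!l->valid) return 0;               /* already invalidated: ignore */
    if (--l->counter == 0) { l->valid = 0; return 0; }
    return 1;
}
```

A line that the local processor keeps touching stays in update mode indefinitely, while a line receiving k consecutive remote updates with no local access drops out of the sharer set, which is exactly the behavior that lets the writer eventually reach the M state.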
Update-based protocol
- Update-based protocols tend to increase capacity misses slightly
- Cache lines stay in the cache longer than under an invalidation-based protocol; why? Because updates keep shared copies alive instead of invalidating them, so stale-but-still-valid lines occupy space that could hold more useful data
- Update-based protocols can significantly reduce coherence misses
- True sharing misses definitely go down
- False sharing misses also decrease due to the absence of invalidations
- But update-based protocols significantly increase bus bandwidth demand
- This increases bus contention and delays other transactions
- Possible to delay the updates by merging a number of them in a buffer
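The merge-buffer idea in the last bullet can be sketched as a tiny coalescing buffer: successive writes to the same line fold into one pending bus update instead of each generating its own transaction. The buffer depth, line size, and function names below are assumptions for illustration, not a real controller design.

```c
#include <assert.h>
#include <stdint.h>

#define LINE  64  /* assumed cache line size in bytes */
#define SLOTS  4  /* assumed buffer depth */

static uintptr_t pending[SLOTS];  /* line numbers with a queued update */
static int       used[SLOTS];
static int       bus_updates;     /* bus transactions actually issued  */

/* Issue one bus update per pending line and empty the buffer. */
void flush_updates(void) {
    for (int i = 0; i < SLOTS; i++)
        if (used[i]) { bus_updates++; used[i] = 0; }
}

/* Record a write: merge with a pending update to the same line if one
 * exists, otherwise allocate a slot, flushing first if the buffer is full. */
void write_word(uintptr_t addr) {
    uintptr_t line = addr / LINE;
    for (int i = 0; i < SLOTS; i++)
        if (used[i] && pending[i] == line) return;   /* merged: no new traffic */
    for (int i = 0; i < SLOTS; i++)
        if (!used[i]) { used[i] = 1; pending[i] = line; return; }
    flush_updates();                                  /* buffer full */
    used[0] = 1; pending[0] = line;
}
```

For a burst of word-sized writes that all land in one or two lines, the buffer issues one bus update per line rather than one per write, trading a small delay in write propagation for a large reduction in bus bandwidth demand.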
Shared cache
- Advantages
- If there is only one level of cache, there is no need for a coherence protocol
- Very fine-grained sharing, resulting in fast cross-thread communication
- No false sharing
- Smaller total cache capacity requirement: the threads’ working sets overlap
- One processor’s fetched cache line can be used by others: prefetch effect
- Disadvantages
- High cache bandwidth requirement and port contention
- Destructive interference and conflict misses
- Will revisit this when discussing chip multiprocessing and hyper-threading