Delayed consistency

Designed especially for update-based protocols.
Delay the updates until the write buffer becomes full, until the next synchronization point, or until some other execution point.
Aims at reducing update traffic: many updates to the same cache line can be merged.
Reasoning about the program may become difficult.
The same thing can be done for invalidation-based protocols: keep stores in the store buffer (a large one is needed) until the store buffer fills up or the next synchronization point is reached; at this point issue these stores to the memory system (i.e. send to home, send invalidations, etc.).
TSO, PSO, PC, WO and RC can take advantage of this (invalidation traffic becomes bursty, but false sharing is reduced).
[Transactional consistency]

Conclusions

Why consistency models? Server designers don’t deal with these notorious issues just for fun.
The problem is that some people feel SC is too restrictive; relaxed models are needed to get good performance, possibly compromising some simplicity on the programmers’ side (of course, you should not invent a consistency model so unintuitive that it becomes unusable).
With pipelining, sufficient buffering (to support many in-flight instructions) and some extra “fix-up” hardware, SC can provide fairly good performance.
Store buffers provide some write latency overlap in SC microprocessors, but the ROB is far too small to overlap the entire invalidation sending and acknowledgment collection phases; TSO, PC and PSO offer better choices.
TSO, PC and PSO allow stores to be removed from the head of the ROB as soon as they issue, thereby unclogging the ROB and allowing subsequent instructions (including loads) to commit, i.e. loads can bypass stores.
In TSO and PC, a remote store at the head of the store buffer may not allow subsequent stores (which may be local and even cache hits) to leave; PSO solves this problem by allowing stores to bypass previous stores.
Soon designers realized that loads are an even bigger problem because they are more frequent than stores on RISC processors, so the capability to bypass loads (and stores) was formalized in RC and WO (taking full advantage of this in hardware is very difficult, though).
Surprisingly, all microprocessors today implement all these re-orderings inside the pipeline; then why not take a little more pain and implement SC?
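
The delayed-consistency idea from the first part of this section (hold stores back, merge those that fall in the same cache line, and release them when the buffer fills or at a synchronization point) can be illustrated with a small software model. This is only a sketch under stated assumptions, not a description of any real hardware: the class `DelayedWriteBuffer`, the 64-byte line size, the four-line capacity and the "one message per flushed line" cost model are all hypothetical choices made for illustration.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed value)


class DelayedWriteBuffer:
    """Toy model of a store buffer that delays and merges writes (hypothetical)."""

    def __init__(self, capacity_lines=4):
        self.capacity_lines = capacity_lines
        self.pending = OrderedDict()  # line number -> {byte offset: value}
        self.memory = {}              # stands in for the home memory
        self.stores_seen = 0          # stores issued by the processor
        self.messages_sent = 0        # update/invalidation messages sent out

    def store(self, addr, value):
        """Buffer a store; flush first if a new line would not fit."""
        self.stores_seen += 1
        line, offset = divmod(addr, LINE_SIZE)
        if line not in self.pending and len(self.pending) == self.capacity_lines:
            self.flush()  # buffer full: release everything held so far
        # Stores to a line already in the buffer are merged into its entry.
        self.pending.setdefault(line, {})[offset] = value

    def flush(self):
        """Send one message per buffered line and apply its merged writes."""
        for line, words in self.pending.items():
            self.messages_sent += 1
            for offset, value in words.items():
                self.memory[line * LINE_SIZE + offset] = value
        self.pending.clear()

    def sync(self):
        """Synchronization point: all delayed stores must become visible."""
        self.flush()


def run_demo():
    buf = DelayedWriteBuffer(capacity_lines=4)
    for i in range(8):
        buf.store(0x1000 + 8 * i, i)  # eight stores, all in one 64-byte line
    buf.store(0x2000, 99)             # one store to a different line
    buf.sync()                        # e.g. a lock release
    return buf.stores_seen, buf.messages_sent


if __name__ == "__main__":
    stores, messages = run_demo()
    print(f"{stores} stores issued, {messages} messages sent")
```

In this model nine stores cost two messages, which is the traffic reduction the section describes. The traffic also arrives in a burst at the synchronization point, and no other processor can observe any of the stores before the flush, which is why reasoning about the program becomes harder.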