Relaxed models

Implementing SC requires complex hardware.
Is there an example that clearly shows the disaster of not implementing all these?
Observe that the cache coherence protocol is orthogonal.
But such violations are rare.
Does it make sense to invest so much time (for verification) and hardware (associative lookup logic in the load queue)?
Many processors today relax the consistency model to get rid of complex hardware and achieve some extra performance, at the cost of making program reasoning complex.
    P0: A=1; B=1; flag=1;
    P1: while (!flag); print A; print B;
SC is too restrictive; relaxing it does not always violate programmers’ intuition.

Three attributes

System specification: which orders are preserved and which are not; if not all program orders are preserved, what support is provided (software and hardware) to enforce a particular order that the programmer wishes.
Programmer’s interface: a set of rules that, if followed, leads to an execution as expected by the programmer; normally specified in terms of high-level language annotations and labels.
Translation mechanism: how to translate the programmer’s annotations to hardware actions.
Let’s take a look at a few relaxed models: TSO, PSO, PC, WO/WC, RC, DC.

Total store ordering

Allows a read to bypass (i.e. commit before) an earlier incomplete write.
This essentially means that a blocked store at the head of the ROB can be removed (but remains in the write buffer) and subsequent instructions are allowed to commit, bypassing the blocked store.
Can hide the latency of write operations.
Note that this is the only allowed re-ordering.
Programmer’s intuition is preserved in most cases, but not always:
    P0: A=1; flag=1;    P1: while (!flag); print A;    [same as SC]
    P0: A=1; B=1;       P1: print B; print A;          [same as SC]
    P0: A=1; print B;   P1: B=1; print A;              [violates SC]
Implemented in many Sun UltraSPARC microprocessors.

How do I enforce SC in the last example if I really care?
This may be needed when porting this program from R10000 to UltraSPARC.
Must ensure that a read cannot bypass earlier writes.
Microprocessors provide “fence” instructions for this purpose.
The SPARC v9 specification provides a MEMBAR (memory barrier) instruction in different flavors.
Here we only need one of these flavors, namely a write-to-read fence placed just before the load instruction.
This fence will not allow the load to graduate until all stores before it graduate.
If a fence instruction is not available, substituting the read by a read-modify-write (e.g., ldstub in SPARC) also works.
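
The write-to-read fence in the last TSO example can be sketched in portable C++ rather than SPARC assembly. This is only an illustration under stated assumptions: the helper names `trial_sees_both_zero` and `count_sc_violations` are hypothetical, relaxed atomic accesses stand in for plain loads and stores, and `std::atomic_thread_fence(std::memory_order_seq_cst)` stands in for the write-to-read flavor of MEMBAR (a compiler would typically emit a stronger or equivalent barrier for it on a given target).

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Shared locations from the last TSO example; both start at 0.
static std::atomic<int> A{0}, B{0};

// Runs one trial of
//   P0: A=1; read B;      P1: B=1; read A;
// If use_fence is true, a full fence (assumed stand-in for a
// write-to-read MEMBAR) separates each store from the load after it.
// Returns true when both reads saw 0, an outcome SC forbids.
bool trial_sees_both_zero(bool use_fence) {
    A.store(0, std::memory_order_relaxed);
    B.store(0, std::memory_order_relaxed);
    int r0 = -1, r1 = -1;

    std::thread p0([&] {
        A.store(1, std::memory_order_relaxed);
        if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
        r0 = B.load(std::memory_order_relaxed);   // "print B"
    });
    std::thread p1([&] {
        B.store(1, std::memory_order_relaxed);
        if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = A.load(std::memory_order_relaxed);   // "print A"
    });
    p0.join();
    p1.join();
    return r0 == 0 && r1 == 0;
}

// Repeats the trial and counts how often the SC-forbidden outcome appears.
int count_sc_violations(int trials, bool use_fence) {
    assert(trials >= 0);
    int violations = 0;
    for (int i = 0; i < trials; ++i) {
        if (trial_sees_both_zero(use_fence)) ++violations;
    }
    return violations;
}
```

With the fences in place, `count_sc_violations(2000, true)` should return 0, because the C++ memory model guarantees that at least one of the two loads observes the other thread's store. Without the fences, `count_sc_violations(2000, false)` may return a non-zero count on hardware that lets a read bypass an earlier write, although a given run is not guaranteed to expose it, since thread start-up cost leaves only a narrow window for the two stores to overlap.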