Module 15: "Memory Consistency Models"
  Lecture 34: "Sequential Consistency and Relaxed Models"
 

Relaxed models

  • Implementing SC requires complex hardware
    • Is there an example that clearly shows what goes wrong if all of this is not implemented?
      • Observe that the cache coherence protocol is orthogonal to this question
    • But such violations are rare
    • Does it make sense to invest so much time (for verification) and hardware (associative lookup logic in the load queue)?
    • Many processors today relax the consistency model to get rid of complex hardware and gain some extra performance, at the cost of making programs harder to reason about
    • Example: P0: A=1; B=1; flag=1; P1: while (!flag); print A; print B;    [the writes to A and B may be reordered with each other without changing what P1 prints]
    • SC is too restrictive; relaxing it does not always violate programmers’ intuition
  • Three attributes
    • System specification: which orders are preserved and which are not; if not all program orders are preserved, what support (software and hardware) is provided to enforce a particular order that the programmer wants
    • Programmer’s interface: a set of rules that, if followed, leads to the execution the programmer expects; normally specified in terms of high-level language annotations and labels
    • Translation mechanism: how to translate programmer’s annotations to hardware actions
  • Let’s take a look at a few relaxed models: TSO, PSO, PC, WO/WC, RC, DC
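
The flag example above can be checked mechanically. The following is a minimal sketch, not a model of any real processor: it enumerates every interleaving of P1 with P0's three stores, performed in whatever order is passed in. The function name `outcomes` and its parameter `p0_stores` are hypothetical names chosen for this illustration, and all variables are assumed to start at 0.

```python
def outcomes(p0_stores):
    """Return every (A, B) pair P1 can print when P0 performs its
    three stores in the order given by p0_stores."""
    results = set()

    def step(mem, done, stage, printed):
        # done:  how many of P0's stores have been performed
        # stage: 0 = P1 spinning on flag, 1 = about to print A,
        #        2 = about to print B, 3 = finished
        if stage == 3:
            results.add(printed)
            return
        if done < len(p0_stores):
            # P0 takes the next step: its next store writes 1.
            step({**mem, p0_stores[done]: 1}, done + 1, stage, printed)
        if stage == 0:
            # A failed spin iteration changes nothing, so only the
            # successful read of flag is modelled as a step.
            if mem["flag"] == 1:
                step(mem, done, 1, printed)
        elif stage == 1:
            step(mem, done, 2, printed + (mem["A"],))
        else:
            step(mem, done, 3, printed + (mem["B"],))

    step({"A": 0, "B": 0, "flag": 0}, 0, 0, ())
    return results


print(outcomes(["A", "B", "flag"]))          # program order: {(1, 1)}
print(outcomes(["B", "A", "flag"]))          # A and B swapped: {(1, 1)}
print(sorted(outcomes(["flag", "A", "B"])))  # flag performed first
```

Swapping the writes to A and B leaves the only possible output unchanged, which is the sense in which SC is too restrictive here. Letting the write to flag be performed first makes all four combinations of 0 and 1 reachable, so that is an order the system specification must let the programmer enforce.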

Total store ordering

  • Allows a read to bypass (i.e. commit before) an earlier incomplete write
    • This essentially means a blocked store at the head of the ROB can be retired from the ROB (it remains in the write buffer), and subsequent instructions are allowed to commit, bypassing the blocked store
    • Can hide latency of write operations
    • Note that this is the only allowed re-ordering
    • Programmer’s intuition is preserved in most cases, but not always
    • P0: A=1; flag=1; P1: while (!flag); print A;    [same as SC]
    • P0: A=1; B=1; P1: print B; print A;    [same as SC]
    • P0: A=1; print B; P1: B=1; print A;  [violates SC: with A and B initially 0, both prints can output 0]
    • Implemented in many Sun UltraSPARC microprocessors
  • How do I enforce SC in the last example if I really care?
    • May be needed when porting this program from an SC machine such as the MIPS R10000 to UltraSPARC
    • Must ensure that a read cannot bypass earlier writes
    • Microprocessors provide “fence” instructions for this purpose
    • The SPARC v9 specification provides a MEMBAR (memory barrier) instruction in several flavors
    • Here we need only one of these flavors, namely a write-to-read fence (MEMBAR #StoreLoad) placed just before the load instruction
    • This fence does not allow the load to graduate until all stores before it have completed, i.e., have left the write buffer
    • If a fence instruction is not available, replacing the read with a read-modify-write (e.g., ldstub in SPARC) also works
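
The three TSO examples and the effect of the fence can be reproduced with a toy model. The sketch below assumes one FIFO write buffer per processor, loads that check the processor's own buffer before memory, and a fence that cannot execute until the buffer is empty. It is an illustration of the rules stated above, not a description of UltraSPARC hardware; the names `tso_outcomes`, `replace`, `unfenced`, `fenced` and `two_writes` are hypothetical, and all variables are assumed to start at 0.

```python
def replace(tup, i, value):
    """Return a copy of tuple tup with element i set to value."""
    return tup[:i] + (value,) + tup[i + 1:]


def tso_outcomes(programs):
    """Enumerate the load results reachable under the toy TSO model.

    programs holds one instruction list per processor; an instruction
    is ("st", var, value), ("ld", var) or ("fence",).
    The result is a set of tuples, one tuple of loaded values per processor.
    """
    variables = sorted({ins[1] for prog in programs
                        for ins in prog if ins[0] != "fence"})
    results, visited = set(), set()

    def explore(mem, bufs, pcs, loads):
        state = (mem, bufs, pcs, loads)
        if state in visited:
            return
        visited.add(state)
        if all(pc == len(prog) for pc, prog in zip(pcs, programs)):
            results.add(loads)
            return
        for p, prog in enumerate(programs):
            if bufs[p]:
                # Drain: the oldest buffered store of p reaches memory.
                (var, val), rest = bufs[p][0], bufs[p][1:]
                new_mem = tuple((v, val if v == var else old)
                                for v, old in mem)
                explore(new_mem, replace(bufs, p, rest), pcs, loads)
            if pcs[p] == len(prog):
                continue
            ins = prog[pcs[p]]
            next_pcs = replace(pcs, p, pcs[p] + 1)
            if ins[0] == "st":
                # The store commits into the write buffer, not memory.
                new_buf = bufs[p] + ((ins[1], ins[2]),)
                explore(mem, replace(bufs, p, new_buf), next_pcs, loads)
            elif ins[0] == "ld":
                # Forward from the newest matching buffered store, else memory.
                pending = [val for var, val in bufs[p] if var == ins[1]]
                val = pending[-1] if pending else dict(mem)[ins[1]]
                explore(mem, bufs, next_pcs,
                        replace(loads, p, loads[p] + (val,)))
            elif ins[0] == "fence" and not bufs[p]:
                # The fence passes only once p's write buffer is empty.
                explore(mem, bufs, next_pcs, loads)

    empty = tuple(() for _ in programs)
    explore(tuple((v, 0) for v in variables), empty,
            tuple(0 for _ in programs), empty)
    return results


# P0: A=1; print B;   P1: B=1; print A;
unfenced = [[("st", "A", 1), ("ld", "B")],
            [("st", "B", 1), ("ld", "A")]]
# Same program with a write-to-read fence before each load.
fenced = [[("st", "A", 1), ("fence",), ("ld", "B")],
          [("st", "B", 1), ("fence",), ("ld", "A")]]
# P0: A=1; B=1;   P1: print B; print A;
two_writes = [[("st", "A", 1), ("st", "B", 1)],
              [("ld", "B"), ("ld", "A")]]

print(sorted(tso_outcomes(unfenced)))    # includes ((0,), (0,))
print(sorted(tso_outcomes(fenced)))      # ((0,), (0,)) is gone
print(sorted(tso_outcomes(two_writes)))  # never B=1 together with A=0
```

Without the fence, both processors can read 0 because each store is still sitting in its write buffer when the other processor's load reads memory. With the fence, each load waits for the preceding store to reach memory, so at least one processor sees 1, as under SC. The third program shows why the two-writes example behaves the same as SC: the buffer drains in FIFO order, so B cannot become visible before A.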