Module 4: "Recap: Virtual Memory and Caches"
  Lecture 8: "Cache Hierarchy and Memory-level Parallelism"
 

MLP

  • Need memory-level parallelism (MLP)
    • Simply put, several memory operations need to overlap with each other
  • Step 1: Non-blocking cache
    • Allow multiple outstanding cache misses
    • Overlap the latencies of multiple cache misses with each other
    • Supported by all microprocessors today (e.g., the Alpha 21364 supported 16 outstanding cache misses)
  • Step 2: Out-of-order load issue
    • Issue loads out of program order (even before the addresses of older stores are known)
    • How do you know the load didn’t issue before an older store to the same address? When a store issues, it must check for this memory-order violation
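Step 1 can be sketched as the MSHR (Miss Status Holding Register) bookkeeping behind a non-blocking cache: each outstanding miss occupies an MSHR entry, and later misses to the same block merge into the existing entry instead of stalling the pipeline. This is a hypothetical sketch; the class and method names are illustrative, not from any real design.

```python
class NonBlockingCache:
    """Sketch of non-blocking cache miss tracking via MSHRs."""

    def __init__(self, max_outstanding=16):  # e.g., an Alpha 21364-style limit
        self.max_outstanding = max_outstanding
        self.mshrs = {}   # block address -> list of load ids waiting on the miss
        self.data = set() # block addresses currently present in the cache

    def access(self, load_id, block_addr):
        """Returns 'hit', 'miss' (new MSHR), 'merged', or 'stall'."""
        if block_addr in self.data:
            return "hit"
        if block_addr in self.mshrs:
            self.mshrs[block_addr].append(load_id)  # secondary miss: merge
            return "merged"
        if len(self.mshrs) == self.max_outstanding:
            return "stall"  # all MSHRs busy: the cache must block
        self.mshrs[block_addr] = [load_id]  # primary miss: allocate an MSHR
        return "miss"

    def fill(self, block_addr):
        """Memory returns the block; wake up every merged load."""
        waiters = self.mshrs.pop(block_addr)
        self.data.add(block_addr)
        return waiters

cache = NonBlockingCache(max_outstanding=2)
assert cache.access(1, 0x100) == "miss"    # primary miss: MSHR allocated
assert cache.access(2, 0x200) == "miss"    # second miss overlaps the first
assert cache.access(3, 0x100) == "merged"  # secondary miss to the same block
assert cache.access(4, 0x300) == "stall"   # both MSHRs busy
assert cache.fill(0x100) == [1, 3]         # fill wakes both merged loads
assert cache.access(5, 0x100) == "hit"
```

The key point the sketch captures is that misses 1 and 2 are in flight simultaneously, so their memory latencies overlap rather than serialize.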

Out-of-order loads

sw  r6, 0(r7)
… /* other instructions */
lw  r2, 80(r20)

  • Assume that the load issues before the store because r20 becomes ready before r6 or r7
  • The load accesses the store buffer (which holds already executed store values before they are committed to the cache at retirement)
  • If it misses in the store buffer, it looks up the caches and, say, gets the value from somewhere in the memory hierarchy
  • After several cycles the store issues, and it turns out that 0(r7)==80(r20) or the two accesses overlap; now what?
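The scenario above can be sketched as a load/store queue: a load first searches the store buffer of older, already-executed stores; if no older store with a matching known address is found, it reads the cache and records its address, and when the store's address finally resolves, younger loads to the same address are flagged for squash. The class, its method names, and the sequence numbers are hypothetical illustrations.

```python
class LSQ:
    """Sketch of speculative load issue against a store buffer."""

    def __init__(self):
        self.store_buffer = {}  # store seq -> (address or None, value)
        self.issued_loads = {}  # load seq -> address it actually read

    def issue_store_addr_pending(self, seq, value):
        # Store enters the buffer before its address (0(r7)) is known.
        self.store_buffer[seq] = (None, value)

    def issue_load(self, seq, addr, memory):
        # Search older stores with known addresses, youngest first (forwarding).
        for s_seq in sorted(self.store_buffer, reverse=True):
            s_addr, s_val = self.store_buffer[s_seq]
            if s_seq < seq and s_addr == addr:
                return s_val            # value forwarded from the store buffer
        self.issued_loads[seq] = addr   # remember the address for later checks
        return memory[addr]             # store buffer miss: read the cache

    def resolve_store(self, seq, addr, value):
        # Store address finally computed: check for memory-order violations.
        self.store_buffer[seq] = (addr, value)
        return [l for l, a in self.issued_loads.items()
                if l > seq and a == addr]  # younger loads that must be squashed

mem = {200: 7}                               # stale value at the loaded address
lsq = LSQ()
lsq.issue_store_addr_pending(10, value=99)   # sw: address 0(r7) not yet known
v = lsq.issue_load(20, 200, mem)             # lw issues early, reads the cache
squash = lsq.resolve_store(10, 200, 99)      # 0(r7) turns out to equal 80(r20)
assert v == 7           # the load got the stale value
assert squash == [20]   # the load and its dependents must replay
```

This is exactly the "now what?" of the slide: the load consumed a stale value, so it and everything that depends on it must be squashed and re-executed.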

Load/store ordering

  • Out-of-order load issue relies on speculative memory disambiguation
    • Assumes that there will be no conflicting store
    • If the speculation is correct, you have issued the load much earlier and you have allowed the dependents to also execute much earlier
    • If there is a conflicting store, you have to squash the load and all the dependents that have consumed the load value and re-execute them systematically
    • Turns out that the speculation is correct most of the time
    • To further reduce load squashes, microprocessors use simple memory dependence predictors (which predict whether a load will conflict with a pending store based on the past behavior of that load or of load/store pairs)
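A minimal dependence predictor can be sketched in the spirit of the Alpha 21264's load wait table: a load that has been squashed is predicted to conflict again and is held until older store addresses resolve, and the table is cleared periodically so predictions do not stay conservative forever. The class and sizes below are illustrative assumptions, not the 21264's actual organization.

```python
class LoadWaitTable:
    """Sketch of a per-load memory dependence predictor (wait-bit table)."""

    def __init__(self, size=1024):
        self.size = size
        self.wait_bit = [False] * size  # one bit per (hashed) load PC

    def _index(self, load_pc):
        return load_pc % self.size      # trivial hash of the load's PC

    def predict_conflict(self, load_pc):
        # True -> hold this load until all older store addresses are known.
        return self.wait_bit[self._index(load_pc)]

    def train_on_squash(self, load_pc):
        # The load was squashed by a conflicting store: set its wait bit.
        self.wait_bit[self._index(load_pc)] = True

    def periodic_clear(self):
        # Cleared periodically so stale predictions do not permanently
        # serialize loads that no longer conflict.
        self.wait_bit = [False] * self.size

lwt = LoadWaitTable()
assert not lwt.predict_conflict(0x400)  # first encounter: speculate freely
lwt.train_on_squash(0x400)              # one squash observed
assert lwt.predict_conflict(0x400)      # now the load waits for older stores
lwt.periodic_clear()
assert not lwt.predict_conflict(0x400)  # speculation re-enabled after clear
```

Because speculation is correct most of the time, a single wait bit per load is enough to avoid most repeat squashes at negligible hardware cost.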