Module 3: "Recap: Single-threaded Execution"
  Lecture 6: "Instruction Issue Algorithms"
 

The pipeline

  • Fetch, decode, rename, issue, register file read, ALU, cache, retire
  • Fetch, decode, rename are in-order stages, each handles multiple instructions every cycle
  • The ROB entry is allocated in rename stage
  • Issue, register file, ALU, cache are out-of-order
  • Retire is again in-order, but multiple instructions may retire each cycle: need to free the resources and drain the pipeline quickly

What limits ILP now?

  • Instruction cache miss (normally not a big issue)
  • Branch misprediction
    • Observe that you predict a branch in decode, and the branch executes in ALU
    • There are four pipeline stages before you know the outcome
    • A misprediction costs at least 4F instructions, where F is the fetch width
  • Data cache miss
    • Assuming an issue width of 4, a frequency of 3 GHz, and a memory latency of 120 ns, one miss lasts 120 ns × 3 GHz = 360 cycles, so you need to find 4 × 360 = 1440 independent instructions to issue to hide the memory latency: this is impossible (resource shortage)
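
The two estimates above (the 4F misprediction loss and the 1440-instruction figure) can be reproduced with a minimal sketch. The function and parameter names are illustrative assumptions, not part of the lecture, and the model ignores everything except the simple products stated in the bullets.

```python
def misprediction_loss(stages_before_resolve: int, fetch_width: int) -> int:
    """Lower bound on instructions squashed by one branch misprediction.

    Every stage between prediction and resolution holds up to
    `fetch_width` wrong-path instructions.
    """
    return stages_before_resolve * fetch_width


def instructions_to_hide_latency(issue_width: int, freq_ghz: float,
                                 latency_ns: float) -> int:
    """Independent instructions needed to keep issuing during one miss."""
    cycles = round(freq_ghz * latency_ns)  # GHz * ns = cycles
    return issue_width * cycles


if __name__ == "__main__":
    # Lecture numbers: 4 stages before the branch resolves, F = 4 (assumed here).
    print(misprediction_loss(4, 4))
    # Lecture numbers: issue width 4, 3 GHz, 120 ns memory latency.
    print(instructions_to_hide_latency(4, 3.0, 120.0))
```

With the lecture's parameters the second call returns 1440, far more than the ROB entries and physical registers available to a core of this kind, which is why the bullet calls it impossible.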

Cycle time reduction

  • Execution time = CPI × instruction count × cycle time
  • So far we have talked about CPI reduction, i.e., improvement in IPC (instructions retired per cycle)
  • Cycle time reduction is another technique to boost performance
    • Faster clock frequency
  • Pipelining poses a problem
    • Each pipeline stage should be one cycle for balanced progress
    • Smaller cycle time means need to break pipe stages into smaller stages
  • Superpipelining
    • Faster clock frequency necessarily means deep pipes
    • Each pipe stage contains a small amount of logic so that it fits in the shorter cycle time
    • May severely degrade CPI if not careful
    • Now the branch penalty is even bigger (31 cycles for Intel Prescott): branch mispredictions cause a massive loss in performance (with F=3, 31 × 3 = 93 micro-ops are lost)
    • Long pipes also put more pressure on resources such as ROB and registers because instruction latency increases (in terms of cycles, not in absolute terms)
    • Instructions occupy ROB entries and registers longer
    • The design becomes increasingly complicated (long wires)
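
The trade-off between a faster clock and a degraded CPI can be made concrete with the execution time equation above. The sketch below is a rough model under stated assumptions: the base CPI, the misprediction rate, and the 10-cycle / 2 GHz "shallow" design are hypothetical numbers chosen for illustration. Only the 31-cycle penalty and F=3 come from the lecture.

```python
def execution_time_s(cpi: float, instruction_count: float, freq_ghz: float) -> float:
    """Execution time = CPI x instruction count x cycle time."""
    cycle_time_s = 1e-9 / freq_ghz
    return cpi * instruction_count * cycle_time_s


def effective_cpi(base_cpi: float, mispredicts_per_instr: float,
                  penalty_cycles: int) -> float:
    """Base CPI plus the average branch misprediction stall per instruction."""
    return base_cpi + mispredicts_per_instr * penalty_cycles


def wasted_slots(penalty_cycles: int, fetch_width: int) -> int:
    """Fetch slots lost on one misprediction."""
    return penalty_cycles * fetch_width


if __name__ == "__main__":
    n = 1e9            # hypothetical instruction count
    base_cpi = 0.5     # hypothetical CPI with perfect branch prediction
    miss_rate = 0.01   # hypothetical mispredictions per instruction

    # Hypothetical shallow pipe: 10-cycle penalty at 2 GHz.
    shallow = execution_time_s(effective_cpi(base_cpi, miss_rate, 10), n, 2.0)
    # Deep pipe: 31-cycle penalty (Prescott figure from the lecture) at 3 GHz.
    deep = execution_time_s(effective_cpi(base_cpi, miss_rate, 31), n, 3.0)

    print(wasted_slots(31, 3))  # micro-ops lost per misprediction
    print(shallow / deep)       # speedup of the deep pipe over the shallow one
```

Under these assumed numbers the clock is 1.5 times faster, but the speedup is only about 1.11, because the larger misprediction penalty raises the effective CPI from 0.6 to 0.81.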