The pipeline
- Fetch, decode, rename, issue, register file read, ALU, cache, retire
- Fetch, decode, rename are in-order stages, each handling multiple instructions every cycle
- The ROB entry is allocated in the rename stage
- Issue, register file, ALU, cache are out-of-order
- Retire is again in-order, but multiple instructions may retire each cycle: need to free the resources and drain the pipeline quickly
What limits ILP now?
- Instruction cache miss (normally not a big issue)
- Branch misprediction
- Observe that a branch is predicted in decode, but executes in the ALU
- There are four pipeline stages (rename, issue, register file read, ALU) before the outcome is known
- A misprediction amounts to a loss of at least 4F instructions, where F is the fetch width
- Data cache miss
- Assuming an issue width of 4, a frequency of 3 GHz, and a memory latency of 120 ns (i.e., 360 cycles), you need to find 360 × 4 = 1440 independent instructions to issue in order to hide the memory latency: this is impossible (resource shortage)
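The two penalties above can be checked with a quick back-of-the-envelope calculation. This is a sketch using only the figures assumed in the bullets (a 4-stage bubble from decode to ALU, fetch and issue widths of 4, a 3 GHz clock, and 120 ns memory latency):

```python
# Branch misprediction: stages between prediction (decode) and resolution (ALU)
stages_to_resolve = 4        # rename, issue, register file read, ALU
fetch_width = 4              # F: instructions fetched per cycle (assumed)
mispredict_loss = stages_to_resolve * fetch_width
print(mispredict_loss)       # at least 4F = 16 instructions lost

# Data cache miss: independent instructions needed to hide memory latency
issue_width = 4              # instructions issued per cycle
clock_ghz = 3                # 3 GHz => cycle time of 1/3 ns
mem_latency_ns = 120
miss_cycles = mem_latency_ns * clock_ghz       # 120 ns * 3 cycles/ns = 360 cycles
independent_insns = issue_width * miss_cycles  # instructions to keep the core busy
print(independent_insns)     # 1440
```

With realistic window sizes (a few hundred ROB entries at most), finding 1440 independent instructions is hopeless, which is why a last-level cache miss stalls even a wide out-of-order core.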
Cycle time reduction
- Execution time = CPI × instruction count × cycle time
- We have talked about CPI reduction, i.e., improvement in IPC (instructions retired per cycle)
- Cycle time reduction is another technique to boost performance
- Pipelining poses a problem
- Each pipeline stage should be one cycle for balanced progress
- A smaller cycle time means pipe stages must be broken into smaller stages
- Superpipelining
- Faster clock frequency necessarily means deep pipes
- Each pipe stage contains a small amount of logic so that it fits in the small cycle time
- May severely degrade CPI if not careful
- Now the branch penalty is even bigger (31 cycles for Intel Prescott): branch mispredictions cause a massive loss in performance (93 micro-ops are lost with F=3)
- Long pipes also put more pressure on resources such as the ROB and registers, because instruction latency increases (in terms of cycles, not in absolute time)
- Instructions occupy ROB entries and registers longer
- The design becomes increasingly complicated (long wires)
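The trade-off can be sketched numerically with the execution-time equation above. The 31-cycle penalty and F=3 are the Prescott figures from the text; the CPI and cycle-time values in the comparison are illustrative assumptions, not measured data:

```python
# Execution time = CPI x instruction count x cycle time
def exec_time(cpi, insn_count, cycle_ns):
    return cpi * insn_count * cycle_ns

# Micro-ops lost per misprediction in a deep pipeline (Intel Prescott)
branch_penalty_cycles = 31
fetch_width = 3                                  # F
print(branch_penalty_cycles * fetch_width)       # 93 micro-ops lost

# Superpipelining halves cycle time but degrades CPI (deeper branch penalty,
# longer resource occupancy). Hypothetical numbers for illustration:
insns = 1_000_000
shallow = exec_time(cpi=1.0, insn_count=insns, cycle_ns=0.50)
deep    = exec_time(cpi=1.8, insn_count=insns, cycle_ns=0.25)
print(shallow > deep)   # here the deep pipe still wins, but only barely
```

If CPI degrades by more than the clock speeds up (e.g., CPI doubles when cycle time halves), superpipelining is a net loss, which is why "may severely degrade CPI if not careful" is the central caveat.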