The pipeline
- Fetch, decode, rename, issue, register file read, ALU, cache, retire
- Fetch, decode, rename are in-order stages, each handling multiple instructions every cycle
- The ROB entry is allocated in the rename stage
- Issue, register file, ALU, cache are out-of-order
- Retire is again in-order, but multiple instructions may retire each cycle: need to free the resources and drain the pipeline quickly
What limits ILP now?
- Instruction cache miss (normally not a big issue)
- Branch misprediction
- Observe that a branch is predicted in decode, but executes in the ALU
- There are four pipeline stages (rename, issue, register file read, ALU) before the outcome is known
- A misprediction amounts to a loss of at least 4F instructions, where F is the fetch width
- Data cache miss
- Assuming an issue width of 4, a frequency of 3 GHz, and a memory latency of 120 ns (i.e., 360 cycles), you need to find 360 × 4 = 1440 independent instructions to issue in order to hide the memory latency: this is impossible (resource shortage)
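The two penalties above can be checked with a quick back-of-the-envelope calculation. This is a sketch using only the figures assumed in the bullets (a 4-stage bubble from decode to ALU, fetch and issue widths of 4, a 3 GHz clock, and 120 ns memory latency):

```python
# Branch misprediction: stages between prediction (decode) and resolution (ALU)
stages_to_resolve = 4        # rename, issue, register file read, ALU
fetch_width = 4              # F: instructions fetched per cycle (assumed)
mispredict_loss = stages_to_resolve * fetch_width
print(mispredict_loss)       # at least 4F = 16 instructions lost

# Data cache miss: independent instructions needed to hide memory latency
issue_width = 4              # instructions issued per cycle
clock_ghz = 3                # 3 GHz => cycle time of 1/3 ns
mem_latency_ns = 120
miss_cycles = mem_latency_ns * clock_ghz       # 120 ns * 3 cycles/ns = 360 cycles
independent_insns = issue_width * miss_cycles  # instructions to keep the core busy
print(independent_insns)     # 1440
```

With realistic window sizes (a few hundred ROB entries at most), finding 1440 independent instructions is hopeless, which is why a last-level cache miss stalls even a wide out-of-order core.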
Cycle time reduction
- Execution time = CPI × instruction count × cycle time
- We have talked about CPI reduction, i.e., improvement in IPC (instructions retired per cycle)
- Cycle time reduction is another technique to boost performance
- Pipelining poses a problem
- Each pipeline stage should be one cycle for balanced progress
- A smaller cycle time means pipe stages must be broken into smaller stages
- Superpipelining
- Faster clock frequency necessarily means deep pipes
- Each pipe stage contains a small amount of logic so that it fits in the small cycle time
- May severely degrade CPI if not careful
- Now the branch penalty is even bigger (31 cycles for Intel Prescott): branch mispredictions cause a massive loss in performance (93 micro-ops are lost with F=3)
- Long pipes also put more pressure on resources such as the ROB and registers, because instruction latency increases (in terms of cycles, not in absolute time)
- Instructions occupy ROB entries and registers longer
- The design becomes increasingly complicated (long wires)
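The trade-off can be sketched numerically with the execution-time equation above. The 31-cycle penalty and F=3 are the Prescott figures from the text; the CPI and cycle-time values in the comparison are illustrative assumptions, not measured data:

```python
# Execution time = CPI x instruction count x cycle time
def exec_time(cpi, insn_count, cycle_ns):
    return cpi * insn_count * cycle_ns

# Micro-ops lost per misprediction in a deep pipeline (Intel Prescott)
branch_penalty_cycles = 31
fetch_width = 3                                  # F
print(branch_penalty_cycles * fetch_width)       # 93 micro-ops lost

# Superpipelining halves cycle time but degrades CPI (deeper branch penalty,
# longer resource occupancy). Hypothetical numbers for illustration:
insns = 1_000_000
shallow = exec_time(cpi=1.0, insn_count=insns, cycle_ns=0.50)
deep    = exec_time(cpi=1.8, insn_count=insns, cycle_ns=0.25)
print(shallow > deep)   # here the deep pipe still wins, but only barely
```

If CPI degrades by more than the clock speeds up (e.g., CPI doubles when cycle time halves), superpipelining is a net loss, which is why "may severely degrade CPI if not careful" is the central caveat.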