
Multi-cycle execution

Simplest implementation
  - Assume each of five stages takes a cycle
  - Five cycles to execute an instruction
  - After instruction i finishes you start fetching instruction i+1
  - Without “long latency” instructions CPI is 5

Alternative implementation
  - You could have a five times slower clock to accommodate all the logic within one cycle
  - Then you can say CPI is 1 excluding mult/div, mem op
  - But overall execution time really doesn’t change

What can you do to lower the CPI?

Pipelining

Simple observation
  - In the multi-cycle implementation when the ALU is executing, say, an add instruction the decoder is idle
  - Exactly one stage is active at any point in time
  - Wastage of hardware

Solution: pipelining
  - Process five instructions in parallel
  - Each instruction is in a different stage of processing
  - Each stage is called a pipeline stage
  - Need registers between pipeline stages to hold partially processed instructions (called pipeline latches): why?

More on pipelining

What do you gain?
  - Parallelism: called instruction-level parallelism (ILP)
  - Ideal CPI of 1 at the same clock speed as multi-cycle implementation: ideally 5 times reduction in execution time

What are the problems?
  - Slightly more complex
  - Control and data hazards
  - These hazards put a limit on available ILP
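
The CPI and execution-time claims above can be checked with a little arithmetic. The following is a minimal sketch, not part of the original slides. It assumes an ideal five-stage machine with no hazards and no long-latency instructions, and a clock period of 1.0 in arbitrary units. The function names and the instruction count are hypothetical, chosen only for illustration.

```python
STAGES = 5  # the five stages assumed in the lecture


def multicycle_cycles(n, stages=STAGES):
    """Multi-cycle: instruction i+1 is fetched only after instruction i
    has gone through all `stages` cycles."""
    return n * stages


def pipelined_cycles(n, stages=STAGES):
    """Ideal pipeline (assumes no hazards): the first instruction needs
    `stages` cycles, then one instruction completes every cycle."""
    return 0 if n == 0 else stages + (n - 1)


def exec_time(cycles, clock_period):
    """Execution time = cycle count * clock period."""
    return cycles * clock_period


n = 1000        # hypothetical instruction count
period = 1.0    # assumed clock period of the multi-cycle design (arbitrary units)

multi = exec_time(multicycle_cycles(n), period)        # CPI = 5, fast clock
slow_single = exec_time(n, STAGES * period)            # CPI = 1, clock 5x slower
pipe = exec_time(pipelined_cycles(n), period)          # CPI close to 1, fast clock

print("multi-cycle time:       ", multi)
print("slow single-cycle time: ", slow_single)
print("pipelined time:         ", pipe)
print("pipelined CPI:          ", pipelined_cycles(n) / n)
print("speedup over multi-cycle:", multi / pipe)
```

Under these assumptions the multi-cycle and slow-clock designs take the same time, while the pipelined CPI approaches 1 and the speedup approaches 5 as the instruction count grows. The few extra cycles are the cost of filling the pipeline. Control and data hazards, which this sketch ignores, would push the real CPI above 1.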