Multi-cycle execution
- Simplest implementation
- Assume each of five stages takes a cycle
- Five cycles to execute an instruction
- After instruction i finishes you start fetching instruction i+1
- Ignoring “long-latency” instructions (mult/div, memory operations), the CPI is 5
- Alternative implementation
- You could use a clock that is five times slower, so that all the logic fits within one cycle
- Then you can say the CPI is 1, excluding mult/div and memory operations
- But the overall execution time does not really change: the CPI is one fifth, but each cycle is five times longer
- What can you do to lower the CPI?
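The comparison above can be checked with the usual relation: execution time = instruction count × CPI × clock cycle time. The sketch below is only illustrative; the instruction count and the 1 ns cycle time are made-up values, and long-latency instructions are ignored as in the bullets above.

```python
def execution_time(num_instructions, cpi, cycle_time_ns):
    """Execution time = instruction count x CPI x clock cycle time."""
    return num_instructions * cpi * cycle_time_ns

N = 1000   # hypothetical instruction count
T = 1.0    # hypothetical cycle time (ns) of the multi-cycle design

# Multi-cycle: five short cycles per instruction.
multi_cycle = execution_time(N, cpi=5, cycle_time_ns=T)

# Alternative: one cycle per instruction, but the cycle is five times longer.
slow_clock = execution_time(N, cpi=1, cycle_time_ns=5 * T)

print(multi_cycle, slow_clock)
```

Both designs take the same total time, so lowering the CPI only helps if the cycle time does not grow to compensate.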
Pipelining
- Simple observation
- In the multi-cycle implementation, while the ALU is executing, say, an add instruction, the decoder is idle
- Exactly one stage is active at any point in time
- The other four stages sit idle: a waste of hardware
- Solution: pipelining
- Process five instructions in parallel
- Each instruction is in a different stage of processing
- Each stage is called a pipeline stage
- Need registers between pipeline stages to hold partially processed instructions (called pipeline latches): why?
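The overlap described above can be pictured as a timing diagram. The following is a minimal sketch, assuming the five stages carry the classic names IF, ID, EX, MEM, WB (the notes do not name them) and that there are no hazards or stalls; `pipeline_diagram` is a hypothetical helper written for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # assumed stage names

def pipeline_diagram(num_instructions):
    """Row i lists the stage occupied by instruction i in each cycle ('--' if none)."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = []
        for cycle in range(total_cycles):
            stage_index = cycle - i  # instruction i enters the pipeline in cycle i
            if 0 <= stage_index < len(STAGES):
                row.append(STAGES[stage_index])
            else:
                row.append("--")
        rows.append(row)
    return rows

for i, row in enumerate(pipeline_diagram(5)):
    print(f"i{i}: " + " ".join(f"{s:>3}" for s in row))
```

Reading one column of the printed diagram from top to bottom shows five different instructions, each in a different stage, during the same cycle. Each stage hands its result to the next one at the cycle boundary, which is where the pipeline latches sit.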
More on pipelining
- What do you gain?
- Parallelism: called instruction-level parallelism (ILP)
- Ideal CPI of 1 at the same clock speed as the multi-cycle implementation: ideally a five-fold reduction in execution time
- What are the problems?
- The hardware is slightly more complex
- Control and data hazards
- These hazards put a limit on available ILP
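A rough cycle count shows both the ideal gain and how stalls erode it. This is a sketch under simplifying assumptions: the clock is the same for both designs, a five-stage pipeline needs four cycles to fill, and every hazard simply adds stall cycles. The stall count used below is a made-up number, not a measurement.

```python
def multi_cycle_cycles(n, stages=5):
    # Each instruction occupies all stages before the next one starts.
    return n * stages

def pipelined_cycles(n, stages=5, stall_cycles=0):
    # (stages - 1) cycles to fill the pipeline, then one instruction
    # completes per cycle, plus any cycles lost to hazards.
    return (stages - 1) + n + stall_cycles

n = 1000  # hypothetical instruction count

ideal = multi_cycle_cycles(n) / pipelined_cycles(n)
with_stalls = multi_cycle_cycles(n) / pipelined_cycles(n, stall_cycles=300)  # hypothetical stalls

print(f"ideal speedup: {ideal:.2f}")
print(f"speedup with stalls: {with_stalls:.2f}")
```

With no stalls the speedup approaches 5 as the instruction count grows; every stall cycle pushes the effective CPI above 1 and the speedup below that ideal.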