SMT
- Discussed simultaneous multithreading (SMT)
- Basic goal is to run multiple threads at the same time
- Helps hide long memory latency: even if one thread is blocked on a cache miss, ready instructions from other threads can still be scheduled without incurring the overhead of a context switch
- Improves memory level parallelism (MLP)
- Overall, improves resource utilization substantially compared to a single-threaded superscalar processor
- The latency of an individual thread may not improve, but the overall throughput of the system (i.e., the average number of instructions retired per cycle) increases
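The latency-versus-throughput distinction can be made concrete with a little arithmetic. The following is a minimal sketch; the instruction and cycle counts are hypothetical, chosen only for illustration, and the helper `ipc` is not from the lecture.

```python
def ipc(retired_instructions, cycles):
    """Average number of instructions retired per cycle."""
    return retired_instructions / cycles


# Hypothetical numbers, chosen only for illustration.
INSTRUCTIONS_PER_THREAD = 7_000
CYCLES_ALONE = 7_000      # one thread running by itself on the core
CYCLES_WITH_SMT = 10_000  # both threads co-scheduled; each finishes later

# Without SMT the two threads run one after the other.
throughput_serial = ipc(2 * INSTRUCTIONS_PER_THREAD, 2 * CYCLES_ALONE)
# With SMT both threads retire all their instructions within CYCLES_WITH_SMT.
throughput_smt = ipc(2 * INSTRUCTIONS_PER_THREAD, CYCLES_WITH_SMT)

print("per-thread latency:", CYCLES_ALONE, "->", CYCLES_WITH_SMT, "cycles")
print("throughput (IPC):", throughput_serial, "->", throughput_smt)
```

In this assumed scenario each thread takes longer to finish under SMT, yet the core retires more instructions per cycle overall.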
Multi-threading
- Three design choices for single-core hardware multi-threading
- Coarse-grain multithreading: execute one thread at a time; when the running thread blocks on a long-latency event (e.g., a cache miss), swap in a new thread; the swap can be done in hardware, but it needs extra support and extra cycles to flush the pipeline and save register values (unless the renamed registers remain pinned)
- Fine-grain multithreading: fetch, decode, rename, issue, and execute instructions from the threads in round-robin fashion; this improves utilization across cycles, but the problem remains within a cycle; also, if a thread blocks on a long-latency event, its slots are wasted for many cycles
- Simultaneous multithreading (SMT): mix instructions from all threads every cycle; this offers the highest resource utilization of the three
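The three design choices can be compared with a toy issue-slot model. This is a rough sketch under strong assumptions, not a pipeline simulator: the per-cycle readiness table `READY`, the issue width `WIDTH`, the function `slots_filled`, and the `switch_penalty` parameter are all hypothetical, and a thread's readiness is assumed not to depend on what was issued earlier.

```python
def slots_filled(ready, width, policy, switch_penalty=2):
    """Return how many issue slots are filled in each cycle.

    ready[t][c] is the number of instructions thread t could issue in
    cycle c (0 means the thread is blocked, e.g. on a cache miss).
    """
    n_threads = len(ready)
    n_cycles = len(ready[0])
    filled = []
    current = 0       # thread that owns the pipeline (coarse-grain only)
    penalty_left = 0  # dead cycles remaining from an in-progress switch
    for c in range(n_cycles):
        if policy == "coarse":
            if penalty_left > 0:
                penalty_left -= 1
                filled.append(0)
            elif ready[current][c] == 0:
                # Running thread is blocked: switch to the next ready thread.
                # The switch costs switch_penalty cycles, including this one.
                for step in range(1, n_threads):
                    cand = (current + step) % n_threads
                    if ready[cand][c] > 0:
                        current = cand
                        penalty_left = switch_penalty - 1
                        break
                filled.append(0)
            else:
                filled.append(min(width, ready[current][c]))
        elif policy == "fine":
            # Strict round robin: a blocked thread wastes its whole cycle.
            filled.append(min(width, ready[c % n_threads][c]))
        elif policy == "smt":
            # Any ready instruction from any thread may take a slot.
            total = sum(ready[t][c] for t in range(n_threads))
            filled.append(min(width, total))
        else:
            raise ValueError("unknown policy: " + policy)
    return filled


WIDTH = 4  # assumed issue width
READY = [
    [2, 2, 0, 0, 0, 0, 2, 2],  # thread 0: blocked in cycles 2-5
    [1, 3, 1, 3, 1, 3, 1, 3],  # thread 1: never blocked, limited ILP
]

for policy in ("coarse", "fine", "smt"):
    filled = slots_filled(READY, WIDTH, policy)
    print(policy, sum(filled), "of", WIDTH * len(READY[0]), "slots")
```

With these made-up inputs the sketch fills 12, 16, and 22 of the 32 issue slots for coarse-grain, fine-grain, and SMT respectively. The numbers depend entirely on the assumed inputs and are not a general result.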
Problems of SMT
- SMT offers a processor that can deliver reasonably good multithreaded performance with fast, fine-grained communication through the cache
- Although it is possible to design an SMT processor with a small increase in die area (5% in Pentium 4), good performance requires rethinking the resource allocation policies at various stages of the pipeline
- Also, verifying an SMT processor is much harder than the basic underlying superscalar design
- Must think about various deadlock/livelock possibilities since the threads interact with each other through shared resources on a per-cycle basis
- Why not use the transistors available today to simply replicate existing superscalar cores and build a single-chip multiprocessor (CMP)?
CMP
- CMP is the mantra of today’s microprocessor industry
- Intel’s dual-core Pentium 4: each core is still hyperthreaded (just uses existing cores)
- Intel’s quad-core Whitefield is expected in a year or so
- For the server market, Intel has announced a dual-core Itanium 2 (code-named Montecito); again, each core is 2-way threaded
- AMD released the dual-core Opteron in 2005
- IBM released its first dual-core processor, POWER4, circa 2001; the next-generation POWER5 also has two cores, and each core is 2-way threaded
- Sun’s UltraSPARC IV (released in early 2004) is a dual-core processor and integrates two UltraSPARC III cores
Why CMP?
- Today, microprocessor designers can afford to put a lot of transistors on the die
- Ever-shrinking feature size leads to dense packing
- What would you do with so many transistors?
- Can invest some in cache, but beyond a certain point it doesn’t help
- The natural choice was to think about a greater level of integration
- A few chip designers decided to bring the memory and coherence controllers, along with the router, onto the die
- The next obvious choice was to replicate the entire core, which is fairly simple: reuse the existing cores and connect them through a coherent interconnect