
SMT
- Discussed simultaneous multithreading (SMT)
- Basic goal is to run multiple threads at the same time
- Helps hide large memory latency: even if one thread is blocked on a cache miss, ready instructions from other threads can still be scheduled without the overhead of a context switch
- Improves memory-level parallelism (MLP)
- Overall, improves resource utilization enormously compared to a superscalar processor
- Latency of a particular thread may not improve, but the overall throughput of the system (i.e., the average number of retired instructions per cycle) increases

Multi-threading
- Three design choices for single-core hardware multi-threading
- Coarse-grain multithreading: execute one thread at a time; when the running thread is blocked on a long-latency event (e.g., a cache miss), swap in a new thread; this swap can take place in hardware (needs extra support and extra cycles for flushing the pipe and saving register values, unless renamed registers remain pinned)
- Fine-grain multithreading: fetch, decode, rename, issue, and execute instructions from the threads in round-robin fashion; improves utilization across cycles, but the problem remains within a cycle; also, if a thread gets blocked on a long-latency event, its slots are wasted for many cycles
- Simultaneous multithreading (SMT): mix instructions from all threads every cycle; maximum utilization of resources

Problems of SMT
- Offers a processor that can deliver reasonably good multithreaded performance with fine-grained, fast communication through cache
- Although it is possible to design an SMT processor with a small die area increase (5% in Pentium 4), good performance requires rethinking the resource allocation policies at various stages of the pipe
- Also, verifying an SMT processor is much harder than verifying the underlying superscalar design
- Must think about various deadlock/livelock possibilities, since the threads interact with each other through shared resources on a per-cycle basis
- Why not exploit the transistors available today to just replicate existing superscalar cores and design a single-chip multiprocessor (CMP)?

CMP
- CMP is the mantra of today’s microprocessor industry
- Intel’s dual-core Pentium 4: each core is still hyperthreaded (just uses existing cores)
- Intel’s quad-core Whitefield is coming up in a year or so
- For the server market, Intel has announced a dual-core Itanium 2 (code-named Montecito); again, each core is 2-way threaded
- AMD released the dual-core Opteron in 2005
- IBM released its first dual-core processor, POWER4, circa 2001; the next-generation POWER5 also uses two cores, but each core is also 2-way threaded
- Sun’s UltraSPARC IV (released in early 2004) is a dual-core processor that integrates two UltraSPARC III cores

Why CMP?
- Today, microprocessor designers can afford to have a lot of transistors on the die
- Ever-shrinking feature size leads to dense packing
- What would you do with so many transistors?
- Can invest some in cache, but beyond a certain point it doesn’t help
- Natural choice was to think about a greater level of integration
- A few chip designers decided to bring the memory and coherence controllers, along with the router, onto the die
- The next obvious choice was to replicate the entire core; it is fairly simple: just use the existing cores and connect them through a coherent interconnect
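
Returning to the three single-core multithreading design choices (coarse-grain, fine-grain, and SMT), a toy issue-slot model may help make the utilization argument concrete. The sketch below is only an illustration under stated assumptions, not a model of any real processor: the issue width, the swap penalty, and the `READY` trace (how many instructions each thread has ready in each cycle, with 0 meaning the thread is blocked on a miss) are all made up, and instructions that are not issued in a cycle are simply dropped rather than carried over.

```python
from typing import List

ISSUE_WIDTH = 4      # issue slots per cycle (assumed value)
SWITCH_PENALTY = 1   # extra cycles lost after a coarse-grain swap (assumed value)

# Hypothetical trace: READY[t][c] = instructions thread t has ready in cycle c.
# A 0 means the thread is blocked (e.g., waiting on a cache miss).
READY = [
    [2, 2, 0, 0, 0, 2, 2, 2],  # thread 0, blocked in cycles 2-4
    [1, 3, 2, 2, 1, 0, 0, 3],  # thread 1, blocked in cycles 5-6
]


def coarse_grain(ready: List[List[int]]) -> int:
    """One thread runs at a time; swap only when it has nothing ready."""
    n_threads, n_cycles = len(ready), len(ready[0])
    issued, current, stall = 0, 0, 0
    for c in range(n_cycles):
        if stall > 0:                 # pipe is still being flushed/refilled
            stall -= 1
            continue
        if ready[current][c] == 0:    # running thread is blocked
            for k in range(1, n_threads):
                cand = (current + k) % n_threads
                if ready[cand][c] > 0:
                    # The detection cycle is lost, plus SWITCH_PENALTY more.
                    current, stall = cand, SWITCH_PENALTY
                    break
            continue
        issued += min(ISSUE_WIDTH, ready[current][c])
    return issued


def fine_grain(ready: List[List[int]]) -> int:
    """Strict round robin: each cycle belongs to exactly one thread."""
    n_threads, n_cycles = len(ready), len(ready[0])
    issued = 0
    for c in range(n_cycles):
        t = c % n_threads             # a blocked thread still owns its cycle
        issued += min(ISSUE_WIDTH, ready[t][c])
    return issued


def smt(ready: List[List[int]]) -> int:
    """Fill the issue slots of every cycle from all threads together."""
    n_threads, n_cycles = len(ready), len(ready[0])
    issued = 0
    for c in range(n_cycles):
        available = sum(ready[t][c] for t in range(n_threads))
        issued += min(ISSUE_WIDTH, available)
    return issued


total_slots = ISSUE_WIDTH * len(READY[0])
for name, policy in [("coarse-grain", coarse_grain),
                     ("fine-grain", fine_grain),
                     ("SMT", smt)]:
    n = policy(READY)
    print(f"{name:12s}: {n} of {total_slots} issue slots used")
```

On this particular trace the model uses the fewest slots under coarse-grain switching, more under fine-grain round robin, and the most under SMT, which matches the qualitative ordering in the list of design choices. The exact numbers depend entirely on the assumed trace and penalties.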