Module 18: "TLP on Chip: HT/SMT and CMP"
  Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing"
 

SMT

  • Discussed simultaneous multithreading (SMT)
    • Basic goal is to run multiple threads at the same time
    • Helps hide long memory latency: even if one thread is blocked on a cache miss, ready instructions from other threads can still be scheduled without incurring the overhead of a context switch
    • Improves memory level parallelism (MLP)
    • Overall, improves resource utilization substantially compared to a single-threaded superscalar processor
    • Latency of a particular thread may not improve, but the overall throughput of the system (i.e., the average number of instructions retired per cycle) increases
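
The throughput measure in the last bullet can be written out explicitly. This is only a sketch of the definition; the symbols T (number of hardware threads), N_i (instructions retired by thread i), and C (elapsed cycles) are notation assumed here, not taken from the lecture.

```latex
\[
  \mathrm{IPC}_{\text{total}} \;=\; \frac{1}{C}\sum_{i=1}^{T} N_i
  \;=\; \sum_{i=1}^{T} \mathrm{IPC}_i ,
  \qquad \mathrm{IPC}_i = \frac{N_i}{C}
\]
```

As a purely hypothetical illustration: if a thread running alone retires 1.5 instructions per cycle, and two such threads sharing the core retire 1.0 each, then every thread individually runs slower while the total rises from 1.5 to 2.0 instructions per cycle.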

Multi-threading

  • Three design choices for single-core hardware multi-threading
    • Coarse-grain multithreading: Execute one thread at a time; when the running thread blocks on a long-latency event (e.g., a cache miss), swap in a new thread; this swap can take place in hardware (needs extra support and extra cycles for flushing the pipe and saving register values, unless renamed registers remain pinned)
    • Fine-grain multithreading: Fetch, decode, rename, issue, and execute instructions from the threads in round-robin fashion, one thread per cycle; improves utilization across cycles, but the problem remains within a cycle; also, if a thread gets blocked on a long-latency event, its slots are wasted for many cycles
    • Simultaneous multithreading (SMT): Mix instructions from all threads every cycle; maximum utilization of resources
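
The three design choices above can be compared with a toy issue-slot model. This is a minimal sketch under strong assumptions, not a pipeline simulator: the trace `READY`, the issue width `WIDTH`, the `switch_penalty` parameter, and all function names are hypothetical, and instructions that are not issued in a cycle are simply dropped rather than carried over.

```python
def coarse_grain(ready, width, switch_penalty):
    """Run one thread until it has nothing ready, then swap threads.

    Assumption: the cycle in which the block is detected is lost, and the
    swap costs `switch_penalty` further idle cycles (pipe flush, etc.).
    """
    n_threads, n_cycles = len(ready), len(ready[0])
    current, stall, issued = 0, 0, 0
    for c in range(n_cycles):
        if stall > 0:                    # still paying for the last swap
            stall -= 1
            continue
        if ready[current][c] == 0:       # running thread is blocked
            current = (current + 1) % n_threads
            stall = switch_penalty
            continue
        issued += min(width, ready[current][c])
    return issued


def fine_grain(ready, width):
    """Issue from exactly one thread per cycle, chosen round robin."""
    n_threads, n_cycles = len(ready), len(ready[0])
    return sum(min(width, ready[c % n_threads][c]) for c in range(n_cycles))


def smt(ready, width):
    """Fill the issue slots of every cycle from all threads together."""
    n_cycles = len(ready[0])
    return sum(min(width, sum(t[c] for t in ready)) for c in range(n_cycles))


# Hypothetical trace: READY[t][c] is the number of instructions thread t
# could issue in cycle c; a value of 0 means the thread is blocked.
READY = [
    [2, 2, 0, 0, 0, 0, 2, 2],  # thread 0: cache miss during cycles 2-5
    [2, 2, 2, 2, 2, 2, 2, 2],  # thread 1: never blocks
]
WIDTH = 4  # issue slots per cycle (assumed)

if __name__ == "__main__":
    total_slots = WIDTH * len(READY[0])
    results = [
        ("coarse-grain", coarse_grain(READY, WIDTH, switch_penalty=1)),
        ("fine-grain", fine_grain(READY, WIDTH)),
        ("SMT", smt(READY, WIDTH)),
    ]
    for name, issued in results:
        print(f"{name:12s} {issued}/{total_slots} issue slots used")
```

With this particular trace, the coarse-grain and fine-grain policies each fill 12 of the 32 issue slots, while SMT fills 24. Neither single-thread-per-cycle policy can use more than two slots in a cycle, and fine-grain additionally wastes the blocked thread's turns; SMT recovers both kinds of loss.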

Problems of SMT

  • SMT offers a processor that can deliver reasonably good multithreaded performance with fine-grained, fast communication through the cache
    • Although it is possible to design an SMT processor with a small increase in die area (5% in Pentium 4), good performance requires rethinking the resource allocation policies at various stages of the pipe
    • Also, verifying an SMT processor is much harder than the basic underlying superscalar design
    • Must think about various deadlock/livelock possibilities since the threads interact with each other through shared resources on a per-cycle basis
    • Why not exploit the transistors available today to just replicate existing superscalar cores and design a single chip multiprocessor (CMP)?

CMP

  • CMP is the mantra of today’s microprocessor industry
    • Intel’s dual-core Pentium 4: each core is still hyperthreaded (just uses existing cores)
    • Intel’s quad-core Whitefield is coming up in a year or so
    • For the server market, Intel has announced a dual-core Itanium 2 (code-named Montecito); again, each core is 2-way threaded
    • AMD released the dual-core Opteron in 2005
    • IBM released its first dual-core processor, POWER4, circa 2001; the next-generation POWER5 also uses two cores, and each core is 2-way threaded
    • Sun’s UltraSPARC IV (released in early 2004) is a dual-core processor and integrates two UltraSPARC III cores

Why CMP?

  • Today, microprocessor designers can afford to have a lot of transistors on the die
    • Ever-shrinking feature size leads to dense packing
    • What would you do with so many transistors?
    • Can invest some in cache, but beyond a certain point it doesn’t help
    • The natural choice was to think about a greater level of integration
    • A few chip designers decided to bring the memory and coherence controllers, along with the router, onto the die
    • The next obvious choice was to replicate the entire core; it is fairly simple: just use the existing cores and connect them through a coherent interconnect