
Module 18: "TLP on Chip: HT/SMT and CMP"

Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing"

Clustered architecture

An alternative to CMP is a clustered microarchitecture
- Still tries to extract ILP and runs a single thread
- But divides the execution unit into clusters where each cluster has a separate register file
- Number of ports per register file goes down dramatically, reducing the complexity
- Can even replicate/partition caches
- Big disadvantage: keeping the register file and cache partitions coherent; may need global wires
  - Key factor: frequency of communication
- Also, standard problems of single-threaded execution remain: branch prediction, fetch bandwidth, etc.

May want to steer dependent instructions to the same cluster to minimize communication
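
The steering idea can be illustrated with a small sketch. It is not a description of any real steering logic: the instruction encoding (a destination register plus a list of source registers), the function names, and the tie-breaking rule are all assumptions made for this example. Each instruction goes to the cluster that produced most of its source operands, and the number of operands that would have to cross clusters is counted.

```python
def steer_dependent(instructions, num_clusters=2):
    """Assign each (dest, srcs) instruction to a cluster (hypothetical heuristic).

    Prefer the cluster that holds most of the source operands; if no source
    has a known producer, pick the least-loaded cluster.
    """
    home = {}                   # register -> cluster that last wrote it
    load = [0] * num_clusters   # instructions steered to each cluster so far
    assignment = []
    for dest, srcs in instructions:
        votes = [0] * num_clusters
        for s in srcs:
            if s in home:
                votes[home[s]] += 1
        if any(votes):
            # Most operands wins; ties go to the less loaded cluster.
            cluster = max(range(num_clusters), key=lambda c: (votes[c], -load[c]))
        else:
            cluster = min(range(num_clusters), key=lambda c: load[c])
        home[dest] = cluster
        load[cluster] += 1
        assignment.append(cluster)
    return assignment


def cross_cluster_copies(instructions, assignment):
    """Count source operands that were produced in a different cluster."""
    home = {}
    copies = 0
    for (dest, srcs), cluster in zip(instructions, assignment):
        copies += sum(1 for s in srcs if s in home and home[s] != cluster)
        home[dest] = cluster
    return copies


# A four-instruction dependence chain: r1 -> r2 -> r3 -> r4.
chain = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
dependent = steer_dependent(chain)
round_robin = [i % 2 for i in range(len(chain))]
print(dependent, cross_cluster_copies(chain, dependent))
print(round_robin, cross_cluster_copies(chain, round_robin))
```

On this chain, dependence-based steering keeps every instruction in one cluster and needs no inter-cluster copies, while round-robin steering needs one copy per dependent instruction. The price is load imbalance: a long chain occupies a single cluster, so a practical heuristic would have to trade communication against balance.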

ABCs of CMP

Where to put the interconnect?
- Do not want to access the interconnect too frequently because these wires are slow
- Sharing the L1 cache among the cores probably does not make much sense: it requires very high bandwidth and may force a redesign of the L1 cache and the surrounding load/store unit, which we want to avoid; so settle for private L1 caches, one per core
- Makes more sense to share the L2 or L3 caches
- Need a coherence protocol at the L2 interface to keep the private L1 caches coherent: may use a high-speed, custom-designed snoopy bus connecting the L1 controllers, or a simple directory protocol
- An entirely different design choice is not to share the cache hierarchy at all (dual-core AMD and Intel): this removes the need for an on-chip coherence protocol, but gives no gain in communication latency
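
As a rough illustration of the "simple directory protocol" option, the sketch below keeps, for each cache line, the set of cores whose private L1 holds a copy and the core (if any) that holds it modified. The class name, method names, and message tuples are hypothetical, and the protocol is a heavily simplified invalidation scheme: it only lists the messages the L2-side directory would send and models no data transfer, timing, or races.

```python
class Directory:
    """Toy directory at the shared L2, tracking private L1 copies (illustrative)."""

    def __init__(self):
        self.sharers = {}   # line address -> set of core ids holding a copy
        self.owner = {}     # line address -> core id holding the line modified

    def read(self, core, line):
        """Handle an L1 read miss; return the messages the directory sends."""
        msgs = []
        owner = self.owner.get(line)
        if owner is not None and owner != core:
            # The modified copy must be written back; the owner keeps a shared copy.
            msgs.append(("writeback", owner))
            del self.owner[line]
        self.sharers.setdefault(line, set()).add(core)
        return msgs

    def write(self, core, line):
        """Handle an L1 write (miss or upgrade); invalidate all other copies."""
        others = self.sharers.get(line, set()) - {core}
        msgs = [("invalidate", c) for c in sorted(others)]
        self.sharers[line] = {core}
        self.owner[line] = core
        return msgs


d = Directory()
d.read(0, 0x40)          # core 0 reads: no messages needed
d.read(1, 0x40)          # core 1 reads: line is now shared by cores 0 and 1
print(d.write(2, 0x40))  # core 2 writes: both shared copies are invalidated
print(d.read(0, 0x40))   # core 0 reads again: core 2 must write back
```

A snoopy bus would achieve the same effect by broadcasting each miss to all L1 controllers instead of consulting per-line sharer state.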

Shared cache design

Needs to be banked
- How many coherence engines per bank?
- Notion of home bank? Miss in home bank means what?
- Snoop or directory?
- COMA with home bank?
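
One common way to give every line a home bank is to interleave consecutive cache lines across the banks using low-order bits of the line address. The sketch below assumes this static mapping; the line size, bank count, and function name are illustrative choices, not values from the lecture.

```python
LINE_BYTES = 64   # assumed cache line size
NUM_BANKS = 8     # assumed number of L2 banks


def home_bank(addr):
    """Return the home bank of a physical address under line interleaving."""
    line_number = addr // LINE_BYTES   # drop the offset within the line
    return line_number % NUM_BANKS     # consecutive lines go to consecutive banks


for addr in (0, 63, 64, 64 * 8, 64 * 9 + 5):
    print(hex(addr), "-> bank", home_bank(addr))
```

With a fixed mapping like this, a miss in the home bank means the line is not in the shared cache at all. Under a COMA-like scheme, where lines may migrate to other banks, a home-bank miss would not by itself settle that question.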

Hierarchical MP

SMT and CMP add a couple more levels to hierarchical multiprocessor design
- If you just have an SMT processor, among the threads you can do shared memory multiprocessing with possibly the fastest communication; you can connect the SMT processors to build an SMP over a snoopy bus; you can connect these SMP nodes over a network with a directory protocol
- Can do the same thing with CMP; the only difference is that you need to design the on-chip coherence logic (coherence is not automatically enforced as in SMT)
- If you have a CMP with each core being an SMT, then you really have a tall hierarchy of shared memory; communication becomes costlier as you go up the hierarchy, and also highly non-uniform
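
The non-uniformity can be made concrete by asking at which level of the hierarchy two hardware threads first share something. The sketch below is only a model: the `Ctx` tuple, the level names, and the four-level layout (SMT threads in a core, cores in a CMP, CMPs on a snoopy bus, nodes on a directory-based network) are assumptions drawn from the bullets above. It assigns no latencies, only an ordering from cheapest to costliest.

```python
from typing import NamedTuple


class Ctx(NamedTuple):
    """Location of a hardware thread context (hypothetical encoding)."""
    node: int     # SMP node, connected to other nodes by a directory-based network
    chip: int     # CMP within the node, on the node's snoopy bus
    core: int     # core within the CMP
    thread: int   # SMT context within the core


# Ordered from cheapest to costliest communication.
LEVELS = ["same core (SMT)", "same chip (CMP)", "same node (snoopy bus)",
          "remote node (directory)"]


def comm_level(a, b):
    """Return the index into LEVELS at which contexts a and b communicate."""
    if a.node != b.node:
        return 3
    if a.chip != b.chip:
        return 2
    if a.core != b.core:
        return 1
    return 0


t0 = Ctx(node=0, chip=0, core=0, thread=0)
for other in (Ctx(0, 0, 0, 1), Ctx(0, 0, 1, 0), Ctx(0, 1, 0, 0), Ctx(1, 0, 0, 0)):
    print(other, "->", LEVELS[comm_level(t0, other)])
```

A thread scheduler that knows this ordering can try to place communicating threads as low in the hierarchy as possible.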