What is needed?
- On every memory operation
- Find the state of the cache line (normally present in cache)
- Tell others about the operation if needed (achieved by broadcasting in bus-based or small-scale distributed systems)
- Other processors must take appropriate actions
- Also need a mechanism to resolve races between concurrent conflicting accesses (i.e., at least one of them is a write): this essentially requires some central point of control on a per-cache-line basis
- Atomic bus provides an easy way of serializing
- Split-transaction bus with a distributed request table works only because every request table can see every transaction
- Need to have a table that gets accessed/updated by every cache line access
- This is the directory
- Every cache line has a separate directory entry
- The directory entry stores the state of the line, who the current owner is (if any), the sharers (if any), etc.
- On a miss, the directory entry must be located, and appropriate coherence action must be taken
- A popular architecture is a two-level hierarchy: each node is an SMP kept coherent by a snoopy or directory protocol, and the nodes are kept coherent with each other by a scalable directory protocol (Convex Exemplar: directory-directory; Sequent, Data General, HAL, DASH: snoopy-directory)
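The directory entry and miss handling described above can be sketched as follows. This is a minimal illustration, not any particular machine's protocol: the three states, the class names (`State`, `DirectoryEntry`, `Directory`), and the action labels are all assumptions made for the example. Requests are handled one at a time, which is what gives the per-line serialization point needed to resolve races.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, List, Optional, Set, Tuple


class State(Enum):
    # Hypothetical three-state directory; real protocols have more states.
    UNOWNED = auto()    # no cache holds the line
    SHARED = auto()     # one or more caches hold a clean copy
    EXCLUSIVE = auto()  # exactly one cache (the owner) may hold a dirty copy


@dataclass
class DirectoryEntry:
    state: State = State.UNOWNED
    owner: Optional[int] = None
    sharers: Set[int] = field(default_factory=set)


class Directory:
    """One entry per cache line; every miss is looked up and handled here."""

    def __init__(self) -> None:
        self.entries: Dict[int, DirectoryEntry] = {}

    def lookup(self, line_addr: int) -> DirectoryEntry:
        # Locate (or lazily create) the entry for this cache line.
        return self.entries.setdefault(line_addr, DirectoryEntry())

    def read_miss(self, line_addr: int, requester: int) -> List[Tuple[str, int]]:
        e = self.lookup(line_addr)
        actions: List[Tuple[str, int]] = []
        if e.state is State.EXCLUSIVE:
            if e.owner != requester:
                # The owner must supply the data and keep only a shared copy.
                actions.append(("intervene", e.owner))
                e.sharers.add(e.owner)
            e.owner = None
        e.sharers.add(requester)
        e.state = State.SHARED
        return actions

    def write_miss(self, line_addr: int, requester: int) -> List[Tuple[str, int]]:
        e = self.lookup(line_addr)
        actions: List[Tuple[str, int]] = []
        if e.state is State.EXCLUSIVE and e.owner != requester:
            actions.append(("invalidate_and_fetch", e.owner))
        elif e.state is State.SHARED:
            # Every other sharer must drop its copy before the write proceeds.
            actions = [("invalidate", s) for s in sorted(e.sharers) if s != requester]
        e.state = State.EXCLUSIVE
        e.owner = requester
        e.sharers = set()
        return actions
```

For example, after processors 0 and 1 read a line, a write miss from processor 2 returns invalidations for 0 and 1 and leaves 2 as the owner; a later read miss from processor 3 returns an intervention aimed at 2.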
Advantages of MP nodes
- Amortization of node fixed cost over multiple processors; can use commodity SMPs
- Much communication may be contained within a node i.e. less “remote” communication
- Request combining is possible with some extra hardware in the memory controller
- Possible to share caches e.g., chip multiprocessor nodes (IBM POWER4 and POWER5) or hyper-threaded nodes (Intel Xeon MP)
- Exact benefit depends on sharing pattern
- Widely shared data or nearest-neighbor communication (if properly mapped) may benefit
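To see why mapping matters for nearest-neighbor sharing, here is a small sketch that counts how many neighbor messages stay inside a node. It assumes a 1-D ring of processes, each talking to its left and right neighbor; the function name `local_fraction` and the two mapping names ("block" and "cyclic") are hypothetical and chosen only for this illustration.

```python
def local_fraction(num_procs, procs_per_node, mapping):
    """Fraction of ring nearest-neighbor messages that stay inside a node."""
    num_nodes = num_procs // procs_per_node
    if mapping == "block":
        # Consecutive processes share a node: 0..k-1 on node 0, and so on.
        node_of = lambda p: p // procs_per_node
    elif mapping == "cyclic":
        # Consecutive processes are dealt out round-robin across nodes.
        node_of = lambda p: p % num_nodes
    else:
        raise ValueError("mapping must be 'block' or 'cyclic'")

    local = total = 0
    for p in range(num_procs):
        for q in ((p - 1) % num_procs, (p + 1) % num_procs):
            total += 1
            local += node_of(p) == node_of(q)
    return local / total


print(local_fraction(16, 4, "block"))   # 0.75
print(local_fraction(16, 4, "cyclic"))  # 0.0
print(local_fraction(16, 1, "block"))   # 0.0
```

With 16 processes on four 4-way nodes, a block mapping keeps 75% of the neighbor traffic inside a node, while a cyclic mapping keeps none, the same as with uniprocessor nodes.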
Disadvantages
- Snoopy bus delays all accesses
- The local snoop must complete first
- Only then can a request be sent to the remote home
- Same delay may be incurred at the remote home also depending on the coherence scheme
- This led the SGI Origin 2000 to use dual-processor nodes that are managed entirely by the directory
- Bandwidth at critical points is shared by all processors
- System bus, memory controller, DRAM, router
- Bad communication patterns can make execution time with MP nodes larger than with uniprocessor nodes, even though the average “hop” time may be larger with uniprocessor nodes; e.g., compare two 16P systems, one with four 4-way nodes and one with 16 uniprocessor nodes
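The 16P comparison above can be made concrete with a back-of-the-envelope sketch of the load on a node's shared exit path (memory controller and router). It assumes misses are spread uniformly over all home nodes and counts only outgoing requests; the function name `per_node_remote_load` and the unit miss rate are hypothetical, not measurements.

```python
def per_node_remote_load(num_procs, procs_per_node, misses_per_proc=1.0):
    """Outgoing remote requests per unit time through one node's exit path.

    Assumes (for illustration only) that each miss picks its home node
    uniformly at random, so a fraction (N - 1) / N of misses are remote.
    """
    num_nodes = num_procs // procs_per_node
    remote_fraction = (num_nodes - 1) / num_nodes
    return procs_per_node * misses_per_proc * remote_fraction


four_way = per_node_remote_load(16, 4)  # four 4-way nodes
uni = per_node_remote_load(16, 1)       # sixteen uniprocessor nodes

print(four_way)        # 3.0
print(uni)             # 0.9375
print(four_way / uni)  # about 3.2
```

Under these assumptions each 4-way node pushes roughly 3.2 times as many remote requests through its single exit path as a uniprocessor node does, even though a smaller fraction of its misses are remote (75% versus about 94%).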