
What is needed?
- On every memory operation:
  - Find the state of the cache line (the state is normally kept in the cache).
  - Tell others about the operation if needed (achieved by broadcasting in bus-based or small-scale distributed systems).
  - Other processors must take appropriate actions.
- Also need a mechanism to resolve races between concurrent conflicting accesses (i.e., at least one of them is a write); this essentially needs some central control on a per-cache-line basis.
  - An atomic bus provides an easy way of serializing.
  - A split-transaction bus with a distributed request table works only because every request table can see every transaction.
- Need a table that gets accessed/updated by every cache line access: this is the directory.
  - Every cache line has a separate directory entry.
  - The directory entry stores the state of the line, who the current owner is (if any), the sharers (if any), etc.
  - On a miss, the directory entry must be located and the appropriate coherence action must be taken.
- A popular architecture is a two-level hierarchy: each node is an SMP, kept coherent via a snoopy or directory protocol, and the nodes are kept coherent by a scalable directory protocol (Convex Exemplar: directory-directory; Sequent, Data General, HAL, DASH: snoopy-directory).

Advantages of MP nodes
- Amortization of node fixed cost over multiple processors; can use commodity SMPs.
- Much communication may be contained within a node, i.e., less “remote” communication.
- Request combining by some extra hardware in the memory controller.
- Possible to share caches, e.g., chip multiprocessor nodes (IBM POWER4 and POWER5) or hyper-threaded nodes (Intel Xeon MP).
- Exact benefit depends on the sharing pattern: widely shared data or nearest-neighbor communication (if properly mapped) may be good.

Disadvantages
- Snoopy bus delays all accesses.
  - The local snoop must complete first; only then can a request be sent to the remote home.
  - The same delay may also be incurred at the remote home, depending on the coherence scheme.
  - This led the SGI Origin 2000 to use dual-processor nodes that are managed entirely by the directory.
- Bandwidth at critical points is shared by all processors: system bus, memory controller, DRAM, router.
- Bad communication patterns can actually result in execution time larger than with uniprocessor nodes, even though the average “hop” time with uniprocessor nodes may be larger; e.g., compare two 16P systems, one with four 4-way nodes and one with 16 uniprocessor nodes.
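
To make the directory idea concrete, here is a minimal sketch of a per-line directory entry and the lookup performed on a miss. It is an illustration only. The three states (UNOWNED, SHARED, EXCLUSIVE), the class and method names, and the "downgrade"/"invalidate" action labels are assumptions made for this example and are not taken from any specific machine mentioned above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Set, Tuple


class State(Enum):
    # Hypothetical three-state simplification; real protocols have more states.
    UNOWNED = "unowned"      # no cache holds the line
    SHARED = "shared"        # one or more caches hold a clean copy
    EXCLUSIVE = "exclusive"  # exactly one cache owns the line (possibly dirty)


@dataclass
class DirectoryEntry:
    """One entry per cache line: state, current owner (if any), sharers (if any)."""
    state: State = State.UNOWNED
    owner: Optional[int] = None
    sharers: Set[int] = field(default_factory=set)


class Directory:
    """Table consulted on every miss; returns the coherence actions to send."""

    def __init__(self) -> None:
        self.entries: Dict[int, DirectoryEntry] = {}

    def lookup(self, line_addr: int) -> DirectoryEntry:
        # Locate the entry for this line, creating an UNOWNED one on first use.
        return self.entries.setdefault(line_addr, DirectoryEntry())

    def read_miss(self, line_addr: int, requester: int) -> List[Tuple[str, int]]:
        entry = self.lookup(line_addr)
        actions: List[Tuple[str, int]] = []
        if entry.state is State.EXCLUSIVE:
            if entry.owner != requester:
                # The old owner must supply the data and keep a shared copy.
                actions.append(("downgrade", entry.owner))
                entry.sharers.add(entry.owner)
            entry.owner = None
        entry.sharers.add(requester)
        entry.state = State.SHARED
        return actions

    def write_miss(self, line_addr: int, requester: int) -> List[Tuple[str, int]]:
        entry = self.lookup(line_addr)
        actions: List[Tuple[str, int]] = []
        if entry.state is State.EXCLUSIVE and entry.owner != requester:
            actions.append(("invalidate", entry.owner))
        # Every other sharer must drop its copy before the write proceeds.
        for sharer in sorted(entry.sharers - {requester}):
            actions.append(("invalidate", sharer))
        entry.sharers.clear()
        entry.owner = requester
        entry.state = State.EXCLUSIVE
        return actions


directory = Directory()
directory.read_miss(0x40, requester=0)               # no actions: line was unowned
directory.read_miss(0x40, requester=1)               # no actions: line is shared
acts = directory.write_miss(0x40, requester=2)       # invalidates sharers 0 and 1
```

Because all requests for a line pass through its single directory entry, that entry acts as the per-line serialization point that an atomic bus would otherwise provide. The sketch ignores message latency, transient states, and the placement of entries at a home node.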