Point-to-point synchronization

Normally done in software with flags
    P0: A = 1; flag = 1;
    P1: while (!flag); print A;
Some old machines supported full/empty bits in memory
    Each memory location is augmented with a full/empty bit
    Producer writes the location only if the bit is reset
    Consumer reads the location only if the bit is set, and then resets it
    Much less flexible: supports only one-producer, one-consumer sharing (one-producer, many-consumers sharing is very common); all accesses to a memory location become synchronized (unless the compiler flags some accesses as special)
Possible optimization for shared memory
    Allocate the flag and the data structures it guards (if small) in the same cache line, e.g., flag and A in the example above

Barrier

High-level classification of barriers
    Hardware and software barriers
Will focus on two types of software barriers
    Centralized barrier: every processor polls a single count
    Distributed tree barrier: shows much better scalability
Performance goals of a barrier implementation
    Low latency: after all processors have arrived at the barrier, they should be able to leave quickly
    Low traffic: minimize bus transactions and contention
    Scalability: latency and traffic should grow slowly with the number of processors
    Low storage: barrier state should not be big
    Fairness: preserve some strict order of barrier exit (could be FIFO according to arrival order); a particular processor should not always be the last one to exit
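
The centralized barrier mentioned above (every processor polls a single count) can be sketched as follows. This is only a minimal, single-use illustration, not a definitive implementation. It assumes POSIX threads and C11 atomics, neither of which the notes prescribe. The names NUM_THREADS, arrived, slot, seen, central_barrier and run_barrier_demo are hypothetical and chosen for this example only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4            /* hypothetical thread count for this sketch */

static atomic_int arrived;       /* the single shared count that every thread polls */
static int slot[NUM_THREADS];    /* one write per thread, made before the barrier */
static int seen[NUM_THREADS];    /* how many of those writes each thread sees afterwards */

/* Single-use centralized barrier: announce arrival, then spin on the count. */
static void central_barrier(void)
{
    atomic_fetch_add(&arrived, 1);
    while (atomic_load(&arrived) < NUM_THREADS)
        ;                        /* busy-wait until the last thread has arrived */
}

static void *worker(void *arg)
{
    int id = *(int *)arg;
    int sum = 0;

    slot[id] = 1;                /* work done before the barrier */
    central_barrier();

    /* After the barrier, every thread's pre-barrier write must be visible. */
    for (int i = 0; i < NUM_THREADS; i++)
        sum += slot[i];
    seen[id] = sum;
    return NULL;
}

/* Returns 1 if every thread observed all pre-barrier writes, 0 otherwise. */
int run_barrier_demo(void)
{
    pthread_t tid[NUM_THREADS];
    int ids[NUM_THREADS];

    atomic_store(&arrived, 0);   /* reset state so the demo can be run again */
    for (int i = 0; i < NUM_THREADS; i++) {
        slot[i] = 0;
        seen[i] = 0;
    }

    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        int rc = pthread_create(&tid[i], NULL, worker, &ids[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);

    for (int i = 0; i < NUM_THREADS; i++)
        if (seen[i] != NUM_THREADS)
            return 0;
    return 1;
}
```

Calling run_barrier_demo() from a main function (compiled with -pthread) should return 1. The sketch also shows why this design scores poorly on the traffic and scalability goals: every waiting thread polls the same location, so contention on that one count grows with the number of threads. The count is never reset inside central_barrier, so reusing the barrier across phases would need an extra mechanism such as sense reversal, which is not shown here.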