Replication

How is the shared data locally replicated? This is very important for reducing communication traffic.
In microprocessors, data is replicated in the cache to reduce memory accesses.
In message passing, replication is explicit in the program and happens through receive (a private copy is created).
In shared memory, a load brings the data into the cache hierarchy so that subsequent accesses can be fast. This is totally hidden from the program, and therefore the hardware must provide a layer that keeps track of the most recent copies of the data. This layer is central to the performance of shared memory multiprocessors and is called the cache coherence protocol.

Communication cost

Three major components of the communication architecture affect performance:
	Latency: time to do an operation (e.g., load/store or send/receive)
	Bandwidth: rate of performing an operation
	Overhead or occupancy: how long the communication layer is occupied doing an operation

Latency

Already a big problem for microprocessors.
An even bigger problem for multiprocessors due to remote operations.
Must optimize the application or hardware to hide or lower latency (algorithmic optimizations, prefetching, or overlapping computation with communication).

Bandwidth

How many operations in unit time, e.g., how many bytes transferred per second.
Local bandwidth is provided by heavily banked memory or a faster and wider system bus.
Communication bandwidth has two components:
	1. Node-to-network bandwidth (also called network link bandwidth): measures how fast bytes can be pushed into the router from the communication assist (CA)
	2. Within-network bandwidth: affected by the scalability of the network and the architecture of the switch or router
Linear cost model: Transfer time = T₀ + n/B, where T₀ is the start-up overhead, n is the number of bytes transferred, and B is the bandwidth.
This model is not sufficient, since overlap of computation and communication is not considered; it also does not account for how the transfer is done (pipelined or not).
Better model: Communication time for n bytes = Overhead + CA occupancy + Network latency + Size/BW + Contention
	T(n) = O_V + O_C + L + n/B + T_C
Overhead and occupancy may be functions of n.
Contention depends on the queuing delay at various components along the communication path, e.g., waiting time at the communication assist or controller, waiting time at the router, etc.
Overall communication cost = frequency of communication × (communication time - overlap with useful computation)
Frequency of communication depends on various factors, such as how the program is written or the granularity of communication supported by the underlying hardware.

ILP vs. TLP

Microprocessors enhance the performance of a sequential program by extracting parallelism from an instruction stream (called instruction-level parallelism, ILP).
Multiprocessors enhance the performance of an explicitly parallel program by running multiple threads in parallel (called thread-level parallelism, TLP).
TLP provides parallelism at a much larger granularity than ILP.
In multiprocessors, ILP and TLP work together:
	Within a thread, ILP provides a performance boost
	Across threads, TLP provides speedup over a sequential version of the parallel program
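
To make the communication cost formulas given earlier concrete, here is a minimal Python sketch of the linear model, the more detailed model T(n) = O_V + O_C + L + n/B + T_C, and the overall communication cost. The names `CommParams`, `linear_transfer_time`, `comm_time`, and `overall_comm_cost` are hypothetical and chosen only for illustration. The sketch assumes overhead and occupancy are constants (the text notes they may be functions of n) and that contention is supplied as a single number rather than derived from queuing behavior.

```python
from dataclasses import dataclass


@dataclass
class CommParams:
    """Hypothetical parameter bundle: times in seconds, bandwidth in bytes/second."""
    overhead: float           # O_V: overhead per operation
    occupancy: float          # O_C: communication assist occupancy
    latency: float            # L: network latency
    bandwidth: float          # B: bytes per second
    contention: float = 0.0   # T_C: queuing delay, assumed given as a constant


def linear_transfer_time(n_bytes: float, startup: float, bandwidth: float) -> float:
    """Linear cost model: T0 + n/B."""
    return startup + n_bytes / bandwidth


def comm_time(n_bytes: float, p: CommParams) -> float:
    """Detailed model: T(n) = O_V + O_C + L + n/B + T_C (constants assumed)."""
    return p.overhead + p.occupancy + p.latency + n_bytes / p.bandwidth + p.contention


def overall_comm_cost(frequency: float, time_per_op: float, overlap: float) -> float:
    """frequency x (communication time - overlap with useful computation).

    Assumes overlap does not exceed time_per_op; no clamping is applied.
    """
    return frequency * (time_per_op - overlap)


if __name__ == "__main__":
    # Illustrative values only, not measurements of any real machine.
    params = CommParams(overhead=1e-6, occupancy=5e-7, latency=2e-6,
                        bandwidth=1e9, contention=3e-7)
    t = comm_time(4096, params)
    print("time per 4 KB message (s):", t)
    print("cost of 1000 messages with 1 us overlap each (s):",
          overall_comm_cost(1000, t, 1e-6))
```

Under these assumptions the fixed terms dominate for small messages while n/B dominates for large ones, which is one way to see why coarser communication granularity can lower the overall cost.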