Module 8: "Performance Issues"
  Lecture 14: "Load Balancing and Domain Decomposition"
 

Extra work

  • Extra work in a parallel version of a sequential program may result from
    • Decomposition
    • Assignment techniques
    • Management of the task pool, etc.
  • Speedup is bounded above by
    Sequential work / Max (Useful work + Synchronization + Communication cost + Extra work)
    where the Max is taken over all processors
  • But this is still incomplete
    • We have only considered communication cost from the viewpoint of the algorithm and ignored the architecture completely
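The bound above can be sketched numerically. This is a minimal illustration with made-up per-processor costs (all numbers hypothetical, in arbitrary time units); the function name and the example workload are assumptions, not part of the lecture.

```python
def speedup_bound(sequential_work, per_proc_costs):
    """Upper bound on speedup: sequential work divided by the total time
    of the most heavily loaded processor (useful work + synchronization
    + communication cost + extra work)."""
    worst = max(useful + sync + comm + extra
                for useful, sync, comm, extra in per_proc_costs)
    return sequential_work / worst

# Hypothetical 4-processor run; processor 2 is the bottleneck.
costs = [
    (25, 2, 3, 1),   # (useful, sync, comm, extra)
    (25, 2, 3, 1),
    (30, 4, 5, 2),   # most loaded processor: total 41
    (20, 6, 3, 1),
]
print(speedup_bound(100, costs))  # 100 / 41 ≈ 2.44, well below the ideal 4
```

Note that even though the total useful work is evenly close to 100/4 per processor, the single slowest processor caps the speedup, which is why load balancing matters.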

Data access and communication

  • The memory hierarchy (caches and main memory) plays a significant role in determining communication cost
    • May easily dominate the inherent communication of the algorithm
  • For a uniprocessor, the execution time of a program is useful work time + data access time
    • Useful work time is normally called the busy time or busy cycles
    • Data access time can be reduced either by architectural techniques (e.g., large caches) or by cache-aware algorithm design that exploits spatial and temporal locality
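As a sketch of the locality point, the two traversals below do identical useful work and produce the same answer, but differ in spatial locality. The sizes are hypothetical, and Python lists merely stand in for a row-major array; in a compiled language with a large matrix, the strided version misses in cache far more often.

```python
N = 200
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(m):
    # Visits elements in memory order: consecutive accesses fall in the
    # same cache line, exploiting spatial locality.
    total = 0
    for row in m:
        for x in row:
            total += x
    return total

def sum_col_major(m):
    # Strided access: each step jumps to a different row, so consecutive
    # accesses land in different cache lines.
    total = 0
    n = len(m)
    for j in range(n):
        for i in range(n):
            total += m[i][j]
    return total

# Same useful work, same result; only the data access pattern differs.
assert sum_row_major(matrix) == sum_col_major(matrix)
```

The algorithmic work is identical, so any runtime gap between the two on real hardware is pure data access time, the second component of the uniprocessor execution-time model above.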

Data access

  • In multiprocessors
    • Each processor would like to see the memory system simply as its own local cache plus the main memory
    • In reality it is much more complicated
    • If the system has a centralized memory (e.g., SMPs), the caches of the other processors must still be considered; if the memory is distributed, then some of it is local and some is remote
    • For shared memory, data movement from local or remote memory to cache is transparent while for message passing it is explicit
    • View a multiprocessor as an extended memory hierarchy where the extension includes caches of other processors, remote memory modules and the network topology
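One way to make the extended-hierarchy view concrete is an average-access-time model: each level of the extended hierarchy contributes its latency weighted by how often data is found there. All latencies and hit fractions below are hypothetical round numbers (in cycles), chosen only for illustration; real values depend entirely on the machine.

```python
def avg_access_time(levels):
    """Expected data access time for a hierarchy given as a list of
    (hit_fraction, latency) pairs whose hit fractions sum to 1."""
    assert abs(sum(f for f, _ in levels) - 1.0) < 1e-9
    return sum(f * lat for f, lat in levels)

# Uniprocessor: just local cache and main memory.
uniprocessor = [
    (0.97, 2),      # local cache hit
    (0.03, 100),    # main memory
]

# Extended hierarchy: other processors' caches, local and remote memory.
multiprocessor = [
    (0.95, 2),      # local cache hit
    (0.02, 40),     # found in another processor's cache
    (0.02, 100),    # local memory module
    (0.01, 400),    # remote memory across the network
]

print(avg_access_time(uniprocessor))    # ≈ 4.94 cycles
print(avg_access_time(multiprocessor))  # ≈ 8.7 cycles
```

Even with a tiny fraction of remote accesses, the long network latency dominates the average, which is why the extension of the hierarchy (remote caches, remote memory, the network) can easily dominate the inherent communication of the algorithm.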