|
Extra work
- Extra work in a parallel version of a sequential program may result from
  - Decomposition
  - Assignment techniques
  - Management of the task pool, etc.
- Speedup is bounded above by
  Sequential work / max (Useful work + Synchronization + Communication cost + Extra work), where the max is taken over all processors
- But this bound is still incomplete
  - We have only considered communication cost from the viewpoint of the algorithm and have ignored the architecture completely
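The speedup bound above can be sketched in a few lines of Python. The function name `speedup_bound`, the dictionary keys, and all cost numbers are hypothetical and chosen only for illustration; the costs are in arbitrary time units.

```python
def speedup_bound(sequential_work, per_processor_costs):
    """Upper bound on speedup: sequential work divided by the total
    cost of the slowest (most heavily loaded) processor."""
    slowest = max(
        c["useful"] + c["sync"] + c["comm"] + c["extra"]
        for c in per_processor_costs
    )
    return sequential_work / slowest


# Hypothetical costs for four processors (illustrative values only).
costs = [
    {"useful": 25, "sync": 2, "comm": 3, "extra": 0},  # total 30
    {"useful": 25, "sync": 1, "comm": 4, "extra": 2},  # total 32
    {"useful": 25, "sync": 3, "comm": 2, "extra": 1},  # total 31
    {"useful": 25, "sync": 2, "comm": 5, "extra": 8},  # total 40
]

bound = speedup_bound(100, costs)
print(bound)  # 2.5
```

Even though useful work is perfectly balanced here, the bound is 2.5 rather than 4, because the processor with the largest sum of overheads determines the denominator.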
Data access and communication
- The memory hierarchy (caches and main memory) plays a significant role in determining communication cost
  - Its effects may easily dominate the inherent communication of the algorithm
- For a uniprocessor, the execution time of a program is given by useful work time + data access time
  - Useful work time is normally called the busy time or busy cycles
  - Data access time can be reduced either by architectural techniques (e.g., large caches) or by cache-aware algorithm design that exploits spatial and temporal locality
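The effect of spatial locality on data access time can be illustrated with a toy cache model. This is a minimal sketch and not a model of any real processor: the fully associative LRU cache, the line size of 8 elements, the capacity of 16 lines, and the 64 x 64 row-major array are all assumptions made for the example.

```python
from collections import OrderedDict


def count_misses(addresses, line_size, num_lines):
    """Count misses in a toy fully associative cache with LRU replacement."""
    cache = OrderedDict()  # keys are cache-line numbers, ordered by recency
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


N = 64  # hypothetical N x N array stored in row-major order
row_order = [i * N + j for i in range(N) for j in range(N)]  # walk along rows
col_order = [i * N + j for j in range(N) for i in range(N)]  # walk down columns

row_misses = count_misses(row_order, line_size=8, num_lines=16)
col_misses = count_misses(col_order, line_size=8, num_lines=16)
print(row_misses, col_misses)  # 512 4096
```

Both traversals do the same useful work (4096 accesses), so the busy time is identical. Only the data access time differs: the row-wise walk uses every element of a fetched line before moving on, while the column-wise walk touches 64 distinct lines per column and finds each one evicted by the time it returns.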
Data access
- In multiprocessors
  - Every processor wants to see the memory interface as its own local cache and the main memory
  - In reality it is much more complicated
  - If the system has a centralized memory (e.g., SMPs), there are still the caches of other processors
  - If the memory is distributed, some part of it is local and some is remote
  - For shared memory, data movement from local or remote memory to cache is transparent, while for message passing it is explicit
- View a multiprocessor as an extended memory hierarchy where the extension includes caches of other processors, remote memory modules and the network topology
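The extended-hierarchy view can be sketched as a simple cost model in which each access is served by one level and remote levels also pay for network hops. Every number below (base latencies, per-hop cost, access counts, hop distances) is a hypothetical placeholder, and the level names are made up for this sketch; real machines differ widely.

```python
# Hypothetical base latencies in cycles for each level of the extended hierarchy.
BASE_LATENCY = {
    "local_cache": 2,
    "local_memory": 100,
    "remote_cache": 50,    # another processor's cache, reached over the network
    "remote_memory": 100,  # a remote memory module, reached over the network
}
PER_HOP = 20  # assumed cost of one network hop, one way


def access_latency(level, hops=0):
    """Latency of one access: base cost plus a round trip over `hops` links."""
    return BASE_LATENCY[level] + 2 * hops * PER_HOP


def total_access_cycles(profile):
    """Sum latency over a profile of (level, hops, access count) entries."""
    return sum(count * access_latency(level, hops) for level, hops, count in profile)


# Hypothetical access profile for one processor: 1000 accesses in total.
profile = [
    ("local_cache", 0, 900),
    ("local_memory", 0, 60),
    ("remote_cache", 2, 10),
    ("remote_memory", 3, 30),
]

total = total_access_cycles(profile)
print(total)         # 15700
print(total / 1000)  # average cycles per access
```

In this made-up profile only 4% of the accesses are remote, yet they account for roughly half of the total access cycles, which is why the placement of data relative to the processor and the network topology matter.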
|