Module 8: "Performance Issues"
  Lecture 14: "Load Balancing and Domain Decomposition"
 

Extra work

  • Extra work in a parallel version of a sequential program may result from
    • Decomposition
    • Assignment techniques
    • Management of the task pool, etc.
  • Speedup is bounded above by
    Sequential work / Max (Useful work + Synchronization + Communication cost + Extra work)
    where the Max is taken over all processors
  • But this is still incomplete
    • We have only considered communication cost from the viewpoint of the algorithm and ignored the architecture completely
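The bound above can be sketched numerically. This is a minimal illustration with made-up per-processor costs (all numbers hypothetical, in arbitrary time units); the function name and the example workload are assumptions, not part of the lecture.

```python
def speedup_bound(sequential_work, per_proc_costs):
    """Upper bound on speedup: sequential work divided by the total time
    of the most heavily loaded processor (useful work + synchronization
    + communication cost + extra work)."""
    worst = max(useful + sync + comm + extra
                for useful, sync, comm, extra in per_proc_costs)
    return sequential_work / worst

# Hypothetical 4-processor run; processor 2 is the bottleneck.
costs = [
    (25, 2, 3, 1),   # (useful, sync, comm, extra)
    (25, 2, 3, 1),
    (30, 4, 5, 2),   # most loaded processor: total 41
    (20, 6, 3, 1),
]
print(speedup_bound(100, costs))  # 100 / 41 ≈ 2.44, well below the ideal 4
```

Note that even though the total useful work is evenly close to 100/4 per processor, the single slowest processor caps the speedup, which is why load balancing matters.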

Data access and communication

  • The memory hierarchy (caches and main memory) plays a significant role in determining communication cost
    • May easily dominate the inherent communication of the algorithm
  • For a uniprocessor, the execution time of a program is useful work time + data access time
    • Useful work time is normally called the busy time or busy cycles
    • Data access time can be reduced either by architectural techniques (e.g., large caches) or by cache-aware algorithm design that exploits spatial and temporal locality
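As a sketch of the locality point, the two traversals below do identical useful work and produce the same answer, but differ in spatial locality. The sizes are hypothetical, and Python lists merely stand in for a row-major array; in a compiled language with a large matrix, the strided version misses in cache far more often.

```python
N = 200
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(m):
    # Visits elements in memory order: consecutive accesses fall in the
    # same cache line, exploiting spatial locality.
    total = 0
    for row in m:
        for x in row:
            total += x
    return total

def sum_col_major(m):
    # Strided access: each step jumps to a different row, so consecutive
    # accesses land in different cache lines.
    total = 0
    n = len(m)
    for j in range(n):
        for i in range(n):
            total += m[i][j]
    return total

# Same useful work, same result; only the data access pattern differs.
assert sum_row_major(matrix) == sum_col_major(matrix)
```

The algorithmic work is identical, so any runtime gap between the two on real hardware is pure data access time, the second component of the uniprocessor execution-time model above.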

Data access

  • In multiprocessors
    • Each processor would like to see the memory system simply as its own local cache plus the main memory
    • In reality it is much more complicated
    • If the system has a centralized memory (e.g., SMPs), the caches of the other processors must still be considered; if the memory is distributed, then some of it is local and some is remote
    • For shared memory, data movement from local or remote memory to cache is transparent while for message passing it is explicit
    • View a multiprocessor as an extended memory hierarchy where the extension includes caches of other processors, remote memory modules and the network topology
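One way to make the extended-hierarchy view concrete is an average-access-time model: each level of the extended hierarchy contributes its latency weighted by how often data is found there. All latencies and hit fractions below are hypothetical round numbers (in cycles), chosen only for illustration; real values depend entirely on the machine.

```python
def avg_access_time(levels):
    """Expected data access time for a hierarchy given as a list of
    (hit_fraction, latency) pairs whose hit fractions sum to 1."""
    assert abs(sum(f for f, _ in levels) - 1.0) < 1e-9
    return sum(f * lat for f, lat in levels)

# Uniprocessor: just local cache and main memory.
uniprocessor = [
    (0.97, 2),      # local cache hit
    (0.03, 100),    # main memory
]

# Extended hierarchy: other processors' caches, local and remote memory.
multiprocessor = [
    (0.95, 2),      # local cache hit
    (0.02, 40),     # found in another processor's cache
    (0.02, 100),    # local memory module
    (0.01, 400),    # remote memory across the network
]

print(avg_access_time(uniprocessor))    # ≈ 4.94 cycles
print(avg_access_time(multiprocessor))  # ≈ 8.7 cycles
```

Even with a tiny fraction of remote accesses, the long network latency dominates the average, which is why the extension of the hierarchy (remote caches, remote memory, the network) can easily dominate the inherent communication of the algorithm.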