Module 8: "Performance Issues"
  Lecture 14: "Load Balancing and Domain Decomposition"
 

Agenda

  • Partitioning for performance
  • Data access and communication
  • Summary
  • Goal: understand the basic trade-offs involved in writing a parallel program while keeping an eye on parallel performance
    • Getting good performance out of a multiprocessor is difficult
    • Programmers need to be careful
    • A little carelessness may lead to extremely poor performance

Partitioning for performance

  • Partitioning plays an important role in parallel performance
    • This is where you essentially determine the tasks
  • A good partitioning should achieve
    • Load balance
    • Minimal communication
    • Low overhead to determine and manage task assignment (sometimes called extra work)
  • A well-balanced parallel program automatically has low barrier or point-to-point synchronization time
    • Ideally I want all the threads to arrive at a barrier at the same time

Load balancing

  • Achievable speedup is bounded above by
    • Sequential execution time / Maximum time taken by any processor
    • Thus speedup is maximized when the maximum and minimum times across all processors are close (i.e., we want to minimize the variance of parallel execution time)
    • This requirement translates directly into load balancing
  • What leads to a high variance?
    • Ultimately all processors finish at the same time (e.g., at a barrier)
    • But some do useful work throughout this period while others may spend a significant amount of time waiting at synchronization points
    • This may arise from a bad partitioning
    • There may also be architectural causes of load imbalance beyond the programmer's control, e.g., network congestion or unforeseen cache conflicts that slow down a few threads
  • Effect of decomposition/assignment on load balancing
    • Static partitioning is good when the nature of the computation is predictable and regular
    • Dynamic partitioning normally provides better load balance, but has more runtime overhead for task management; it may also increase communication
    • Fine-grained partitioning (the extreme is one instruction per thread) leads to more overhead, but better load balance
    • Coarse-grained partitioning (e.g., large tasks) may lead to load imbalance if the tasks are not well-balanced