Module 4: Parallel Programming: Shared Memory and Message Passing
  Lecture 8: Optimizing Shared Memory Performance
 


Load Balancing

  • Achievable speedup is bounded above by
    • (Sequential execution time) / (maximum execution time on any processor)
    • Thus speedup is maximized when the maximum and minimum times across all processors are close, i.e., we want to minimize the variance of the parallel execution times
    • This translates directly into a load-balancing problem (a worked sketch follows this list)
  • What leads to a high variance?
    • Ultimately, all processors finish at the same time
    • But some do useful work throughout this period, while others may spend significant time waiting at synchronization points
    • This may arise from a bad partitioning
    • There may be other architectural reasons for load imbalance that are beyond the programmer's control, e.g., network congestion, unforeseen cache conflicts, etc., which slow down a few threads
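
  A small illustrative sketch (not from the lecture; the numbers and the four-processor setup are made up): given a sequential time and the measured busy time of each processor, it computes the speedup bound (sequential time / maximum per-processor time) and the variance of the per-processor times, showing how imbalance alone caps speedup.

  #include <stdio.h>

  #define P 4   /* number of processors in this made-up example */

  int main(void) {
      double t_seq = 100.0;                        /* sequential execution time */
      double t_par[P] = {30.0, 25.0, 25.0, 20.0};  /* per-processor busy times  */

      double max_t = t_par[0], mean = 0.0;
      for (int i = 0; i < P; i++) {
          if (t_par[i] > max_t) max_t = t_par[i];
          mean += t_par[i];
      }
      mean /= P;

      double var = 0.0;
      for (int i = 0; i < P; i++)
          var += (t_par[i] - mean) * (t_par[i] - mean);
      var /= P;

      /* Speedup is bounded by t_seq / max_t: here 100/30 = 3.33 on 4
         processors, even though the total work (100 units) would allow a
         speedup of 4 under a perfectly balanced partition (25 each). */
      printf("speedup bound = %.2f, variance = %.2f\n", t_seq / max_t, var);
      return 0;
  }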

Dynamic Task Queues

  • Introduced in the last lecture
  • Normally implemented as part of the parallel program
  • Two possible designs
    • Centralized task queue: a single queue of tasks shared by all processors; may suffer heavy contention because insertion into and deletion from the queue must be critical sections (both designs are sketched after this list)
    • Distributed task queues: one queue per processor
  • Issue with distributed task queues
    • When a processor's own queue becomes empty, what does it do? It steals a task from another processor's queue (task stealing)
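
  A minimal sketch of both designs, assuming POSIX threads; the task_t type, the fixed-capacity array, and the function names are illustrative, not from the lecture. Every push/pop on the centralized queue takes the single lock, which is exactly the contention point noted above; the distributed variant gives each processor its own queue and only touches another queue's lock when stealing.

  #include <pthread.h>
  #include <stdbool.h>

  #define MAX_TASKS 1024

  typedef struct { void (*run)(void *); void *arg; } task_t;  /* illustrative task type */

  /* Centralized design: one queue shared by all processors.  Insertion and
     deletion are critical sections, so every worker serializes on this
     single mutex. */
  typedef struct {
      task_t          tasks[MAX_TASKS];
      int             head, tail;
      pthread_mutex_t lock;
  } task_queue_t;

  void queue_init(task_queue_t *q) {
      q->head = q->tail = 0;
      pthread_mutex_init(&q->lock, NULL);
  }

  bool queue_push(task_queue_t *q, task_t t) {
      pthread_mutex_lock(&q->lock);               /* critical section: insert */
      bool ok = (q->tail - q->head) < MAX_TASKS;
      if (ok) q->tasks[q->tail++ % MAX_TASKS] = t;
      pthread_mutex_unlock(&q->lock);
      return ok;
  }

  bool queue_pop(task_queue_t *q, task_t *out) {
      pthread_mutex_lock(&q->lock);               /* critical section: delete */
      bool ok = q->head < q->tail;
      if (ok) *out = q->tasks[q->head++ % MAX_TASKS];
      pthread_mutex_unlock(&q->lock);
      return ok;
  }

  /* Distributed design: an array with one task_queue_t per processor.  A
     worker pops from its own queue; when that queue is empty it visits the
     other queues and takes a task from one of them (task stealing). */
  bool try_steal(task_queue_t *queues, int nproc, int self, task_t *out) {
      for (int v = 0; v < nproc; v++)
          if (v != self && queue_pop(&queues[v], out))
              return true;
      return false;
  }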