Module 7: "Parallel Programming"
  Lecture 12: "Steps in Writing a Parallel Program"
 

Static assignment

  • Given a decomposition it is possible to assign tasks statically
    • For example, a computation on an array of size N can be decomposed statically by assigning a fixed range of indices to each process: with k processes (assuming k divides N), P0 operates on indices 0 to (N/k)-1, P1 on N/k to (2N/k)-1, …, and Pk-1 on (k-1)N/k to N-1
    • For regular computations this works great: simple and low-overhead
  • What if the nature of the computation depends on the index?
    • For certain index ranges you do some heavy-weight computation while for others you do something simple
    • Is there a problem?

Dynamic assignment

  • Static assignment may lead to load imbalance depending on how irregular the application is
  • Dynamic decomposition/assignment solves this issue by allowing an idle process to pick up any available task as soon as it finishes its previous one
    • Normally in this case you decompose the program in such a way that the number of available tasks is larger than the number of processes
    • Same example: divide the array into portions each with 10 indices; so you have N/10 tasks
    • An idle process grabs the next available task
    • Provides better load balance since longer tasks can execute concurrently with the smaller ones
  • Dynamic assignment comes with its own overhead
    • Now you need to maintain a shared count of the number of available tasks
    • The update of this variable must be protected by a lock
    • Need to be careful so that this lock contention does not outweigh the benefits of dynamic decomposition
  • In more complicated applications a task may not operate on just an index range, but could manipulate a subtree or some other complex data structure
    • Normally a dynamic task queue is maintained, where each task entry is typically a pointer to the data it operates on
    • The task queue gets populated as new tasks are discovered

Decomposition types

  • Decomposition by data
    • The most commonly found decomposition technique
    • The data set is partitioned into several subsets and each subset is assigned to a process
    • The type of computation may or may not be identical on each subset
    • Very easy to program and manage
  • Computational decomposition
    • Not so popular: tricky to program and manage
    • All processes operate on the same data, but may each carry out a different kind of computation
    • More common in systolic arrays, pipelined graphics processing units (GPUs), etc.

Orchestration

  • Involves structuring communication and synchronization among processes, organizing data structures to improve locality, and scheduling tasks
    • This step normally depends on the programming model and the underlying architecture
  • Goal is to
    • Reduce communication and synchronization costs
    • Maximize locality of data reference
    • Schedule tasks to maximize concurrency: do not schedule dependent tasks in parallel
    • Reduce overhead of parallelization and concurrency management (e.g., management of the task queue, overhead of initiating a task etc.)