Domain decomposition
- Applications normally show a local bias in data usage
- Communication is short-range, e.g., nearest neighbor
- Even when it is long-range, it falls off with distance
- View the dataset of an application as the domain of the problem, e.g., the 2-D grid in the equation solver
- In most applications, a point in this domain depends mainly on points that are close to it
- Partitioning can exploit this property by assigning contiguous pieces of data to each process
- Exact shape of decomposed domain depends on the application and load balancing requirements
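
As a rough illustration of assigning contiguous pieces of data to each process, the sketch below splits the rows of a grid into contiguous strips. The function name `block_row_range` and its parameters are hypothetical, chosen only for this example, and the policy of giving leftover rows to the lowest-numbered processes is an assumption, not something the notes prescribe.

```python
def block_row_range(rank, num_procs, n_rows):
    """Return the half-open row range [start, end) owned by process `rank`.

    Assumption: when n_rows is not divisible by num_procs, the first
    (n_rows % num_procs) processes each take one extra row.
    """
    base, extra = divmod(n_rows, num_procs)
    # Every earlier process contributes `base` rows, plus one extra row
    # for each earlier process that received a leftover row.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


# Example: 10 rows split across 4 processes.
ranges = [block_row_range(r, 4, 10) for r in range(4)]
```

Each process touches only its own strip plus the rows just outside `start` and `end`, which is where the nearest-neighbor communication comes from.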
Comm-to-comp ratio
- There can be many different domain decompositions for a particular problem
- For the grid solver, we may use a square block decomposition, a block row decomposition, or a cyclic row decomposition
- How do we determine which one is good? Compare their communication-to-computation ratios
- Assume P processors and an N×N grid for the grid solver
- For square block decomposition
- Size of each block: N/√P by N/√P
- Communication (perimeter): 4N/√P
- Computation (area): N²/P
- Comm-to-comp ratio: 4√P/N
[Figure: square block decomposition for P=16]
- For block row decomposition
- Each strip has N/P rows
- Communication (boundary rows): 2N
- Computation (area): N²/P (same as square block)
- Comm-to-comp ratio: 2P/N
- For cyclic row decomposition
- Each processor gets N/P isolated rows
- Communication: 2N²/P (each of the N/P rows needs both neighboring rows, i.e., 2N elements per row)
- Computation: N²/P
- Comm-to-comp ratio: 2
- Normally N is much larger than P
- Asymptotically, square block yields the lowest comm-to-comp ratio: 4√P/N grows as √P, whereas 2P/N for block row grows as P
- The idea is to measure the volume of inherent communication per unit of computation
- In most cases it is beneficial to pick the decomposition with the lowest comm-to-comp ratio
- But this depends on the application structure, i.e., picking the lowest comm-to-comp ratio may introduce other problems
- Normally this ratio gives a rough estimate of the application's average communication bandwidth requirement, i.e., how frequent communication is
- But it does not tell you the nature of the communication, i.e., whether it is bursty or uniform
- For the grid solver, communication happens only at the start of each iteration; it is not uniformly distributed over the computation
- Thus the worst-case bandwidth requirement may exceed the average suggested by the comm-to-comp ratio
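
The comparison of the three grid-solver decompositions can be checked numerically. The sketch below is a minimal illustration that assumes P is a perfect square and that N is divisible by both P and √P; the function names are hypothetical and chosen only for this example.

```python
import math


def square_block_ratio(n, p):
    """Comm-to-comp ratio for square blocks of side n / sqrt(p)."""
    side = n / math.sqrt(p)
    comm = 4 * side        # perimeter of one block
    comp = side * side     # area of one block, n^2 / p
    return comm / comp


def block_row_ratio(n, p):
    """Comm-to-comp ratio for contiguous strips of n / p rows."""
    comm = 2 * n           # one boundary row above and one below the strip
    comp = n * n / p
    return comm / comp


def cyclic_row_ratio(n, p):
    """Comm-to-comp ratio when each processor owns n / p isolated rows."""
    rows = n / p
    comm = 2 * n * rows    # both neighbors of every owned row are remote
    comp = n * n / p
    return comm / comp


# Example: N = 1024, P = 16.
n, p = 1024, 16
ratios = {
    "square block": square_block_ratio(n, p),  # 4*sqrt(P)/N
    "block row": block_row_ratio(n, p),        # 2*P/N
    "cyclic row": cyclic_row_ratio(n, p),      # 2
}
```

For these values the ratios come out as 0.015625, 0.03125, and 2.0, so square block has the lowest ratio, block row is twice as high, and cyclic row stays at 2 regardless of N. These are averages per iteration and say nothing about how bursty the communication is.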