2D to 4D Conversion
- Essentially, you need to change the way the matrix is laid out in memory
- The matrix A needs to be allocated in such a way that the elements falling within a partition are contiguous
- The first two dimensions of the new 4D matrix are the block row and block column indices, i.e., for the partition assigned to processor P6 these are 1 and 2, respectively (assuming 16 processors arranged as a 4×4 grid and numbered row-major from P0)
- The next two dimensions hold the data elements within that partition
- Thus the 4D array may be declared as float B[√P][√P][N/√P][N/√P]
- The element B[3][2][5][10] is the element at row 5, column 10 (0-based indices) of the partition assigned to P14
- Now all elements within a partition have contiguous addresses
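A minimal C sketch of the 4D layout follows. It uses hypothetical sizes N = 64 and P = 16, so √P = 4 and each partition is 16×16. It also assumes processors are numbered row-major over the 4×4 grid. The names owner_of_block, elem_4d, span_2d and span_4d are illustrative only. The code maps a global index (i, j) of A to its place in B. It then compares how far apart the first and last elements of one partition lie in the 2D and 4D layouts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes: N = 64, P = 16 processors, sqrt(P) = 4. */
#define N     64
#define SQRTP 4
#define BS    (N / SQRTP)   /* partition size N / sqrt(P) = 16 */

static float A[N][N];                   /* original 2D layout */
static float B[SQRTP][SQRTP][BS][BS];   /* 4D: [block row][block col][row][col] */

/* Processor that owns block (bi, bj), assuming row-major numbering
 * of the sqrt(P) x sqrt(P) processor grid. */
int owner_of_block(int bi, int bj) {
    return bi * SQRTP + bj;
}

/* Element of B that holds global element (i, j) of the N x N matrix. */
float *elem_4d(int i, int j) {
    assert(0 <= i && i < N && 0 <= j && j < N);
    return &B[i / BS][j / BS][i % BS][j % BS];
}

/* Distance in floats between two addresses inside the same array object,
 * computed through char pointers so the subtraction is well defined. */
static ptrdiff_t float_dist(const float *from, const float *to) {
    return ((const char *)to - (const char *)from) / (ptrdiff_t)sizeof(float);
}

/* Floats from the first to the last element of block (bi, bj) in A:
 * consecutive rows of the block are N floats apart. */
ptrdiff_t span_2d(int bi, int bj) {
    return float_dist(&A[bi * BS][bj * BS],
                      &A[bi * BS + BS - 1][bj * BS + BS - 1]);
}

/* Same measurement in B: the block is one contiguous run of BS*BS floats. */
ptrdiff_t span_4d(int bi, int bj) {
    return float_dist(&B[bi][bj][0][0], &B[bi][bj][BS - 1][BS - 1]);
}
```

For example, elem_4d(53, 42) returns &B[3][2][5][10], which lies in P14's partition. In the 4D layout, span_4d(3, 2) is BS*BS - 1 = 255 floats, so the partition is contiguous. In the 2D layout, span_2d(3, 2) is 975 floats, because the partition's rows are interleaved with the rows of other partitions.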
Transfer Granularity
- How much data do you transfer in one communication?
- For message passing, it is explicit in the program: the programmer chooses what goes into each message
- For shared memory, it is really under the control of the cache coherence protocol: transactions are defined for a fixed size, normally the block (line) size of the outermost level of the cache hierarchy
- In shared memory, you therefore have to be careful about how data is laid out
- Since the minimum transfer size is a cache line, you may end up transferring extra data: e.g., in the grid solver with a square block decomposition, you need only one element per row from the left and right neighbors' partitions, but you must transfer the whole cache line containing it; there is no good solution to this
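The sketch below shows how much extra data the cache-line granularity can cost. It counts the distinct cache lines touched when reading a neighbor's boundary row and when reading its boundary column. It assumes 64-byte lines and 4-byte floats. It also assumes a 16×16 partition stored row-major, as in the 4D layout, that starts on a line boundary. lines_touched, row_boundary_lines and col_boundary_lines are hypothetical helpers, not part of any library.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 64   /* assumed cache line (block) size in bytes */
#define ELEM_BYTES 4    /* assumed sizeof(float) */
#define BS         16   /* hypothetical partition size: BS x BS floats */

/* With these sizes each partition row fills whole cache lines exactly,
 * so every row of the partition starts on a line boundary. */
static_assert((BS * ELEM_BYTES) % LINE_BYTES == 0,
              "partition rows must be a whole number of cache lines");

/* Count distinct cache lines touched when reading `count` elements,
 * the k-th starting at byte offset start + k * stride. Addresses never
 * decrease (stride is unsigned), so a line is new exactly when it
 * differs from the last line counted. */
size_t lines_touched(size_t start, size_t stride, size_t count) {
    size_t lines = 0;
    size_t last = (size_t)-1;   /* sentinel: no line counted yet */
    for (size_t k = 0; k < count; k++) {
        size_t lo = (start + k * stride) / LINE_BYTES;
        size_t hi = (start + k * stride + ELEM_BYTES - 1) / LINE_BYTES;
        for (size_t l = lo; l <= hi; l++) {
            if (l != last) {
                lines++;
                last = l;
            }
        }
    }
    return lines;
}

/* Boundary row of a neighbor's partition: BS contiguous floats. */
size_t row_boundary_lines(void) {
    return lines_touched(0, ELEM_BYTES, BS);
}

/* Boundary column (last column) of a neighbor's partition:
 * one float per row, with rows BS * ELEM_BYTES bytes apart. */
size_t col_boundary_lines(void) {
    return lines_touched((BS - 1) * ELEM_BYTES, BS * ELEM_BYTES, BS);
}
```

Under these assumptions, the boundary row costs 1 line, which is 64 bytes and all useful. The boundary column costs 16 lines, so 1024 bytes are transferred for 64 useful bytes. This is the left/right-neighbor overhead described above.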