Module 8: "Performance Issues"
  Lecture 15: "Locality and Communication Optimizations"
 

Spatial locality

  • Consider a square block decomposition of the grid solver and a C-like row-major layout, i.e., A[i][j] and A[i][j+1] occupy contiguous memory locations

    A given page is local to one processor but remote to all others; the same applies to cache lines that straddle a partition boundary. Ideally, we want all pages within a partition to be local to a single processor. The standard trick is to convert the 2D array to a 4D array.
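As a minimal sketch of the problem (assuming, for illustration, N = 1024, 4-byte floats, and 4 KB pages; none of these values are fixed by the notes): in a row-major layout, horizontally adjacent elements are 4 bytes apart, but vertically adjacent elements are a full row apart, so the rows of a square block land on many different pages.

```c
#include <stddef.h>
#include <assert.h>

#define N 1024  /* grid dimension (assumed for illustration) */

/* Byte offset of A[i][j] in a row-major N x N float array. */
size_t offset(int i, int j) {
    return ((size_t)i * N + j) * sizeof(float);
}

/* Distance in bytes between vertically adjacent elements A[i][j]
   and A[i+1][j]: a full row of N floats. With N = 1024 this is
   4096 bytes, i.e., every row of a block starts on a new 4 KB page. */
size_t row_stride(void) {
    return N * sizeof(float);
}
```

So a square block of N/√P rows touches N/√P distinct pages, and those pages also contain elements belonging to other processors' partitions.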

2D to 4D conversion

  • Essentially you need to change the way memory is allocated
    • The matrix A needs to be allocated in such a way that the elements falling within a partition are contiguous
    • The first two dimensions of the new 4D matrix are the block row and column indices, e.g., for the partition assigned to processor P6 these are 1 and 2 respectively (assuming 16 processors)
    • The next two dimensions hold the data elements within that partition
    • Thus the 4D array may be declared as float B[√P][√P][N/√P][N/√P]
    • The element B[3][2][5][10] corresponds to the element in row 5, column 10 (zero-based) of the partition owned by P14
    • Now all elements within a partition have contiguous addresses
  • Clearly, this naming scheme requires some support from the hardware and the OS
    • Need to make sure that the accessed virtual address gets translated to the correct physical address
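The index arithmetic above can be sketched as follows (a minimal illustration assuming P = 16 processors and a small N = 16 divisible by √P; the helper names `owner` and `to4d` are mine, not from the notes):

```c
#include <assert.h>

#define SQRT_P 4            /* assuming P = 16 processors */
#define N 16                /* grid dimension, assumed divisible by SQRT_P */
#define BLK (N / SQRT_P)    /* edge length of one square partition */

/* The 4D array from the notes: float B[sqrt(P)][sqrt(P)][N/sqrt(P)][N/sqrt(P)].
   All BLK*BLK elements of one partition are now contiguous in memory. */
float B[SQRT_P][SQRT_P][BLK][BLK];

/* Processor that owns block (bi, bj), with row-major processor numbering:
   owner(1, 2) is P6 and owner(3, 2) is P14, matching the examples above. */
int owner(int bi, int bj) {
    return bi * SQRT_P + bj;
}

/* Map a global index (i, j) of the original 2D array to the 4D index
   [bi][bj][li][lj] of B. */
void to4d(int i, int j, int *bi, int *bj, int *li, int *lj) {
    *bi = i / BLK; *bj = j / BLK;
    *li = i % BLK; *lj = j % BLK;
}
```

Because the last two dimensions are innermost, a partition occupies one contiguous region of BLK×BLK floats, so its pages can all be placed local to the owning processor.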

Transfer granularity

  • How much data do you transfer in one communication?
    • For message passing it is explicit in the program
    • For shared memory this is really under the control of the cache coherence protocol: transactions are defined for a fixed size (normally the block size of the outermost level of the cache hierarchy)
  • In shared memory you have to be careful
    • Since the minimum transfer size is a cache line, you may end up transferring extra data. For example, in the grid solver with a square block decomposition, each row of the left or right neighbor's boundary column contributes only one needed element, yet the whole cache line holding it must be transferred: there is no good solution
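The waste is easy to quantify (a sketch assuming 64-byte cache lines and 4-byte floats; both values are common but not stated in the notes):

```c
#include <assert.h>

#define LINE_BYTES 64  /* assumed cache line size */

/* Elements pulled in per line fetch for a given element size. */
int elems_per_line(int elem_bytes) {
    return LINE_BYTES / elem_bytes;
}

/* Lines transferred to read one boundary column of a square block
   with `rows` rows: one line per element, since vertically adjacent
   elements of a column lie on different lines in a row-major block. */
int lines_for_boundary_column(int rows) {
    return rows;
}
```

With these numbers, each boundary access moves 16 floats to use 1, so the left/right boundary traffic is 16x the useful data. (Top/bottom boundaries do not suffer: a neighbor's boundary row is contiguous, so every fetched element is used.)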