Module 5: Performance Issues in Shared Memory and Introduction to Coherence
  Lecture 9: Performance Issues in Shared Memory
 


2D to 4D Conversion

  • Essentially you need to change the way memory is allocated
    • The matrix A needs to be allocated in such a way that the elements falling within a partition are contiguous
    • The first two dimensions of the new 4D matrix are the block row and block column indices, i.e., for the partition assigned to processor P6 these are 1 and 2 respectively (assuming 16 processors)
    • The next two dimensions hold the data elements within that partition
    • Thus the 4D array may be declared as float B[√P][√P][N/√P][N/√P]
    • The element B[3][2][5][10] corresponds to the element in the 10th column and 5th row of the partition of P14
    • Now all elements within a partition have contiguous addresses (a C sketch of this layout follows the list)
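
The sketch below is a minimal C illustration of this layout, assuming 16 processors in a 4 x 4 grid and N divisible by √P; the names RB, n, and get are not from the lecture and stand in for √P, N/√P, and a 2D-to-4D index translation respectively.

    #include <stdio.h>

    #define N  16        /* full matrix is N x N                              */
    #define RB 4         /* sqrt(P): a 4 x 4 grid of partitions (16 processors) */
    #define n  (N / RB)  /* each partition holds n x n elements               */

    /* 4D layout: [block row][block col][row within block][col within block] */
    static float B[RB][RB][n][n];

    /* Translate a logical 2D index (i, j) of the original matrix A
       into the 4D layout. */
    static float get(int i, int j)
    {
        return B[i / n][j / n][i % n][j % n];
    }

    int main(void)
    {
        /* The partition of processor P6 (block row 1, block column 2)
           is one contiguous run of n * n floats. */
        printf("each partition spans %zu contiguous bytes\n",
               (size_t)n * n * sizeof(float));
        printf("A[7][10] = %f (it lives in block (1,2), i.e. the partition of P6)\n",
               get(7, 10));
        return 0;
    }

With this declaration a partition is a single n x n run of memory, so a processor sweeping its own block walks consecutive addresses instead of striding across rows of the full matrix.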

Transfer Granularity

  • How much data do you transfer in one communication?
    • For message passing it is explicit in the program
    • For shared memory it is really under the control of the cache coherence protocol: transactions are defined for a fixed size, normally the block size of the outermost level of the cache hierarchy
  • In shared memory you have to be careful
    • Since the minimum transfer size is a cache line, you may end up transferring extra data, e.g., the boundary column elements of the left and right neighbors in a square block decomposition of the grid solver (you need only one element from each line, but must transfer the whole cache line): there is no good solution (see the sketch below)
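
The following is a minimal C sketch of the access pattern only, not of the full solver update; the names n, mine, right_nbr, and sweep_right_boundary are illustrative, and a 64-byte cache line is assumed.

    #define n 256                   /* per-partition dimension (assumed)       */

    static float mine[n][n];        /* this processor's square partition       */
    static float right_nbr[n][n];   /* stands in for the right neighbor's
                                       partition, which lives in remote memory */

    void sweep_right_boundary(void)
    {
        for (int i = 0; i < n; i++) {
            /* Only right_nbr[i][0] is needed here, but the coherence
               protocol moves the whole cache line containing it: with
               64-byte lines, 16 floats are transferred per float used. */
            mine[i][n - 1] += 0.25f * right_nbr[i][0];
        }
    }

Because the neighbor's boundary column runs down the rows of its row-major partition, each of its n elements sits in a different cache line; a layout change that makes the left/right boundary columns contiguous would scatter the top/bottom boundary rows instead, which is why the notes say there is no good solution.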