Module 13: INTRODUCTION TO COMPILERS FOR HIGH PERFORMANCE COMPUTERS
  Lecture 25: Supercomputing Applications
 


The load and store of C[i,1:n] can be floated (hoisted) out of the k loop, because the same row of C is updated on every k iteration:

for i = 1 to n do
    setvl  r1                  ;set vector length to n
    loadv  v4, (r4)            ;load C[i,1:n]
    for k = 1 to n do
        loadf  f2, (r2)        ;load A[i,k]
        loadv  v3, (r3)        ;load B[k,1:n]
        mpyvs  v3, v3, f2      ;A[i,k]*B[k,1:n]
        addvv  v4, v4, v3      ;update C[i,1:n]
        addi   r2, r2, #4      ;point to A[i,k+1]
        add    r3, r3, r13     ;point to B[k+1,1]
    endfor
    storev v4, (r4)            ;store C[i,1:n]
    add    r4, r4, r13         ;point to C[i+1,1]
endfor
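
To make the effect of floating the load and store concrete, here is a minimal Python sketch of the same loop nest. It is an illustration only, not the lecture's code: the function name matmul_row_hoisted is hypothetical, plain lists stand in for vector registers, and the local names v3, v4, and f2 are chosen only to mirror the registers in the listing. It assumes square n-by-n matrices.

```python
def matmul_row_hoisted(a, b, c):
    """Compute c = c + a*b for square matrices stored as lists of lists.

    Row i of c is loaded once before the k loop and stored once after it,
    mirroring the loadv/storev on v4 in the vector code.
    """
    n = len(a)
    for i in range(n):
        v4 = list(c[i])                            # loadv v4: load C[i,1:n] once per i
        for k in range(n):
            f2 = a[i][k]                           # loadf f2: scalar A[i,k]
            v3 = [f2 * x for x in b[k]]            # loadv v3 + mpyvs: A[i,k]*B[k,1:n]
            v4 = [s + t for s, t in zip(v4, v3)]   # addvv: update the running row of C
        c[i] = v4                                  # storev v4: store C[i,1:n] once per i
    return c
```

With the load and store outside the k loop, each row of C is read and written once per i iteration rather than once per (i, k) pair.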

Strip-mining

If n is larger than the vector register length, the code above will not work.
To handle the general case, the vector must be divided into strips of size m, where m is no larger than a vector register. Assuming m = 64:

for i := 1 to n do
    for k := 1 to n do
        for js := 0 to n-1 by 64 do
            vl := min(n-js, 64)
            c[i,js+1:js+vl] = c[i,js+1:js+vl] + a[i,k] ∗ b[k,js+1:js+vl]
        endfor
    endfor
endfor
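
The strip-mined loop can be sketched in runnable form as follows. This is a hedged illustration, not the lecture's implementation: the function names matmul_strip_mined and strip_lengths are hypothetical, list slices stand in for vector operations of length vl, indices are 0-based, and the matrices are assumed to be square lists of lists.

```python
def strip_lengths(n, m=64):
    """Return the vector length vl used for each strip of an n-element row."""
    return [min(n - js, m) for js in range(0, n, m)]


def matmul_strip_mined(a, b, c, m=64):
    """Compute c = c + a*b, processing each row in strips of at most m elements."""
    n = len(a)
    for i in range(n):
        for k in range(n):
            for js in range(0, n, m):          # js := 0 to n-1 by m
                vl = min(n - js, m)            # the last strip may be shorter than m
                strip = slice(js, js + vl)     # 0-based form of js+1 : js+vl
                c[i][strip] = [
                    cij + a[i][k] * bkj
                    for cij, bkj in zip(c[i][strip], b[k][strip])
                ]
    return c
```

For example, strip_lengths(150) gives [64, 64, 22]: two full strips followed by a remainder strip of 150 - 128 = 22 elements, which is why vl must be recomputed for every strip.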

Shared Memory Model

  • Discover iterations which can be executed in parallel
  • The master processor executes the task up to the parallel loop
    • It forks a task for each processor
    • All tasks synchronize at the end of the loop

One way to parallelize matrix multiplication is:

for i := 1 to n do
    for k := 1 to n do
        doall j := 1 to n do
            c[i,j] = c[i,j] + a[i,k] ∗ b[k,j]
        endall
    endfor
endfor
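
The fork and synchronize structure of the doall loop can be sketched with Python's standard concurrent.futures module. This is only an illustration under stated assumptions: the function name matmul_doall and the workers parameter are hypothetical, threads stand in for processors, and the j range is split into one contiguous block per worker. In standard CPython the global interpreter lock prevents a real speedup for this kind of loop, so the sketch shows the control structure rather than the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_doall(a, b, c, workers=4):
    """Compute c = c + a*b with the innermost j loop run as a doall."""
    n = len(a)
    chunk = max(1, (n + workers - 1) // workers)   # block size per worker
    ranges = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(n):                 # the master runs the outer loops serially
            for k in range(n):
                def body(bounds, i=i, k=k):
                    lo, hi = bounds
                    for j in range(lo, hi):
                        # Each j belongs to exactly one task, so no two
                        # tasks ever write the same c[i][j].
                        c[i][j] += a[i][k] * b[k][j]

                # Fork one task per block of j, then wait for all of them:
                # consuming the map result is the synchronization at endall.
                list(pool.map(body, ranges))
    return c
```

Because the doall is the innermost loop, this version forks and synchronizes n*n times, once per (i, k) pair.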