|
The load and store of C[i,1:n] can be hoisted out of the k loop:
for i = 1 to n do
    setvl r1               ; set vector length to n
    loadv v4, (r4)         ; load C[i,1:n]
    for k = 1 to n do
        loadf f2, (r2)     ; load A[i,k]
        loadv v3, (r3)     ; load B[k,1:n]
        mpyvs v3, v3, f2   ; A[i,k]*B[k,1:n]
        addvv v4, v4, v3   ; update C[i,1:n]
        addi r2, r2, #4    ; point to A[i,k+1]
        add r3, r3, r13    ; point to B[k+1,1]
    endfor
    storev v4, (r4)        ; store C[i,1:n]
    add r4, r4, r13        ; point to C[i+1,1]
endfor |
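As a rough scalar model of the loop above, the following Python sketch keeps row C[i,1:n] in a local list (standing in for vector register v4) for the whole k loop and writes it back once. The function name `matmul_row_hoisted` and the list-of-lists matrix layout are assumptions made for illustration; they are not part of the original code.

```python
def matmul_row_hoisted(a, b, c):
    """Compute c += a * b for n x n lists of lists, one row of c at a time."""
    n = len(a)
    for i in range(n):
        acc = list(c[i])          # load C[i,1:n] once, before the k loop
        for k in range(n):
            s = a[i][k]           # scalar A[i,k]
            row_b = b[k]          # vector B[k,1:n]
            # vector-scalar multiply followed by a vector add
            acc = [acc[j] + s * row_b[j] for j in range(n)]
        c[i] = acc                # store C[i,1:n] once, after the k loop
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_row_hoisted(a, b, [[0, 0], [0, 0]]))  # [[19, 22], [43, 50]]
```

Each row of C is read and written exactly once, while the inner loop touches only A and B.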
Strip-mining
If n is larger than the vector register length, the code above will not work.
To handle the general case, the vector must be divided into strips of size m, where m is no larger than the vector register length. Assuming m = 64:
for i := 1 to n do
    for k := 1 to n do
        for js := 0 to n-1 by 64 do
            vl := min(n-js, 64)
            c[i,js+1:js+vl] = c[i,js+1:js+vl] +
                              a[i,k] ∗ b[k,js+1:js+vl]
        endfor
    endfor
endfor |
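The strip-mined loop nest can be sketched in Python as follows. This is a minimal illustration, not a vectorized implementation: the innermost `for j` loop stands in for a single vector operation of length `vl`. The names `MAX_VL`, `strip_lengths`, and `matmul_strip_mined` are hypothetical.

```python
MAX_VL = 64  # assumed maximum vector length (m in the text)


def strip_lengths(n, max_vl=MAX_VL):
    """Return the vector length used for each strip of an n-element row."""
    return [min(n - js, max_vl) for js in range(0, n, max_vl)]


def matmul_strip_mined(a, b, c, max_vl=MAX_VL):
    """Compute c += a * b, processing each row in strips of at most max_vl."""
    n = len(a)
    for i in range(n):
        for k in range(n):
            for js in range(0, n, max_vl):
                vl = min(n - js, max_vl)       # the last strip may be shorter
                for j in range(js, js + vl):   # models one vector op of length vl
                    c[i][j] += a[i][k] * b[k][j]
    return c


print(strip_lengths(150))  # [64, 64, 22]
```

For n = 150, the row is covered by two full strips of 64 elements and a final strip of 22.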
Shared Memory Model
- Discover iterations which can be executed in parallel
- The master processor executes the task up to the parallel loop
- Fork a task for each processor
- Synchronize at the end of the loop
One way to parallelize matrix multiplication is:
for i := 1 to n do
    for k := 1 to n do
        doall j := 1 to n do
            c[i,j] = c[i,j] + a[i,k] ∗ b[k,j]
        endall
    endfor
endfor |
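The fork and synchronize steps of the doall loop can be sketched with Python's standard `concurrent.futures` thread pool. This is only a model of the control structure, assuming one task per value of j; the function name `matmul_doall` and the `workers` parameter are hypothetical. Each task writes a distinct element c[i][j], so the iterations are independent.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_doall(a, b, c, workers=4):
    """Compute c += a * b, running the j loop as parallel tasks."""
    n = len(a)

    def update(i, k, j):
        # Each task touches only c[i][j], so tasks do not conflict.
        c[i][j] += a[i][k] * b[k][j]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(n):          # master executes the sequential loops
            for k in range(n):
                # fork: submit one task per iteration of the doall loop
                futures = [pool.submit(update, i, k, j) for j in range(n)]
                # synchronize: wait for every task before the next k
                for f in futures:
                    f.result()
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_doall(a, b, [[0, 0], [0, 0]]))  # [[19, 22], [43, 50]]
```

The synchronization after each doall is required because the next k iteration updates the same elements of c.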
|