|
- Program has high synchronization cost: n² forks and synchronizations.
- Different processors may write c[i,j] and c[i,j+1], which may lie on the same cache line (false sharing).
Execute Outer Loop in Parallel
doall i := 1 to n do
  for k := 1 to n do
    for j := 1 to n do
      c[i,j] = c[i,j] + a[i,k] ∗ b[k,j]
    endfor
  endfor
endall |
- Only one fork and synchronization
- Each task is a large-grain operation
- Each task fetches the whole array B
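The outer-loop version can be sketched in Python roughly as follows. This is only an illustration, not the notation used in these notes: the names `row_task` and `matmul_outer_parallel` and the `workers` parameter are assumptions, and a thread pool stands in for the doall. In standard CPython the threads will not speed up this pure-Python arithmetic, so the sketch shows the task structure (one fork, one synchronization, one large task per row) rather than a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def row_task(i, a, b, c):
    """Compute row i of c. Reads all of b, but writes only c[i]."""
    n = len(b)
    for k in range(n):
        a_ik = a[i][k]
        for j in range(n):
            c[i][j] += a_ik * b[k][j]


def matmul_outer_parallel(a, b, workers=4):
    """Multiply square matrices a and b with one task per row of the result."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    # Fork: submit all n row tasks at once.
    # Synchronize: leaving the `with` block waits for every task to finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(row_task, i, a, b, c) for i in range(n)]
        for f in futures:
            f.result()  # re-raise any exception raised inside a task
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_outer_parallel(a, b))  # [[19, 22], [43, 50]]
```

Because each task writes only its own row of `c`, no two tasks update neighbouring elements of the same row, but every task still reads all of `b`.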
SIMD Architecture
- Single front end issuing instructions
- Back ends or PEs execute each instruction
- Code divided into scalar and parallel code
- Scalar code executes on front end
- Each PE can access local memory directly
- Messages must be used to access values from another PE’s memory
- Assume each PE has one row of each matrix
- Variables 1A, 1B and 1C contain rows of A, B and C on each PE
- The fetch operation fetch(k-1)(1B[j]) fetches 1B[j] from PE k-1
- Bkj is a front-end variable
for j := 1 to n do
  PE(0:n-1):ctmp = 0
  for k := 1 to n do
    Bkj = fetch(k-1)(1B[j])
    PE(0:n-1):ctmp = ctmp + 1A[k] ∗ Bkj
  endfor
  PE(0:n-1):1C[j] = ctmp
endfor |
|
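The SIMD loop above can be simulated sequentially to check its logic. The sketch below is an assumption-laden model, not real SIMD code: the function name `simd_matmul` and the lists `local_a`, `local_b`, `local_c` are hypothetical, each list entry stands for one PE's local memory, and the inner loop over `p` stands for a single instruction that all PEs execute at once. Indices are zero-based, so PE `k` here corresponds to `fetch(k-1)` in the pseudocode.

```python
def simd_matmul(A, B):
    """Simulate the front-end/PE matrix multiply for square matrices A and B."""
    n = len(A)
    # PE p holds row p of each matrix in its local memory.
    local_a = [list(row) for row in A]
    local_b = [list(row) for row in B]
    local_c = [[0] * n for _ in range(n)]

    for j in range(n):                  # scalar loop on the front end
        ctmp = [0] * n                  # one ctmp per PE
        for k in range(n):              # scalar loop on the front end
            # Front end fetches one element from PE k's local memory.
            bkj = local_b[k][j]
            # Parallel step: every PE runs the same instruction on its own row.
            for p in range(n):
                ctmp[p] += local_a[p][k] * bkj
        # Parallel step: every PE stores its result in column j of its C row.
        for p in range(n):
            local_c[p][j] = ctmp[p]
    return local_c


print(simd_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The front end performs one fetch per (j, k) pair, so the whole multiply issues n² fetches, each followed by one parallel multiply-add across the PEs.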