Module 13: INTRODUCTION TO COMPILERS FOR HIGH PERFORMANCE COMPUTERS
  Lecture 26: SIMD Architecture
 
  • The program has high synchronization cost: n² forks and synchronizations.
  • Different processors may access c[i,j] and c[i,j+1], which may lie on the same cache line, causing false sharing.
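
To make these costs concrete, here is a sketch of my own (not from the lecture) of the criticized version, assuming the parallelized loop is the innermost j loop of the usual i-k-j matrix multiply, written in C with OpenMP; N and the array names are illustrative. The parallel region sits inside the i and k loops, so it is entered n² times.

#include <omp.h>

#define N 512
static double a[N][N], b[N][N], c[N][N];

/* Inner-loop-parallel matrix multiply: one fork/join per (i,k) pair,
   i.e. N*N forks and synchronizations in total. */
void matmul_inner_parallel(void) {
    for (int i = 0; i < N; i++) {
        for (int k = 0; k < N; k++) {
            #pragma omp parallel for
            for (int j = 0; j < N; j++)
                /* Threads at chunk boundaries write neighbouring
                   c[i][j] and c[i][j+1], which can share a cache line. */
                c[i][j] += a[i][k] * b[k][j];
        }
    }
}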

Execute Outer Loop in Parallel

doall i := 1 to n do
  for k := 1 to n do
    for j := 1 to n do
      c[i,j] = c[i,j] + a[i,k] ∗ b[k,j]
    endfor
  endfor
endall
  • Only one fork and one synchronization are needed (see the OpenMP sketch below).
  • Each task is a large-grain operation.
  • Each task fetches the whole array B.
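
For comparison, here is a minimal OpenMP rendering of the doall version above (my own translation, assuming doall maps to a single parallel for; N is illustrative). There is exactly one fork at the start of the parallel region and one join at its end, and each task computes a whole row of C.

#include <omp.h>

#define N 512
static double a[N][N], b[N][N], c[N][N];

/* Outer-loop-parallel matrix multiply: one fork, one join. */
void matmul_outer_parallel(void) {
    #pragma omp parallel for              /* the single fork/join       */
    for (int i = 0; i < N; i++)           /* one large-grain task per row */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];  /* each task reads all of b */
}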

SIMD Architecture

  • Single front end issuing instructions
  • Back ends, or processing elements (PEs), execute each instruction in lockstep
  • Code divided into scalar and parallel code
  • Scalar code executes on front end
  • Each PE can access local memory directly
  • Messages must be used to access values from another PE’s memory
  • Assume each PE holds one row of each matrix (PE i holds row i+1, for i = 0, …, n−1)
  • Variables lA, lB and lC hold the local rows of A, B and C on each PE
  • The fetch operation fetches lB[j] from another PE's memory (from PE k−1, the owner of row k of B, in the code below)
  • Bkj is a front-end variable; its value is then available to every PE
for j := 1 to n do
  PE(0:n-1): ctmp = 0
  for k := 1 to n do
    Bkj = fetch(k-1)(lB[j])
    PE(0:n-1): ctmp = ctmp + lA[k] ∗ Bkj
  endfor
  PE(0:n-1): lC[j] = ctmp
endfor
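
To make the message traffic explicit, here is a hedged MPI sketch of the same computation (my own translation, not part of the lecture): rank i plays PE i and holds row i+1 of A, B and C in la, lb and lc, and MPI_Bcast from rank k−1 stands in for the fetch of lB[j] through the front end. The sample data for A and B is illustrative only.

#include <mpi.h>
#include <stdio.h>

#define N 4   /* run with exactly N ranks, e.g. mpirun -np 4 ./a.out */

int main(int argc, char **argv) {
    double la[N], lb[N], lc[N];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Illustrative local data: a[i][k] = i+k, B = identity. */
    for (int k = 0; k < N; k++) {
        la[k] = (double)(rank + 1 + (k + 1));
        lb[k] = (rank == k) ? 1.0 : 0.0;
    }

    for (int j = 0; j < N; j++) {           /* front-end j loop    */
        double ctmp = 0.0;                  /* per-PE accumulator  */
        for (int k = 1; k <= N; k++) {
            /* Bkj = fetch(k-1)(lB[j]): row k of B lives on rank k-1;
               the broadcast plays the fetch-and-redistribute role
               of the front end. */
            double bkj = lb[j];
            MPI_Bcast(&bkj, 1, MPI_DOUBLE, k - 1, MPI_COMM_WORLD);
            ctmp += la[k - 1] * bkj;
        }
        lc[j] = ctmp;                       /* lC[j] on every PE   */
    }

    printf("PE %d computed row %d of C: lc[0] = %g\n", rank, rank + 1, lc[0]);
    MPI_Finalize();
    return 0;
}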