Module 13: INTRODUCTION TO COMPILERS FOR HIGH PERFORMANCE COMPUTERS
  Lecture 26: SIMD Architecture
 
  • The program has high synchronization cost: n² forks and synchronizations.
  • Different processors may access c[i,j] and c[i,j+1], which may lie on the same cache line, causing false sharing.
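
To make these costs concrete, here is a sketch of my own (not from the lecture) of the criticized version, assuming the parallelized loop is the innermost j loop of the usual i-k-j matrix multiply, written in C with OpenMP; N and the array names are illustrative. The parallel region sits inside the i and k loops, so it is entered n² times.

#include <omp.h>

#define N 512
static double a[N][N], b[N][N], c[N][N];

/* Inner-loop-parallel matrix multiply: one fork/join per (i,k) pair,
   i.e. N*N forks and synchronizations in total. */
void matmul_inner_parallel(void) {
    for (int i = 0; i < N; i++) {
        for (int k = 0; k < N; k++) {
            #pragma omp parallel for
            for (int j = 0; j < N; j++)
                /* Threads at chunk boundaries write neighbouring
                   c[i][j] and c[i][j+1], which can share a cache line. */
                c[i][j] += a[i][k] * b[k][j];
        }
    }
}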

Execute Outer Loop in Parallel

doall i := 1 to n do
  for k := 1 to n do
    for j := 1 to n do
      c[i,j] = c[i,j] + a[i,k] ∗ b[k,j]
    endfor
  endfor
endall
  • Only one fork and one synchronization are needed (see the OpenMP sketch below).
  • Each task is a large-grain operation.
  • Each task fetches the whole array B.
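
For comparison, here is a minimal OpenMP rendering of the doall version above (my own translation, assuming doall maps to a single parallel for; N is illustrative). There is exactly one fork at the start of the parallel region and one join at its end, and each task computes a whole row of C.

#include <omp.h>

#define N 512
static double a[N][N], b[N][N], c[N][N];

/* Outer-loop-parallel matrix multiply: one fork, one join. */
void matmul_outer_parallel(void) {
    #pragma omp parallel for              /* the single fork/join       */
    for (int i = 0; i < N; i++)           /* one large-grain task per row */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];  /* each task reads all of b */
}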

SIMD Architecture

  • Single front end issuing instructions
  • Back ends, or processing elements (PEs), execute each instruction in lockstep
  • Code divided into scalar and parallel code
  • Scalar code executes on front end
  • Each PE can access local memory directly
  • Messages must be used to access values from another PE’s memory
  • Assume each PE holds one row of each matrix (PE i holds row i+1, for i = 0, …, n−1)
  • Variables lA, lB and lC hold the local rows of A, B and C on each PE
  • The fetch operation fetches lB[j] from another PE's memory (from PE k−1, the owner of row k of B, in the code below)
  • Bkj is a front-end variable; its value is then available to every PE
for j := 1 to n do
  PE(0:n-1): ctmp = 0
  for k := 1 to n do
    Bkj = fetch(k-1)(lB[j])
    PE(0:n-1): ctmp = ctmp + lA[k] ∗ Bkj
  endfor
  PE(0:n-1): lC[j] = ctmp
endfor
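
To make the message traffic explicit, here is a hedged MPI sketch of the same computation (my own translation, not part of the lecture): rank i plays PE i and holds row i+1 of A, B and C in la, lb and lc, and MPI_Bcast from rank k−1 stands in for the fetch of lB[j] through the front end. The sample data for A and B is illustrative only.

#include <mpi.h>
#include <stdio.h>

#define N 4   /* run with exactly N ranks, e.g. mpirun -np 4 ./a.out */

int main(int argc, char **argv) {
    double la[N], lb[N], lc[N];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Illustrative local data: a[i][k] = i+k, B = identity. */
    for (int k = 0; k < N; k++) {
        la[k] = (double)(rank + 1 + (k + 1));
        lb[k] = (rank == k) ? 1.0 : 0.0;
    }

    for (int j = 0; j < N; j++) {           /* front-end j loop    */
        double ctmp = 0.0;                  /* per-PE accumulator  */
        for (int k = 1; k <= N; k++) {
            /* Bkj = fetch(k-1)(lB[j]): row k of B lives on rank k-1;
               the broadcast plays the fetch-and-redistribute role
               of the front end. */
            double bkj = lb[j];
            MPI_Bcast(&bkj, 1, MPI_DOUBLE, k - 1, MPI_COMM_WORLD);
            ctmp += la[k - 1] * bkj;
        }
        lc[j] = ctmp;                       /* lC[j] on every PE   */
    }

    printf("PE %d computed row %d of C: lc[0] = %g\n", rank, rank + 1, lc[0]);
    MPI_Finalize();
    return 0;
}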