Module 8: "Performance Issues"
  Lecture 15: "Locality and Communication Optimizations"
 

Worse: false sharing

  • If the algorithm is designed so poorly that
    • Two processors write to two different words within a cache line at the same time
    • The cache line keeps on moving between two processors
    • The processors are not really accessing or updating the same element, but whatever they are updating happens to fall within the same cache line: not true sharing, but false sharing
    • For shared memory programs, false sharing can easily degrade performance significantly
    • Easy to avoid: just pad up to the end of the cache line before starting the allocation of the data for the next processor (wastes memory, but improves performance)

Communication cost

  • Given the total volume of communication (in bytes, say), the goal is to reduce the end-to-end latency
  • Simple model:

    T = f*(o + L + (n / m) / B + tc – overlap) where
    f = frequency of messages
    o = overhead per message (at receiver and sender)
    L = network delay per message (really the router delay)
    n = total volume of communication in bytes
    m = total number of messages
    B = node-to-network bandwidth
    tc = contention-induced average latency per message
    overlap = how much communication time is overlapped with useful computation

  • The goal is to reduce T
    • Reduce total overhead by communicating less often: restructure the algorithm to reduce m, i.e., communicate fewer, larger messages (easy for message passing, but shared memory needs extra support in the memory controller, e.g., block transfer)
    • Reduce L = average number of hops*time per hop
    • The number of hops can be reduced by mapping the algorithm onto the topology properly, e.g., nearest-neighbor communication is well-suited to a ring (just left/right neighbors) or a mesh (the grid solver example); however, L is not very important because today's routers are really fast (routing delay is ~10 ns), so o and tc are the dominant terms in T
    • Reduce tc by not creating hot-spots in the system: restructure the algorithm so that no particular node gets flooded with messages; distribute the communication uniformly

Contention

  • It is very easy to ignore contention effects when designing algorithms
    • Can severely degrade performance by creating hot-spots
  • Location hot-spot:
    • Consider accumulating a global variable; the accumulation takes place on a single node, i.e., all nodes access the variable allocated on that particular node whenever they try to increment it