Module 8: "Performance Issues"
  Lecture 15: "Locality and Communication Optimizations"
 

Worse: false sharing

  • If the algorithm is designed so poorly that
    • Two processors write to two different words within a cache line at the same time
    • The cache line keeps on moving between two processors
    • The processors are not really accessing or updating the same element, but whatever they are updating happens to fall within the same cache line: not true sharing, but false sharing
    • For shared memory programs, false sharing can easily degrade performance significantly
    • Easy to avoid: just pad up to the end of the cache line before starting the allocation of the data for the next processor (wastes memory, but improves performance)

Communication cost

  • Given the total volume of communication (in bytes, say), the goal is to reduce the end-to-end latency
  • Simple model:

    T = f*(o + L + (n / m) / B + tc – overlap) where
    f = frequency of messages
    o = overhead per message (at receiver and sender)
    L = network delay per message (really the router delay)
    n = total volume of communication in bytes
    m = total number of messages
    B = node-to-network bandwidth
    tc = contention-induced average latency per message
    overlap = how much communication time is overlapped with useful computation

  • The goal is to reduce T
    • Reduce total overhead by communicating less often: restructure the algorithm to reduce m, i.e., communicate fewer, larger messages (easy for message passing, but shared memory needs extra support in the memory controller, e.g., block transfer)
    • Reduce L = average number of hops*time per hop
    • The number of hops can be reduced by mapping the algorithm onto the topology properly, e.g., nearest-neighbor communication is well-suited to a ring (just left/right neighbors) or a mesh (the grid solver example); however, L is not very important because today's routers are really fast (routing delay is ~10 ns), so o and tc are the dominant terms in T
    • Reduce tc by not creating hot-spots in the system: restructure the algorithm so that no particular node gets flooded with messages; distribute the communication uniformly

Contention

  • It is very easy to ignore contention effects when designing algorithms
    • Can severely degrade performance by creating hot-spots
  • Location hot-spot:
    • Consider accumulating a global variable; the accumulation takes place on a single node, i.e., all nodes access the variable allocated on that particular node whenever they try to increment it