Module 13: "Scalable Multiprocessors"
  Lecture 28: "Scalable Multiprocessors"
 

Agenda

  • Basics of scalability
  • Programming models
  • Physical DMA
  • User-level networking
  • Dedicated message processing
  • Shared physical address
  • Cluster of workstations (COWs) and Network of workstations (NOWs)
  • Scaling parallel software
    • Scalable synchronization

Basics of scalability

  • Main problem is the communication medium
    • Buses don’t scale
    • Need more wires, not all of which are shared by all nodes
    • Replace bus by a network of switches
  • Distributed memory multiprocessors
    • Each node has its own local memory
    • To access remote memory, a node sends a point-to-point message to the destination node (see the sketch after this list)
    • How to support efficient messaging?
    • Main goal is to reduce the ratio of remote memory latency to local memory latency
    • In shared memory, how to support coherence efficiently?
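As a concrete illustration of the remote-access path above, the following sketch shows a block-distributed address space in which a load that falls on another node's slice becomes a request/reply message pair. The node count, memory size, and message tags are illustrative assumptions, not part of the lecture.

```python
# Hypothetical sketch of a remote load in a distributed-memory machine:
# the global address space is block-distributed across nodes, and a load
# that falls on another node's slice becomes a request/reply message pair.
from collections import deque

WORDS_PER_NODE = 1024                            # assumed local memory size

class Machine:
    def __init__(self, num_nodes):
        self.mem = [[0] * WORDS_PER_NODE for _ in range(num_nodes)]
        self.inbox = [deque() for _ in range(num_nodes)]   # per-node queue

    def send(self, dst, msg):
        self.inbox[dst].append(msg)

    def load(self, node, global_addr):
        owner, offset = divmod(global_addr, WORDS_PER_NODE)
        if owner == node:
            return self.mem[node][offset]        # local access: fast path
        # Remote access: point-to-point request to the owning node,
        # which replies with the requested word.
        self.send(owner, ("READ_REQ", node, offset))
        self.serve(owner)
        _tag, value = self.inbox[node].popleft()
        return value

    def serve(self, node):
        tag, requester, offset = self.inbox[node].popleft()
        if tag == "READ_REQ":
            self.send(requester, ("READ_REPLY", self.mem[node][offset]))

m = Machine(4)
m.mem[2][5] = 42
print(m.load(0, 2 * WORDS_PER_NODE + 5))         # remote read returns 42
```

In a real machine the request and reply traverse the interconnect, so the remote path pays the messaging latency discussed later; the sketch only captures the message structure.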

Bandwidth scaling

  • Need a large number of independent paths between two nodes
    • Makes it possible to support a large number of concurrent transactions
    • These transactions are initiated independently (as opposed to being serialized by a single centralized bus arbiter)
  • Local accesses should be higher bandwidth
  • Since communication takes place via point-to-point messages, only the routers or switches along the path are involved
    • No global visibility of messages (unlike a bus)
    • Must send separate messages when global visibility needs to be guaranteed, e.g., for invalidations (a small sketch follows this list)
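With no broadcast bus to snoop, a write to a shared line becomes globally visible only after one invalidation message per sharer has been sent and acknowledged. The directory layout and message names below are assumptions chosen for illustration.

```python
# Hypothetical sketch: with no broadcast bus, a write to a shared cache line
# is made globally visible by sending one invalidation message per sharer
# and waiting for all acknowledgements (directory-style bookkeeping).

def make_write_visible(line_addr, directory, send_msg):
    """Invalidate every cached copy recorded in the directory entry."""
    sharers = directory.get(line_addr, set())
    for node in sharers:
        send_msg(node, ("INVALIDATE", line_addr))   # separate message each
    directory[line_addr] = set()                    # no sharers remain
    return len(sharers)                             # acks the writer must await

# Example: three sharers means three invalidations (and later three acks).
sent = []
directory = {0x40: {1, 3, 5}}
acks_needed = make_write_visible(0x40, directory,
                                 lambda dst, msg: sent.append((dst, msg)))
print(acks_needed, sent)
```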

Latency scaling

  • End-to-end latency of a message involves three parts (LogP model)
    • Overhead (o): time to initiate and to terminate a message (at the sender and receiver, respectively); normally involves the kernel overhead in message passing and the coherence protocol overhead in shared memory
    • Node-to-network time or gap (g): message size in bytes divided by the link bandwidth offered by the router to/from the network (how fast packets can be pushed into or pulled out of the network); the bandwidth between the network interface (NI) and the router is normally at least as large and hence is not a bottleneck
    • Routing time or hop time (L): determined by the topology, the routing algorithm, and the router circuitry (e.g., arbitration, number of ports, etc.)
    • Relative importance: L < g < o for most scientific applications, i.e., overhead dominates (see the worked example after this list)
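A back-of-the-envelope calculation shows how the three terms combine for a single message; all parameter values below are made-up assumptions, chosen only to illustrate that the overhead term usually dominates.

```python
# End-to-end latency of one message under the LogP-style breakdown above.
# All parameter values are illustrative assumptions, not measurements.

def message_latency_ns(msg_bytes, overhead_ns, link_bw_bytes_per_ns,
                       hop_time_ns, num_hops):
    o = 2 * overhead_ns                      # send-side + receive-side overhead
    g = msg_bytes / link_bw_bytes_per_ns     # node-to-network (gap) term
    l = num_hops * hop_time_ns               # routing time along the path
    return o + g + l

# Example: 1 KB message, 1 us overhead per end, ~1 GB/s link (1 byte/ns),
# 20 ns per hop, 4 hops -> 2000 + 1024 + 80 = 3104 ns; note o > g > L.
print(message_latency_ns(1024, 1000, 1.0, 20, 4))
```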

Cost scaling

  • Cost of anything has two components
    • Fixed cost
    • Incremental cost for adding something more (in our case more nodes)
  • Bus-based SMPs have too high a fixed cost
    • Scaling involves adding only a commodity processor and possibly more DRAM
    • Need more modular cost scaling, i.e., we don’t want to pay a large fixed cost even for a small-scale machine
  • Costup = cost of P nodes / cost of a single node
  • Parallel computing on a machine is cost-effective if speedup > costup on average for the target applications (see the sketch after this list)
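The costup criterion can be checked with a couple of lines of arithmetic; the cost and speedup numbers used below are made-up assumptions for illustration only.

```python
# Cost-effectiveness check from the slide: speedup must exceed costup,
# where costup = cost of P nodes / cost of a single node.

def costup(cost_p_nodes, cost_one_node):
    return cost_p_nodes / cost_one_node

def cost_effective(speedup, cost_p_nodes, cost_one_node):
    return speedup > costup(cost_p_nodes, cost_one_node)

# Example: if 32 nodes cost 40x a single node but the target applications
# speed up only 25x, the machine is not cost-effective; with a more modular
# design costing 20x, the same 25x speedup is cost-effective.
print(cost_effective(25.0, 40.0, 1.0))   # False
print(cost_effective(25.0, 20.0, 1.0))   # True
```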