Agenda
- Basics of scalability
- Programming models
- Physical DMA
- User-level networking
- Dedicated message processing
- Shared physical address space
- Clusters of workstations (COWs) and networks of workstations (NOWs)
- Scaling parallel software
Basics of scalability
- Main problem is the communication medium
- Buses don’t scale
- Need more wires, not all of which are shared by every node
- Replace bus by a network of switches
- Distributed memory multiprocessors
- Each node has its own local memory
- To access remote memory, a node sends a point-to-point message to the destination (see the sketch after this list)
- How to support efficient messaging?
- Main goal is to reduce the ratio of remote memory latency to local memory latency
- In shared memory, how to support coherence efficiently?
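
A minimal Python sketch of this remote-access path, assuming a toy Node/Mesh abstraction (all names here are illustrative, not any real machine's API): a local read is a plain memory access, while a remote read becomes a request/reply pair of point-to-point messages.

    class Node:
        def __init__(self, node_id, mesh):
            self.node_id = node_id
            self.memory = {}      # local memory: address -> value
            self.mesh = mesh      # interconnect that routes point-to-point messages

        def read(self, owner, addr):
            if owner == self.node_id:
                return self.memory[addr]                 # local access: fast path
            # remote access: send an explicit request message, wait for the reply
            request = {"src": self.node_id, "op": "read", "addr": addr}
            reply = self.mesh.send(owner, request)       # point-to-point, no broadcast
            return reply["value"]

        def handle(self, msg):
            # the destination node services the request against its own memory
            return {"value": self.memory[msg["addr"]]}

    class Mesh:
        def __init__(self):
            self.nodes = {}

        def send(self, dest, msg):
            # stands in for routing through switches; only the switches on the
            # path see the message, so there is no bus-like global visibility
            return self.nodes[dest].handle(msg)

    mesh = Mesh()
    n0, n1 = Node(0, mesh), Node(1, mesh)
    mesh.nodes = {0: n0, 1: n1}
    n1.memory[0x40] = 7
    print(n0.read(1, 0x40))   # node 0 reads node 1's memory -> 7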
Bandwidth scaling
- Need a large number of independent paths between two nodes
- Makes it possible to support a large number of concurrent transactions
- Transactions are initiated independently (as opposed to going through a single centralized bus arbiter)
- Local accesses should enjoy higher bandwidth than remote accesses
- Since communication takes place via point-to-point messages, only the routers or switches along the path are involved
- No global visibility of messages (unlike a bus)
- Must send separate messages to guarantee global visibility when necessary (e.g., invalidations; see the sketch after this list)
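
To make the broadcast-vs-point-to-point cost concrete, here is a small Python sketch (message counts only; the protocol details are simplified): on a snooping bus one broadcast reaches every cache, while on a switched network each sharer needs its own invalidation message and acknowledgment.

    def invalidation_messages(sharers, interconnect):
        if interconnect == "bus":
            return 1                 # a single broadcast that every cache snoops
        if interconnect == "network":
            return 2 * len(sharers)  # one invalidation + one ack per sharer
        raise ValueError(interconnect)

    sharers = [2, 5, 9]              # node ids currently caching the block
    print(invalidation_messages(sharers, "bus"))      # 1
    print(invalidation_messages(sharers, "network"))  # 6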
Latency scaling
- End-to-end latency of a message involves three parts (the LogP model)
- Overhead (o): time to initiate and to terminate a message (at the sender and receiver respectively); normally the kernel overhead in message passing and the coherence protocol overhead in shared memory
- Node-to-network time or gap (g): number of bytes divided by the link bandwidth, where this is the bandwidth offered by the router to/from the network (i.e., how fast packets can be pushed into or pulled from the network); the bandwidth between the network interface (NI) and the router is normally at least as large and hence not a bottleneck
- Routing time or hop time (L): determined by topology, routing algorithm, and router circuitry (e.g., arbitration, number of ports etc.)
- Relative importance: L < g < o for most scientific applications (a worked example follows this list)
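
A worked Python example of this decomposition, assuming one common LogP-style formulation in which a k-packet message costs o + (k-1)g + L + o end to end (the parameter values below are made up, in microseconds):

    o = 5.0    # software overhead at each end (sender and receiver)
    g = 0.5    # gap: minimum spacing between consecutive packet injections
    L = 0.2    # network transit time through the routers/switches

    def message_latency(num_packets):
        # sender overhead, then one packet every g after the first,
        # network transit, then receiver overhead
        return o + (num_packets - 1) * g + L + o

    print(message_latency(1))    # 10.2: short messages are dominated by o
    print(message_latency(100))  # 59.7: the gap g matters for long messages

With these made-up numbers L < g < o, so for short messages reducing the software overhead o helps far more than a faster network would.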
Cost scaling
- Cost of anything has two components
- Fixed cost
- Incremental cost for adding something more (in our case more nodes)
- Bus-based SMPs have too high a fixed cost
- Scaling involves adding a commodity processor and possibly more DRAM
- Need more modular cost scaling, i.e., one should not have to pay a large fixed cost even for a small-scale machine
- Costup = cost of a P-node machine / cost of a single-node machine
- Parallel computing on a machine is cost-effective if speedup > costup on average for the target applications (a worked example follows this list)
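
A worked Python example of this test, with hypothetical costs: the fixed cost is amortized as nodes are added, so costup grows more slowly than the node count, and parallelism pays off whenever speedup exceeds it.

    fixed_cost = 20_000          # chassis, interconnect, cabinets, etc.
    per_node_cost = 5_000        # commodity processor plus DRAM

    def costup(p):
        cost_p = fixed_cost + p * per_node_cost
        cost_1 = fixed_cost + per_node_cost
        return cost_p / cost_1

    p, speedup = 16, 12.0
    print(costup(p))             # 4.0: 16 nodes cost only 4x a single node
    print(speedup > costup(p))   # True -> cost-effective on this workload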