Module 13: "Scalable Multiprocessors"
  Lecture 28: "Scalable Multiprocessors"
 

Agenda

  • Basics of scalability
  • Programming models
  • Physical DMA
  • User-level networking
  • Dedicated message processing
  • Shared physical address
  • Cluster of workstations (COWs) and Network of workstations (NOWs)
  • Scaling parallel software
    • Scalable synchronization

Basics of scalability

  • Main problem is the communication medium
    • Buses don’t scale
    • Need more wires, not all of which are shared by all nodes
    • Replace bus by a network of switches
  • Distributed memory multiprocessors
    • Each node has its own local memory
    • To access remote memory, a node sends a point-to-point message to the destination node (see the sketch after this list)
    • How to support efficient messaging?
    • Main goal is to reduce the ratio of remote memory latency to local memory latency
    • In shared memory, how to support coherence efficiently?
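As a concrete illustration of the remote-access path above, the following sketch shows a block-distributed address space in which a load that falls on another node's slice becomes a request/reply message pair. The node count, memory size, and message tags are illustrative assumptions, not part of the lecture.

```python
# Hypothetical sketch of a remote load in a distributed-memory machine:
# the global address space is block-distributed across nodes, and a load
# that falls on another node's slice becomes a request/reply message pair.
from collections import deque

WORDS_PER_NODE = 1024                            # assumed local memory size

class Machine:
    def __init__(self, num_nodes):
        self.mem = [[0] * WORDS_PER_NODE for _ in range(num_nodes)]
        self.inbox = [deque() for _ in range(num_nodes)]   # per-node queue

    def send(self, dst, msg):
        self.inbox[dst].append(msg)

    def load(self, node, global_addr):
        owner, offset = divmod(global_addr, WORDS_PER_NODE)
        if owner == node:
            return self.mem[node][offset]        # local access: fast path
        # Remote access: point-to-point request to the owning node,
        # which replies with the requested word.
        self.send(owner, ("READ_REQ", node, offset))
        self.serve(owner)
        _tag, value = self.inbox[node].popleft()
        return value

    def serve(self, node):
        tag, requester, offset = self.inbox[node].popleft()
        if tag == "READ_REQ":
            self.send(requester, ("READ_REPLY", self.mem[node][offset]))

m = Machine(4)
m.mem[2][5] = 42
print(m.load(0, 2 * WORDS_PER_NODE + 5))         # remote read returns 42
```

In a real machine the request and reply traverse the interconnect, so the remote path pays the messaging latency discussed later; the sketch only captures the message structure.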

Bandwidth scaling

  • Need a large number of independent paths between two nodes
    • Makes it possible to support a large number of concurrent transactions
    • These transactions are initiated independently (as opposed to being serialized by a single centralized bus arbiter)
  • Local accesses should be higher bandwidth
  • Since communication takes place via point-to-point messages, only the routers or switches along the path are involved
    • No global visibility of messages (unlike a bus)
    • Must send separate messages when global visibility needs to be guaranteed, e.g., for invalidations (a small sketch follows this list)
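With no broadcast bus to snoop, a write to a shared line becomes globally visible only after one invalidation message per sharer has been sent and acknowledged. The directory layout and message names below are assumptions chosen for illustration.

```python
# Hypothetical sketch: with no broadcast bus, a write to a shared cache line
# is made globally visible by sending one invalidation message per sharer
# and waiting for all acknowledgements (directory-style bookkeeping).

def make_write_visible(line_addr, directory, send_msg):
    """Invalidate every cached copy recorded in the directory entry."""
    sharers = directory.get(line_addr, set())
    for node in sharers:
        send_msg(node, ("INVALIDATE", line_addr))   # separate message each
    directory[line_addr] = set()                    # no sharers remain
    return len(sharers)                             # acks the writer must await

# Example: three sharers means three invalidations (and later three acks).
sent = []
directory = {0x40: {1, 3, 5}}
acks_needed = make_write_visible(0x40, directory,
                                 lambda dst, msg: sent.append((dst, msg)))
print(acks_needed, sent)
```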

Latency scaling

  • End-to-end latency of a message involves three parts (LogP model)
    • Overhead (o): time to initiate and to terminate a message (at the sender and receiver, respectively); normally involves the kernel overhead in message passing and the coherence protocol overhead in shared memory
    • Node-to-network time or gap (g): message size in bytes divided by the link bandwidth offered by the router to/from the network (how fast packets can be pushed into or pulled out of the network); the bandwidth between the network interface (NI) and the router is normally at least as large and hence is not a bottleneck
    • Routing time or hop time (L): determined by the topology, the routing algorithm, and the router circuitry (e.g., arbitration, number of ports, etc.)
    • Relative importance: L < g < o for most scientific applications, i.e., overhead dominates (see the worked example after this list)
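A back-of-the-envelope calculation shows how the three terms combine for a single message; all parameter values below are made-up assumptions, chosen only to illustrate that the overhead term usually dominates.

```python
# End-to-end latency of one message under the LogP-style breakdown above.
# All parameter values are illustrative assumptions, not measurements.

def message_latency_ns(msg_bytes, overhead_ns, link_bw_bytes_per_ns,
                       hop_time_ns, num_hops):
    o = 2 * overhead_ns                      # send-side + receive-side overhead
    g = msg_bytes / link_bw_bytes_per_ns     # node-to-network (gap) term
    l = num_hops * hop_time_ns               # routing time along the path
    return o + g + l

# Example: 1 KB message, 1 us overhead per end, ~1 GB/s link (1 byte/ns),
# 20 ns per hop, 4 hops -> 2000 + 1024 + 80 = 3104 ns; note o > g > L.
print(message_latency_ns(1024, 1000, 1.0, 20, 4))
```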

Cost scaling

  • Cost of anything has two components
    • Fixed cost
    • Incremental cost for adding something more (in our case more nodes)
  • Bus-based SMPs have too high a fixed cost
    • Scaling involves adding only a commodity processor and possibly more DRAM
    • Need more modular cost scaling, i.e., we don’t want to pay a large fixed cost even for a small-scale machine
  • Costup = cost of P nodes / cost of a single node
  • Parallel computing on a machine is cost-effective if speedup > costup on average for the target applications (see the sketch after this list)
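The costup criterion can be checked with a couple of lines of arithmetic; the cost and speedup numbers used below are made-up assumptions for illustration only.

```python
# Cost-effectiveness check from the slide: speedup must exceed costup,
# where costup = cost of P nodes / cost of a single node.

def costup(cost_p_nodes, cost_one_node):
    return cost_p_nodes / cost_one_node

def cost_effective(speedup, cost_p_nodes, cost_one_node):
    return speedup > costup(cost_p_nodes, cost_one_node)

# Example: if 32 nodes cost 40x a single node but the target applications
# speed up only 25x, the machine is not cost-effective; with a more modular
# design costing 20x, the same 25x speedup is cost-effective.
print(cost_effective(25.0, 40.0, 1.0))   # False
print(cost_effective(25.0, 20.0, 1.0))   # True
```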