Module 13: "Scalable Multiprocessors"
  Lecture 28: "Scalable Multiprocessors"
 

Caching shared data?

  • All transactions are no longer visible to all processors
    • Whether a page should be cached or not is normally decided as part of the VA to PA translation
    • For example, in some graphics co-processors all operations must go through uncached writes; storing commands/data to memory-mapped control registers is also uncached
    • Private memory lines can be cached without any problem, regardless of whether the line is local or remote
    • Caching shared lines introduces coherence issues

COWs and NOWs

  • Historically, COWs (clusters of workstations) and NOWs (networks of workstations) were used to build multi-programmed multi-user systems
    • Connect a number of PCs or workstations over a cheap commodity network and schedule independent jobs on machines depending on the load
  • Increasingly, these clusters are being used as parallel machines
    • One major reason is the availability of message passing libraries to express parallelism over commodity LAN or WAN
    • Also, technology breakthrough in terms of high-speed interconnects (ServerNet, Myrinet, Infiniband, PCI Express AS, etc.)
  • Varying degrees of specialized support in the communication assist (CA)
    • The conventional TCP/IP stack imposes an enormous overhead to move even a small amount of data (the software overhead often dominates the actual transfer time on common Ethernet): network processor architecture has been a hot research topic
    • Active messages allow user-level communication
    • Reflective memory allows writes to special regions of memory to appear as writes into regions on remote processors
    • Virtual interface architecture (VIA): each process has a communication end point consisting of a send queue, a receive queue, and status; a process can deposit a message in its send queue with a destination id so that the message is delivered into the receive queue of the target process
    • The CA hardware normally plugs on to the I/O bus (e.g., a fast PCI bus) as opposed to the memory bus, which supports coherence; it could be on the graphics bus also
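The VIA end point above can be sketched as a pair of queues per process plus a status word. The following C sketch models the mechanism only; all names (`endpoint_t`, `via_send`, `via_deliver`, `via_recv`) are illustrative, not the real VIA API, and `via_deliver` stands in for the NIC hardware that actually moves a message from a send queue to the destination's receive queue.

```c
#include <string.h>

#define NPROC  4     /* illustrative sizes, not from the VIA spec */
#define QDEPTH 16
#define MSGLEN 64

typedef struct { int dst; char payload[MSGLEN]; } msg_t;

typedef struct { msg_t buf[QDEPTH]; int head, tail; } queue_t;

typedef struct {
    queue_t send_q;   /* messages deposited by us, not yet delivered */
    queue_t recv_q;   /* messages delivered to us */
    int     status;   /* e.g. number of pending sends */
} endpoint_t;

static endpoint_t ep[NPROC];

static int q_push(queue_t *q, const msg_t *m) {
    if ((q->tail + 1) % QDEPTH == q->head) return -1;   /* full */
    q->buf[q->tail] = *m;
    q->tail = (q->tail + 1) % QDEPTH;
    return 0;
}

static int q_pop(queue_t *q, msg_t *m) {
    if (q->head == q->tail) return -1;                  /* empty */
    *m = q->buf[q->head];
    q->head = (q->head + 1) % QDEPTH;
    return 0;
}

/* Process src deposits a message tagged with a destination id. */
int via_send(int src, int dst, const char *text) {
    msg_t m;
    m.dst = dst;
    strncpy(m.payload, text, MSGLEN - 1);
    m.payload[MSGLEN - 1] = '\0';
    ep[src].status++;
    return q_push(&ep[src].send_q, &m);
}

/* Stand-in for the NIC: move one message from src's send queue into
 * the receive queue of the process named by its destination id. */
int via_deliver(int src) {
    msg_t m;
    if (q_pop(&ep[src].send_q, &m) != 0) return -1;
    ep[src].status--;
    return q_push(&ep[m.dst].recv_q, &m);
}

int via_recv(int self, char *out) {
    msg_t m;
    if (q_pop(&ep[self].recv_q, &m) != 0) return -1;
    strncpy(out, m.payload, MSGLEN);
    return 0;
}
```

The key property this models is that communication bypasses the OS on the send/receive path: a process touches only its own queues, and the (here simulated) NIC does the routing.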

Scalable synchronization

  • In message passing, a send/receive pair provides point-to-point synchronization
  • Handle all-to-all synchronization via a tree barrier
  • Also, all-to-all communication must be properly staggered to avoid hot-spots in the system
    • Classical example: matrix transpose
  • Scalable locks such as ticket locks or array-based locks should be used
  • Any problem with array locks?
    • Array locations may no longer be local: invalidations cause remote misses
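An array-based lock of the kind discussed above can be sketched in C11 atomics as follows (an Anderson-style lock; names like `alock_acquire` are illustrative). Each contender spins on its own padded slot, so a release invalidates at most one other processor's cached line; the problem raised in the last bullet is that the slot a processor gets assigned may live in remote memory, so that spin becomes remote misses on a distributed machine.

```c
#include <stdatomic.h>

#define SLOTS 8    /* must be >= max number of concurrent contenders */
#define LINE  64   /* assumed cache-line size; padding keeps slots apart */

typedef struct {
    struct {
        _Atomic int must_wait;
        char pad[LINE - sizeof(_Atomic int)];   /* one slot per line */
    } slot[SLOTS];
    _Atomic unsigned next;                      /* ticket counter */
} array_lock_t;

void alock_init(array_lock_t *l) {
    for (int i = 0; i < SLOTS; i++)
        atomic_store(&l->slot[i].must_wait, i != 0);
    atomic_store(&l->next, 0);
}

/* Returns the slot index; the caller passes it back to release. */
unsigned alock_acquire(array_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next, 1) % SLOTS;
    while (atomic_load(&l->slot[me].must_wait))
        ;   /* spin on our own slot only */
    return me;
}

void alock_release(array_lock_t *l, unsigned me) {
    atomic_store(&l->slot[me].must_wait, 1);            /* rearm our slot */
    atomic_store(&l->slot[(me + 1) % SLOTS].must_wait, 0);  /* pass baton */
}
```

Because slots are handed out by a global counter, a processor has no control over which array location it spins on, which is exactly why the location may not be local to it.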

Distributed queue locks

  • Goodman, Vernon, Woest (1989)
    • Logically arrange the array as a linked list
    • Allocate a new node (the allocation must be atomic) when a processor enters the acquire phase
    • The node is allocated on the local memory of the contending processor
    • A tail pointer is always maintained
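The structure described above (a linked list of per-processor nodes plus a tail pointer) is also the basis of the software MCS lock, which can serve as a sketch of the idea; the code below is that software analog in C11 atomics, not the Goodman-Vernon-Woest hardware scheme itself. Each processor spins on a node it allocates in its own local memory, and a single tail pointer always names the end of the queue.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the queue */
    _Atomic int                locked; /* 1 while we must wait */
} mcs_node_t;

typedef struct {
    _Atomic(mcs_node_t *) tail;        /* always points to the last node */
} mcs_lock_t;

/* 'me' is a node in the caller's local memory, so the spin below
 * generates no remote traffic once the line is cached. */
void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, 1);
    /* Atomically append our node and learn who was last before us. */
    mcs_node_t *prev = atomic_exchange(&l->tail, me);
    if (prev != NULL) {
        atomic_store(&prev->next, me);  /* link in behind predecessor */
        while (atomic_load(&me->locked))
            ;                           /* spin on our own local node */
    }
    /* prev == NULL: the queue was empty, we hold the lock. */
}

void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No visible successor: try to swing the tail back to empty. */
        mcs_node_t *expect = me;
        if (atomic_compare_exchange_strong(&l->tail, &expect, NULL))
            return;
        /* A successor is mid-enqueue; wait for it to link itself in. */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, 0);     /* hand the lock to the successor */
}
```

The release touches exactly one remote location (the successor's node), so lock hand-off costs O(1) network transactions regardless of how many processors are queued.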