Module 13: "Scalable Multiprocessors"
  Lecture 28: "Scalable Multiprocessors"
 

Caching shared data?

  • All transactions are no longer visible to all processors
    • Whether a page should be cached or not is normally decided as part of the VA to PA translation
    • For example, in some graphics co-processors all operations must go through uncached writes; storing commands/data to memory-mapped control registers is also uncached
    • Private memory lines can be cached without any problem, regardless of whether the line is local or remote
    • Caching shared lines introduces coherence issues

COWs and NOWs

  • Historically, COWs (clusters of workstations) and NOWs (networks of workstations) were used to build multi-programmed multi-user systems
    • Connect a number of PCs or workstations over a cheap commodity network and schedule independent jobs on machines depending on the load
  • Increasingly, these clusters are being used as parallel machines
    • One major reason is the availability of message passing libraries to express parallelism over commodity LAN or WAN
    • Also, technology breakthrough in terms of high-speed interconnects (ServerNet, Myrinet, Infiniband, PCI Express AS, etc.)
  • Varying degrees of specialized support in the communication assist (CA)
    • The conventional TCP/IP stack imposes an enormous overhead to move even a small amount of data (the software overhead often dominates the actual transfer time on common Ethernet): network processor architecture has been a hot research topic
    • Active messages allow user-level communication
    • Reflective memory allows writes to special regions of memory to appear as writes into regions on remote processors
    • Virtual interface architecture (VIA): each process has a communication end point consisting of a send queue, a receive queue, and status; a process can deposit a message in its send queue with a destination id so that the message is delivered into the receive queue of the target process
    • The CA hardware normally plugs on to the I/O bus (e.g., a fast PCI bus) as opposed to the memory bus, which supports coherence; it could be on the graphics bus also
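The VIA end point above can be sketched as a pair of queues per process plus a status word. The following C sketch models the mechanism only; all names (`endpoint_t`, `via_send`, `via_deliver`, `via_recv`) are illustrative, not the real VIA API, and `via_deliver` stands in for the NIC hardware that actually moves a message from a send queue to the destination's receive queue.

```c
#include <string.h>

#define NPROC  4     /* illustrative sizes, not from the VIA spec */
#define QDEPTH 16
#define MSGLEN 64

typedef struct { int dst; char payload[MSGLEN]; } msg_t;

typedef struct { msg_t buf[QDEPTH]; int head, tail; } queue_t;

typedef struct {
    queue_t send_q;   /* messages deposited by us, not yet delivered */
    queue_t recv_q;   /* messages delivered to us */
    int     status;   /* e.g. number of pending sends */
} endpoint_t;

static endpoint_t ep[NPROC];

static int q_push(queue_t *q, const msg_t *m) {
    if ((q->tail + 1) % QDEPTH == q->head) return -1;   /* full */
    q->buf[q->tail] = *m;
    q->tail = (q->tail + 1) % QDEPTH;
    return 0;
}

static int q_pop(queue_t *q, msg_t *m) {
    if (q->head == q->tail) return -1;                  /* empty */
    *m = q->buf[q->head];
    q->head = (q->head + 1) % QDEPTH;
    return 0;
}

/* Process src deposits a message tagged with a destination id. */
int via_send(int src, int dst, const char *text) {
    msg_t m;
    m.dst = dst;
    strncpy(m.payload, text, MSGLEN - 1);
    m.payload[MSGLEN - 1] = '\0';
    ep[src].status++;
    return q_push(&ep[src].send_q, &m);
}

/* Stand-in for the NIC: move one message from src's send queue into
 * the receive queue of the process named by its destination id. */
int via_deliver(int src) {
    msg_t m;
    if (q_pop(&ep[src].send_q, &m) != 0) return -1;
    ep[src].status--;
    return q_push(&ep[m.dst].recv_q, &m);
}

int via_recv(int self, char *out) {
    msg_t m;
    if (q_pop(&ep[self].recv_q, &m) != 0) return -1;
    strncpy(out, m.payload, MSGLEN);
    return 0;
}
```

The key property this models is that communication bypasses the OS on the send/receive path: a process touches only its own queues, and the (here simulated) NIC does the routing.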

Scalable synchronization

  • In message passing, a send/receive pair provides point-to-point synchronization
  • Handle all-to-all synchronization via a tree barrier
  • Also, all-to-all communication must be properly staggered to avoid hot-spots in the system
    • Classical example: matrix transpose
  • Scalable locks such as ticket locks or array-based locks should be used
  • Any problem with array locks?
    • Array locations may no longer be local: invalidations cause remote misses
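An array-based lock of the kind discussed above can be sketched in C11 atomics as follows (an Anderson-style lock; names like `alock_acquire` are illustrative). Each contender spins on its own padded slot, so a release invalidates at most one other processor's cached line; the problem raised in the last bullet is that the slot a processor gets assigned may live in remote memory, so that spin becomes remote misses on a distributed machine.

```c
#include <stdatomic.h>

#define SLOTS 8    /* must be >= max number of concurrent contenders */
#define LINE  64   /* assumed cache-line size; padding keeps slots apart */

typedef struct {
    struct {
        _Atomic int must_wait;
        char pad[LINE - sizeof(_Atomic int)];   /* one slot per line */
    } slot[SLOTS];
    _Atomic unsigned next;                      /* ticket counter */
} array_lock_t;

void alock_init(array_lock_t *l) {
    for (int i = 0; i < SLOTS; i++)
        atomic_store(&l->slot[i].must_wait, i != 0);
    atomic_store(&l->next, 0);
}

/* Returns the slot index; the caller passes it back to release. */
unsigned alock_acquire(array_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next, 1) % SLOTS;
    while (atomic_load(&l->slot[me].must_wait))
        ;   /* spin on our own slot only */
    return me;
}

void alock_release(array_lock_t *l, unsigned me) {
    atomic_store(&l->slot[me].must_wait, 1);            /* rearm our slot */
    atomic_store(&l->slot[(me + 1) % SLOTS].must_wait, 0);  /* pass baton */
}
```

Because slots are handed out by a global counter, a processor has no control over which array location it spins on, which is exactly why the location may not be local to it.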

Distributed queue locks

  • Goodman, Vernon, Woest (1989)
    • Logically arrange the array as a linked list
    • Allocate a new node (the allocation must be atomic) when a processor enters the acquire phase
    • The node is allocated on the local memory of the contending processor
    • A tail pointer is always maintained
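The structure described above (a linked list of per-processor nodes plus a tail pointer) is also the basis of the software MCS lock, which can serve as a sketch of the idea; the code below is that software analog in C11 atomics, not the Goodman-Vernon-Woest hardware scheme itself. Each processor spins on a node it allocates in its own local memory, and a single tail pointer always names the end of the queue.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the queue */
    _Atomic int                locked; /* 1 while we must wait */
} mcs_node_t;

typedef struct {
    _Atomic(mcs_node_t *) tail;        /* always points to the last node */
} mcs_lock_t;

/* 'me' is a node in the caller's local memory, so the spin below
 * generates no remote traffic once the line is cached. */
void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, 1);
    /* Atomically append our node and learn who was last before us. */
    mcs_node_t *prev = atomic_exchange(&l->tail, me);
    if (prev != NULL) {
        atomic_store(&prev->next, me);  /* link in behind predecessor */
        while (atomic_load(&me->locked))
            ;                           /* spin on our own local node */
    }
    /* prev == NULL: the queue was empty, we hold the lock. */
}

void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No visible successor: try to swing the tail back to empty. */
        mcs_node_t *expect = me;
        if (atomic_compare_exchange_strong(&l->tail, &expect, NULL))
            return;
        /* A successor is mid-enqueue; wait for it to link itself in. */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, 0);     /* hand the lock to the successor */
}
```

The release touches exactly one remote location (the successor's node), so lock hand-off costs O(1) network transactions regardless of how many processors are queued.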