Module 13: "Scalable Multiprocessors"
  Lecture 28: "Scalable Multiprocessors"
 

Common challenges

  • Input buffer overflow; possible responses:
    • Reserve space per source so that overflow can never happen
    • Refuse input when full and NACK the sender; this requires a deadlock-free network and typically a reserved NACK channel
    • Let the network back up naturally: this causes tree saturation, so even traffic not bound to the hot-spot nodes may get severely affected
    • Drop packets, if the network protocol permits it
  • Fetch deadlock
    • A node must continue to service incoming messages even while it waits for queue space to inject its own new requests; otherwise nodes waiting on each other can deadlock
    • Solutions: separate request and response virtual networks (essentially disjoint sets of queues in each port of the router), or provide enough buffer space to never run into the problem (may be too expensive)
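
One way to see why separate request and response networks break the deadlock cycle is to simulate the rule they enforce. The sketch below is my own toy illustration (no real machine's interface): each node has a bounded request input queue and a response input queue, and it always drains responses first, since a sunk response needs no further network resources.

```python
from collections import deque

class Node:
    def __init__(self, capacity):
        self.req_in = deque()        # request virtual network input queue
        self.resp_in = deque()       # response virtual network input queue
        self.capacity = capacity
        self.completed = 0           # responses this node has consumed

    def accept_request(self, msg):
        """Router offers a request; refuse when the request queue is full."""
        if len(self.req_in) >= self.capacity:
            return False             # back-pressure on the request network only
        self.req_in.append(msg)
        return True

    def step(self, peer):
        # Rule 1: always sink pending responses; they generate no new traffic.
        while self.resp_in:
            self.resp_in.popleft()
            self.completed += 1
        # Rule 2: serve one request, which produces a response to the peer.
        if self.req_in:
            msg = self.req_in.popleft()
            peer.resp_in.append(("resp", msg))

# Two nodes flooding each other with requests: request queues may fill and
# refuse input, but responses keep flowing, so progress is never blocked.
a, b = Node(capacity=2), Node(capacity=2)
for i in range(10):
    a.accept_request(("req", i))
    b.accept_request(("req", i))
    a.step(b)
    b.step(a)
a.step(b); b.step(a)                 # drain the remaining responses
assert a.completed > 0 and b.completed > 0
```

Because responses are consumed unconditionally, back-pressure can only build up on the request network, and every outstanding request eventually completes.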

Spectrum of designs

  • In increasing order of hardware support, and roughly of performance and cost
    • Physical bit stream, physical DMA (nCUBE, iPSC)
    • User-level network port (CM-5, MIT *T)
    • User-level handler (MIT J-Machine, Monsoon)
    • Remote virtual address (Intel Paragon, Meiko CS-2)
    • Global physical address (Cray T3D, T3E)
    • Cache-coherent shared memory (SGI Origin, Alpha GS320, Sun S3.mp)

Physical DMA

  • A reserved area in physical memory is used for sending and receiving messages
    • After composing the message in this region, the processor traps to the kernel
    • The trap handler typically copies the data into a kernel area so that it can be manipulated
    • Finally, the kernel instructs the DMA device to push the message into the network using the physical address of the message (typically called the DMA channel address)
    • At the destination, the DMA device deposits the message into a predefined physical memory area without interpreting it (a blind deposit) and generates an interrupt for the processor
    • The interrupt handler can then copy the message into a kernel area and parse it to take the appropriate action
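
The end-to-end path above can be sketched as a toy simulation. All class and method names here are my own illustrative assumptions, not real kernel or device interfaces:

```python
class DMANode:
    def __init__(self):
        self.recv_region = []    # predefined physical area for blind deposits
        self.kernel_buf = []     # kernel-side copies awaiting parsing
        self.handled = []        # payloads the interrupt handler acted on

    # --- send side ---
    def user_send(self, network, dst, payload):
        msg = {"dst": dst, "payload": payload}   # user composes the message
        self._trap_to_kernel(network, msg)       # processor traps to kernel

    def _trap_to_kernel(self, network, msg):
        kernel_copy = dict(msg)                  # kernel copies the data
        network.dma_push(kernel_copy)            # kernel starts the DMA

    # --- receive side ---
    def dma_deposit(self, msg):
        self.recv_region.append(msg)             # blind deposit, no inspection
        self._interrupt()                        # device interrupts the CPU

    def _interrupt(self):
        while self.recv_region:                  # copy into kernel area
            self.kernel_buf.append(self.recv_region.pop(0))
        for msg in self.kernel_buf:              # parse and take action
            self.handled.append(msg["payload"])
        self.kernel_buf.clear()

class Network:
    def __init__(self, nodes):
        self.nodes = nodes
    def dma_push(self, msg):
        self.nodes[msg["dst"]].dma_deposit(msg)  # deliver to destination DMA

nodes = {0: DMANode(), 1: DMANode()}
net = Network(nodes)
nodes[0].user_send(net, dst=1, payload="hello")
assert nodes[1].handled == ["hello"]
```

Note that the data is copied twice (user region to kernel on send, deposit region to kernel on receive), which is a large part of the cost of this design.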

nCUBE/2

  • Independent DMA channels per link direction
  • Segmented messages (first segment can be inspected to decide what to do with the rest)
  • Active messages: 13 μs outbound, 15 μs inbound
  • Dimension-order routing on hypercube
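
Dimension-order (e-cube) routing on a hypercube corrects the differing address bits in a fixed order, lowest dimension first, which avoids cyclic channel dependences. A minimal sketch:

```python
def dimension_order_route(src, dst):
    """Return the sequence of nodes visited going src -> dst on a
    hypercube, correcting differing address bits from dimension 0 up."""
    path = [src]
    cur, diff, dim = src, src ^ dst, 0
    while diff:
        if diff & 1:              # this dimension's bit differs: traverse it
            cur ^= 1 << dim
            path.append(cur)
        diff >>= 1
        dim += 1
    return path

# On a 3-cube, routing 000 -> 101 corrects bit 0, then bit 2.
assert dimension_order_route(0b000, 0b101) == [0b000, 0b001, 0b101]
```

The path length always equals the Hamming distance between the two addresses, i.e. the number of dimensions in which they differ.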

User-level ports

  • Network ports and status registers are memory-mapped in user address space
    • User program can initiate a transaction by composing the message and writing to the status registers
    • Communication assist does the protection check and pushes the message into the physical medium
    • A message at the destination can sit in the input queue until the user program pops it off
    • A system message generates an interrupt through the destination assist, but user messages do not require OS intervention
    • Problem with context switches: in-flight and queued messages are now really part of the process state and must be saved and restored
    • The Thinking Machines CM-5 has an outbound message latency of 1.5 μs and an inbound latency of 1.6 μs
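
The user-level send/receive flow above can be modeled as follows. This is a toy sketch; the register, queue, and class names are my own assumptions, not any machine's actual interface:

```python
from collections import deque

class Assist:
    """Communication assist with memory-mapped registers and input queue."""
    def __init__(self, network, allowed_dsts):
        self.network = network
        self.allowed = allowed_dsts   # protection domain for this process
        self.in_queue = deque()       # user messages wait here until popped
        self.interrupts = []          # system messages interrupt the OS

    def write_send_register(self, dst, payload):
        # User composes the message and writes the register directly;
        # the assist, not the OS, performs the protection check.
        if dst not in self.allowed:
            return False              # protection check failed, not injected
        self.network.inject(dst, payload)
        return True

    def deliver(self, payload):
        if isinstance(payload, tuple) and payload[0] == "sys":
            self.interrupts.append(payload)   # system message: interrupt
        else:
            self.in_queue.append(payload)     # user message: sits in queue

    def pop_message(self):
        # User program pops its own input queue, again with no OS call.
        return self.in_queue.popleft() if self.in_queue else None

class Net:
    def __init__(self):
        self.ports = {}
    def inject(self, dst, payload):
        self.ports[dst].deliver(payload)

net = Net()
a, b = Assist(net, allowed_dsts={1}), Assist(net, allowed_dsts={0})
net.ports = {0: a, 1: b}

assert a.write_send_register(1, "user data")     # allowed destination
assert not a.write_send_register(2, "oops")      # blocked by protection check
assert b.pop_message() == "user data"
net.inject(0, ("sys", "system event"))           # system message interrupts
assert a.interrupts and not a.in_queue
```

The key point the model captures is that the common case, user-to-user messaging, never enters the operating system; only the protection check stands between the user program and the wire.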