Module 13: "Scalable Multiprocessors"
  Lecture 28: "Scalable Multiprocessors"
 

User-level handling

  • Instead of mapping the network ports to user memory, make them processor registers
    • Even faster messaging
    • Communication assist now looks very much like a functional unit inside the processor (just like an FPU)
    • Send and receive are now register to register transfers
    • iWARP from CMU and Intel, *T from MIT and Motorola, J machine from MIT
    • iWARP binds two processor registers to the heads of the network input and output ports; the processor accesses the message word-by-word as it streams in
    • *T extended the Motorola 88110 RISC core to include a network function unit containing dedicated sets of input and output registers; a message is spread over a set of such registers and a special instruction initiates the transfer
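The register-mapped style described above can be sketched with a toy model. This is only an illustration, not the iWARP or *T instruction set: the register names `r_in` and `r_out`, the class `RegisterMappedCpu`, and the helper `forward_doubled` are all hypothetical.

```python
from collections import deque


class RegisterMappedCpu:
    """Toy CPU with two registers bound to the network ports (hypothetical names)."""

    NET_IN, NET_OUT = "r_in", "r_out"

    def __init__(self):
        self.regs = {}            # ordinary general-purpose registers
        self.in_port = deque()    # words arriving from the network
        self.out_port = deque()   # words leaving for the network

    def read(self, reg):
        if reg == self.NET_IN:
            # Reading the bound register consumes the word at the head of the input port.
            return self.in_port.popleft()
        return self.regs[reg]

    def write(self, reg, value):
        if reg == self.NET_OUT:
            # Writing the bound register appends a word to the output port.
            self.out_port.append(value)
        else:
            self.regs[reg] = value


def forward_doubled(cpu, n_words):
    """Process a message word by word as it streams in, with no memory buffer in between."""
    for _ in range(n_words):
        cpu.write(cpu.NET_OUT, 2 * cpu.read(cpu.NET_IN))
```

Send and receive here are plain register reads and writes, which is the point of the design: no load/store to a memory-mapped port is involved.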

Message co-processor

  • Nodes equipped with a dedicated message processor or communication processor (CP)
    • Two possible organizations: either the main processor and CP sit on a shared memory bus along with the main memory and NI, or the CP is integrated into the NI
    • The main processor and CP talk to each other via the normal cache coherence protocol, i.e., while sending a message the main processor fills a shared buffer and sets a flag, and while receiving a message the CP does the same thing in the other direction
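    • Possible inefficiency due to the invalidation-based coherence protocol: every buffer or flag hand-off costs an invalidation followed by a miss (an update-based protocol would suit this producer-consumer pattern better)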
    • CP may need to handle many concurrent transactions, e.g., from the main processor and from the network: a single dispatch loop serializes the processing (multi-threaded CP?)
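A minimal sketch of the buffer-plus-flag handshake and of a single-threaded dispatch loop follows. It models shared memory with plain Python objects and ignores caches entirely; the names `Mailbox` and `cp_dispatch_once` are hypothetical.

```python
from collections import deque


class Mailbox:
    """Shared buffer plus flag, as both processors would see it in shared memory."""

    def __init__(self):
        self.buffer = None
        self.full = False

    def put(self, payload):
        assert not self.full, "producer must wait until the consumer clears the flag"
        self.buffer = payload
        self.full = True   # the flag is set last, after the buffer is filled

    def take(self):
        payload, self.buffer, self.full = self.buffer, None, False
        return payload


def cp_dispatch_once(outbox, inbox, net_in, net_out):
    """One pass of the CP dispatch loop; returns how many events it handled."""
    handled = 0
    if outbox.full:                    # the main processor posted a send
        net_out.append(outbox.take())
        handled += 1
    if net_in and not inbox.full:      # a message arrived from the network
        inbox.put(net_in.popleft())
        handled += 1
    return handled
```

Because the two sources are polled one after the other in the same loop, a slow send delays the handling of an incoming message, which is the serialization concern raised above.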

Intel Paragon

  • One i860XP processor per SMP node (MESI) is dedicated as CP
  • One receive and one transmit DMA engine: the transmit engine moves data from shared memory to the NI transmit queue, and the receive engine moves data from the NI receive queue to shared memory (each queue is 2 KB)
  • While sending a large message the NI queue may become full and the network may not drain that fast
    • To avoid deadlock the transmit DMA is stalled by hardware flow control and the bus is relinquished
  • Takes about 10 μs to send a small message (about two cache lines) from the register file of the source to the register file of the destination
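The stall behavior of the transmit DMA can be illustrated with a small step-based simulation. Only the 2 KB queue size comes from the notes above; the DMA chunk size, the network drain rate, and the function name `send_with_flow_control` are assumptions chosen for illustration.

```python
def send_with_flow_control(message_bytes, queue_capacity=2048,
                           dma_chunk=64, drain_per_step=32):
    """Return (steps, stalls) needed to push a message through the NI transmit queue.

    dma_chunk and drain_per_step are assumed values, not Paragon parameters.
    """
    remaining = message_bytes   # bytes still waiting in shared memory
    queued = 0                  # bytes currently in the NI transmit queue
    steps = stalls = 0
    while remaining > 0 or queued > 0:
        chunk = min(dma_chunk, remaining)
        if chunk and queued + chunk <= queue_capacity:
            queued += chunk     # DMA copies one chunk into the queue
            remaining -= chunk
        elif chunk:
            stalls += 1         # queue full: DMA stalls and gives up the bus
        queued -= min(drain_per_step, queued)   # the network drains some bytes
        steps += 1
    return steps, stalls
```

With these assumed rates, a short message never stalls, while a message much larger than the queue stalls once the queue fills faster than the network drains it.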

Meiko CS-2

  • The CP is tightly integrated with the NI and has separate concurrent units
    • The command unit sits directly on the shared bus and is responsible for fielding processor requests
    • The processor executes an atomic swap between one register and a fixed memory location which is mapped to the head of the CP input queue
    • The command contains a command type and a virtual address (VA)
    • Depending on the command type, the command unit can invoke the DMA processor (which may need assistance from the VA-to-PA translation unit), an event processor (to wake up some thread on the main processor), or a thread processor to construct and issue a message
    • The input processor fields new messages from the network and may invoke the reply processor, or any of the above three units
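The command path above can be sketched as a queue plus a dispatch on the command type. This is a toy model rather than the actual CS-2 command encoding: the `CommandType` values, the 4 KB page size, the page-table dictionary, and the `log` list standing in for the three units are all assumptions.

```python
from collections import deque
from enum import Enum, auto


class CommandType(Enum):
    DMA = auto()
    EVENT = auto()
    THREAD = auto()


PAGE_SIZE = 4096  # assumed page size for the toy VA-to-PA unit


class CommunicationProcessor:
    def __init__(self, page_table):
        self.input_queue = deque()
        self.page_table = page_table   # virtual page number -> physical page number
        self.log = []                  # records which unit was invoked, and with what

    def swap_in(self, command_type, va):
        """Stand-in for the main processor's atomic swap into the input queue."""
        self.input_queue.append((command_type, va))

    def translate(self, va):
        """Toy VA-to-PA unit: look up the page, keep the offset."""
        vpn, offset = divmod(va, PAGE_SIZE)
        return self.page_table[vpn] * PAGE_SIZE + offset

    def run_command_unit(self):
        """Drain the input queue, handing each command to the matching unit."""
        while self.input_queue:
            ctype, va = self.input_queue.popleft()
            if ctype is CommandType.DMA:
                self.log.append(("dma", self.translate(va)))   # DMA works on PAs
            elif ctype is CommandType.EVENT:
                self.log.append(("event", va))
            else:
                self.log.append(("thread", va))
```

In the real design the units run concurrently; the sequential loop here only shows how a command type selects a unit and where address translation fits in.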

Shared physical address

  • Memory controller on each node accepts PAs from the system bus
    • The processor initially issues a VA
    • The TLB provides the translation and the upper few bits of PA represent the home node for this address (determined when the mapping is established for the first time)
    • If the address is local, i.e., the requester is the home node, the memory controller returns data just as in a uniprocessor
    • If the address is remote, the memory controller instructs the communication assist (essentially the NI) to generate a remote memory request
    • At the remote home node, the CA issues a request to its memory controller to read memory, and the data is eventually returned to the requester
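The local-versus-remote decision above can be sketched as follows. The 4-bit node field, the 28-bit offset, and the `Node` class are assumptions for illustration; the TLB step is omitted, so the code starts from an already translated PA, and a direct method call stands in for the network round trip.

```python
NODE_BITS = 4      # assumed: upper 4 bits of the PA name the home node
OFFSET_BITS = 28   # assumed: remaining 28 bits address memory within that node


def home_node(pa):
    return pa >> OFFSET_BITS


def local_offset(pa):
    return pa & ((1 << OFFSET_BITS) - 1)


class Node:
    def __init__(self, node_id, cluster):
        self.node_id = node_id
        self.cluster = cluster      # shared list of all nodes, stands in for the network
        self.memory = {}            # local memory: offset -> value
        self.remote_requests = 0    # requests handed to the communication assist

    def load(self, pa):
        """Memory controller: serve locally, or forward to the home node."""
        home = home_node(pa)
        if home == self.node_id:
            return self.memory.get(local_offset(pa), 0)
        self.remote_requests += 1
        # The CA at the home node asks its own memory controller for the data.
        return self.cluster[home].load(pa)
```

For example, with four nodes, a load of PA `(2 << 28) | 0x100` issued on node 0 is forwarded to node 2, while the same load issued on node 2 is served from local memory.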