Module 12: "Multiprocessors on a Snoopy Bus"
  Lecture 27: "Scalable Snooping and AMD Hammer Protocol"
 

TLB coherence

  • A page table entry (PTE) residing in shared memory may be cached by multiple processors because all of them access the same shared page
    • A PTE may get modified when the page is swapped out and/or access permissions are changed
    • Must tell all processors holding this PTE to invalidate their copies
    • How to do it efficiently?
  • No TLB: virtually indexed virtually tagged L1 caches
    • On an L1 miss, access the PTE directly in memory and bring it into the cache; normal cache coherence then applies because the PTEs also reside in the shared memory segment
    • On page replacement the page fault handler can flush the cache line containing the replaced PTE
    • Too impractical: fully virtual caches are rare, and even a design that uses one (e.g., the Alpha 21264 instruction cache) still needs a TLB for the remaining levels of the hierarchy
  • Hardware solution
    • Extend snoop logic to handle TLB coherence
    • The PowerPC family provides a tlbie (TLB invalidate entry) instruction
    • When the OS modifies a PTE, it puts a tlbie transaction on the bus
    • The snoop logic of every processor picks it up and invalidates the TLB entry if it is present
    • This is well suited to bus-based SMPs, but not to DSMs, because broadcast in a large-scale machine does not scale; a minimal sketch of the sequence follows
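
The bus transaction above corresponds to a short instruction sequence on the processor that modifies the PTE. Below is a minimal sketch in C with GCC inline assembly; tlbie, tlbsync, and sync are real PowerPC instructions, but the exact barrier sequence varies across implementations, and make_pte_invalid() and the pte pointer are hypothetical:

    /* Sketch: modify a PTE, then broadcast the TLB invalidate on the bus.
       PowerPC-only; barrier requirements vary by implementation. */
    extern unsigned long make_pte_invalid(unsigned long pte); /* hypothetical */

    static inline void pte_invalidate_bcast(unsigned long *pte,
                                            unsigned long vaddr)
    {
        *pte = make_pte_invalid(*pte);  /* OS modifies the PTE in memory   */
        __asm__ volatile(
            "sync\n\t"                  /* order the PTE store first       */
            "tlbie %0\n\t"              /* put the invalidate on the bus   */
            "sync\n\t"
            "tlbsync\n\t"               /* wait for all snoopers to finish */
            "sync"
            : : "r"(vaddr) : "memory");
    }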

TLB shootdown

  • Popular TLB coherence solution
    • Invoked by an initiator (the processor that modifies the PTE), which sends an interrupt to the processors that might be caching the PTE in their TLBs; before doing so, the OS locks the involved PTE so that TLB misses from other processors cannot access it in the meantime (see the sketch after this list)
    • The receiver of the interrupt simply invalidates the involved PTE if it is present in its TLB and sets a flag in shared memory on which the initiator polls
    • On completion the initiator unlocks the PTE
    • The SGI Origin uses lazy TLB shootdown, i.e., it invalidates a TLB entry only when a processor next tries to access it (will be discussed in detail)
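
A minimal sketch of the shootdown flow in C11, assuming hypothetical helpers send_ipi() and local_tlb_invalidate(), a fixed NCPUS, and one ack flag per processor in shared memory (none of these are a real kernel API):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NCPUS 8

    extern void send_ipi(int cpu);                 /* hypothetical: deliver interrupt */
    extern void local_tlb_invalidate(void *vaddr); /* hypothetical: drop local entry  */

    static atomic_bool ack[NCPUS];                  /* completion flags in shared memory  */
    static atomic_flag pte_lock = ATOMIC_FLAG_INIT; /* blocks TLB-miss refills of the PTE */

    /* Initiator: the processor that modified the PTE. */
    void shootdown(void *vaddr, int self)
    {
        while (atomic_flag_test_and_set(&pte_lock))
            ;                                       /* lock the involved PTE           */
        for (int cpu = 0; cpu < NCPUS; cpu++) {
            if (cpu == self) continue;
            atomic_store(&ack[cpu], false);
            send_ipi(cpu);                          /* interrupt potential sharers     */
        }
        local_tlb_invalidate(vaddr);
        for (int cpu = 0; cpu < NCPUS; cpu++)       /* poll until every receiver acks  */
            if (cpu != self)
                while (!atomic_load(&ack[cpu]))
                    ;
        atomic_flag_clear(&pte_lock);               /* unlock the PTE on completion    */
    }

    /* Receiver: runs in the interrupt handler on every other processor. */
    void shootdown_handler(void *vaddr, int self)
    {
        local_tlb_invalidate(vaddr);                /* harmless if the entry is absent */
        atomic_store(&ack[self], true);             /* flag the initiator polls on     */
    }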

Snooping on a ring

  • The length of the bus limits the frequency at which it can be clocked, which in turn limits the bandwidth the bus can offer, and hence the number of processors
  • A ring interconnect provides a better solution
    • Connect a processor only to its two neighbors
    • Short wires, much higher switching frequency, better bandwidth, more processors
    • Each node has private local memory (more like a distributed shared memory multiprocessor)
    • Every cache line has a home node, i.e., the node whose memory holds the line; the home can be determined from the upper few bits of the physical address (PA)
    • Transactions traverse the ring node by node
  • Snoop mechanism
    • When a transaction passes by the ring interface of a node it snoops the transaction, takes appropriate coherence actions, and forwards the transaction to its neighbor if necessary
    • The home node also receives the transaction eventually; let us assume that it has a dirty bit associated with every memory line (otherwise a two-phase protocol is needed)
    • A request transaction is removed from the ring when it comes back to the requester (serves as an acknowledgment that every node has seen the request)
    • The ring is essentially divided into time slots in which a node can insert a new request or response; if there is no free time slot, it must wait until one passes by: this is called a slotted ring (see the sketch after this list)
  • The snoop logic must be able to finish coherence actions for a transaction before the next time slot arrives
  • The main problem with a ring is the end-to-end latency, since transactions must traverse it hop by hop
  • Serialization and sequential consistency are trickier
    • Two processors may see the order of two transactions differently if the source of one transaction sits between the two processors on the ring
    • The home node can resort to NACKs if it sees conflicting outstanding requests
    • This introduces many races into the protocol
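
To make the ring mechanics concrete, here is a toy model of a slotted ring in C: home_node() extracts the upper PA bits, try_insert() models waiting for a free slot, and ring_tick() advances every slot one hop, lets each node snoop the passing transaction, and removes a request once it returns to its source. All sizes and names are invented for illustration; the actual coherence actions are left as comments:

    #include <stdbool.h>

    #define NODES     8      /* ring positions; one time slot per node   */
    #define PA_BITS   40
    #define NODE_BITS 3      /* log2(NODES); upper PA bits give the home */

    typedef struct { bool valid; int src; unsigned long addr; } Slot;
    static Slot ring[NODES]; /* slot i is currently at node i            */

    /* Home node = upper few bits of the physical address. */
    static int home_node(unsigned long pa)
    {
        return (int)((pa >> (PA_BITS - NODE_BITS)) & (NODES - 1));
    }

    /* A node can insert a request only into a free slot passing by;
       otherwise it waits for the next slot (slotted ring). */
    static bool try_insert(int node, unsigned long addr)
    {
        if (ring[node].valid)
            return false;
        ring[node] = (Slot){ true, node, addr };
        return true;
    }

    /* One slot time: slots advance one hop, each node snoops the slot
       at its interface (it must finish before the next slot arrives),
       and a request is removed when it returns to the requester, which
       serves as the acknowledgment that every node has seen it. */
    static void ring_tick(void)
    {
        Slot last = ring[NODES - 1];
        for (int i = NODES - 1; i > 0; i--)
            ring[i] = ring[i - 1];
        ring[0] = last;

        for (int node = 0; node < NODES; node++) {
            Slot *s = &ring[node];
            if (!s->valid)
                continue;
            if (s->src == node) {
                s->valid = false;              /* back at requester: remove */
            } else if (node == home_node(s->addr)) {
                /* home node: consult the per-line dirty bit; respond from
                   memory if the line is clean */
            } else {
                /* other nodes: snoop and take coherence actions if needed */
            }
        }
    }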