TLB coherence
- A page table entry (PTE) resides in shared memory and may be cached by multiple processors because all of them access the same shared page
- A PTE may get modified when the page is swapped out and/or its access permissions are changed
- Must tell all processors holding this PTE to invalidate it
- How to do it efficiently?
- No TLB: virtually indexed, virtually tagged L1 caches
- On an L1 miss, directly access the PTE in memory and bring it into the cache; normal cache coherence then applies because the PTEs also reside in the shared memory segment
- On page replacement, the page fault handler can flush the cache line containing the replaced PTE
- Too impractical: fully virtual caches are rare (the Alpha 21264 instruction cache is one example), and a TLB is still used for the upper levels
- Hardware solution
- Extend snoop logic to handle TLB coherence
- The PowerPC family provides a tlbie (TLB invalidate entry) instruction
- When the OS modifies a PTE, it issues a tlbie, which is broadcast on the bus (a sketch follows this list)
- The snoop logic in every processor picks it up and invalidates the matching TLB entry if present
- This is well suited to bus-based SMPs, but not to DSMs, because broadcast does not scale in a large machine
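As a rough illustration, the fragment below shows how an OS on a bus-based PowerPC SMP might carry out such a PTE update. It is a minimal sketch assuming an older single-operand tlbie encoding; the pte_t type and update_pte() interface are invented here, while tlbie, eieio, tlbsync, and ptesync are real PowerPC instructions.

    #include <stdint.h>

    typedef uint64_t pte_t;   /* hypothetical PTE layout */

    /* Hypothetical interface: write a new PTE value, then invalidate the
     * stale translation on every processor via the bus-snooped tlbie. */
    void update_pte(volatile pte_t *pte, pte_t new_val, uintptr_t vaddr)
    {
        *pte = new_val;                           /* modify the PTE in shared memory   */
        __asm__ volatile("ptesync" ::: "memory"); /* order the PTE store before tlbie  */
        __asm__ volatile("tlbie %0"               /* broadcast the invalidation on the */
                         : : "r"(vaddr)           /* bus; each processor's snoop logic */
                         : "memory");             /* drops its matching TLB entry      */
        __asm__ volatile("eieio; tlbsync; ptesync"/* wait until every processor has    */
                         ::: "memory");           /* completed the invalidation        */
    }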
TLB shootdown
- A popular TLB coherence solution
- Invoked by an initiator (the processor that modifies the PTE), which sends an interrupt to every processor that might be caching the PTE in its TLB; before doing so, the OS also locks the involved PTE so that TLB misses from other processors cannot access it in the meantime
- Each receiver of the interrupt simply invalidates the involved PTE if it is present in its TLB and sets a flag in shared memory on which the initiator polls
- On completion, the initiator unlocks the PTE (see the sketch after this list)
- The SGI Origin uses a lazy TLB shootdown, i.e., it invalidates a TLB entry only when the processor next tries to access it (will be discussed in detail later)
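The pseudo-C below sketches this handshake under stated assumptions: send_ipi(), lock_pte(), unlock_pte(), and tlb_invalidate_entry() are hypothetical helpers (not a real kernel API), and the per-CPU flags live in shared memory as described above.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_CPUS 64
    extern int  ncpus;
    extern void send_ipi(int cpu);                 /* hypothetical: raise interrupt   */
    extern void lock_pte(unsigned long vaddr);     /* hypothetical: block TLB refills */
    extern void unlock_pte(unsigned long vaddr);
    extern void tlb_invalidate_entry(unsigned long vaddr);

    static atomic_bool   done[MAX_CPUS];           /* per-CPU flags in shared memory  */
    static unsigned long victim;                   /* page whose PTE is changing; a   */
                                                   /* real kernel would order this    */
                                                   /* store explicitly as well        */

    void shootdown_initiator(int self, unsigned long vaddr, void (*modify_pte)(void))
    {
        lock_pte(vaddr);                  /* no TLB miss may reload this PTE for now */
        victim = vaddr;
        modify_pte();                     /* e.g., swap the page out or change perms */
        for (int c = 0; c < ncpus; c++) {
            if (c == self) continue;
            atomic_store(&done[c], false);
            send_ipi(c);                  /* interrupt every possible TLB holder     */
        }
        for (int c = 0; c < ncpus; c++) { /* poll the flags until all have responded */
            if (c == self) continue;
            while (!atomic_load(&done[c]))
                ;                         /* spin */
        }
        unlock_pte(vaddr);                /* shootdown complete                      */
    }

    void shootdown_handler(int self)      /* runs in each receiver's IPI handler     */
    {
        tlb_invalidate_entry(victim);     /* drop the PTE if it is cached locally    */
        atomic_store(&done[self], true);  /* signal the polling initiator            */
    }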
Snooping on a ring
- The length of the bus limits the frequency at which it can be clocked, which in turn limits the bandwidth the bus can offer, and hence the number of processors it can support
- A ring interconnect provides a better solution
- Connect a processor only to its two neighbors
- Short wires, much higher switching frequency, better bandwidth, more processors
- Each node has private local memory (more like a distributed shared memory multiprocessor)
- Every cache line has a home node, i.e., the node whose memory holds the line; the home can be determined from the upper few bits of the physical address (see the sketch at the end of this section)
- Transactions traverse the ring node by node
- Snoop mechanism
- When a transaction passes by a node's ring interface, the node snoops it, takes the appropriate coherence actions, and forwards the transaction to its neighbor if necessary
- The home node also receives the transaction eventually; assume it maintains a dirty bit for every memory line (otherwise a two-phase protocol is needed)
- A request transaction is removed from the ring when it comes back to the requester, which serves as an acknowledgment that every node has seen the request
- The ring is divided into time slots in which a node can insert a new request or response; if there is no free slot, the node must wait until one passes by: this is called a slotted ring
- The snoop logic must be able to finish coherence actions for a transaction before the next time slot arrives
- The main problem with a ring is the end-to-end latency, since transactions must traverse it hop by hop
- Serialization and sequential consistency are trickier
- Two processors may see the order of two transactions differently if the source of one transaction sits between the two processors on the ring
- The home node can resort to NACKs if it sees conflicting outstanding requests
- Introduces many races in the protocol
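To make the home-node computation and the per-slot snoop step concrete, here is a minimal sketch; the physical address width, node count, transaction layout, and snoop_and_act() hook are all assumptions made for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define PA_BITS   40                  /* assumed physical address width */
    #define NODE_BITS 4                   /* assumed 16-node ring           */

    typedef struct {
        int      requester;               /* node that inserted the transaction */
        uint64_t pa;                      /* physical address of the cache line */
        bool     is_request;              /* request vs. response               */
    } txn_t;

    /* The home node owns the memory holding this line; it is determined
     * by the upper few bits of the physical address. */
    static int home_node(uint64_t pa)
    {
        return (int)(pa >> (PA_BITS - NODE_BITS));
    }

    extern void snoop_and_act(int node, txn_t *t, bool is_home); /* hypothetical:
        check local cache state, invalidate or supply data, and update the
        per-line dirty bit when this node is the home */

    /* Called once per time slot when a transaction passes this node's ring
     * interface; returns true if the transaction should be forwarded to the
     * downstream neighbor. Must complete before the next slot arrives. */
    bool snoop_slot(int self, txn_t *t)
    {
        if (t->is_request && t->requester == self)
            return false;                 /* request came back around: remove it;
                                             this acks that every node saw it   */
        snoop_and_act(self, t, self == home_node(t->pa));
        return true;                      /* forward hop by hop around the ring */
    }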