Module 12: "Multiprocessors on a Snoopy Bus"
  Lecture 27: "Scalable Snooping and AMD Hammer Protocol"
 

TLB coherence

  • A page table entry (PTE) residing in shared memory may be cached by multiple processors because all of them access the same shared page
    • A PTE may get modified when the page is swapped out and/or access permissions are changed
    • Must tell all processors holding this PTE to invalidate their copies
    • How to do it efficiently?
  • No TLB: virtually indexed virtually tagged L1 caches
    • On an L1 miss, access the PTE directly in memory and bring it into the cache; normal cache coherence then applies because the PTEs also reside in the shared memory segment
    • On page replacement the page fault handler can flush the cache line containing the replaced PTE
    • Too impractical: fully virtual caches are rare, and even a design that uses one (e.g., the Alpha 21264 instruction cache) still needs a TLB for the remaining levels of the hierarchy
  • Hardware solution
    • Extend snoop logic to handle TLB coherence
    • The PowerPC family provides a tlbie (TLB invalidate entry) instruction
    • When the OS modifies a PTE, it puts a tlbie transaction on the bus
    • The snoop logic of every processor picks it up and invalidates the TLB entry if it is present
    • This is well suited to bus-based SMPs, but not to DSMs, because broadcast in a large-scale machine does not scale; a minimal sketch of the sequence follows
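
The bus transaction above corresponds to a short instruction sequence on the processor that modifies the PTE. Below is a minimal sketch in C with GCC inline assembly; tlbie, tlbsync, and sync are real PowerPC instructions, but the exact barrier sequence varies across implementations, and make_pte_invalid() and the pte pointer are hypothetical:

    /* Sketch: modify a PTE, then broadcast the TLB invalidate on the bus.
       PowerPC-only; barrier requirements vary by implementation. */
    extern unsigned long make_pte_invalid(unsigned long pte); /* hypothetical */

    static inline void pte_invalidate_bcast(unsigned long *pte,
                                            unsigned long vaddr)
    {
        *pte = make_pte_invalid(*pte);  /* OS modifies the PTE in memory   */
        __asm__ volatile(
            "sync\n\t"                  /* order the PTE store first       */
            "tlbie %0\n\t"              /* put the invalidate on the bus   */
            "sync\n\t"
            "tlbsync\n\t"               /* wait for all snoopers to finish */
            "sync"
            : : "r"(vaddr) : "memory");
    }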

TLB shootdown

  • Popular TLB coherence solution
    • Invoked by an initiator (the processor that modifies the PTE), which sends an interrupt to the processors that might be caching the PTE in their TLBs; before doing so, the OS locks the involved PTE so that TLB misses from other processors cannot access it in the meantime (see the sketch after this list)
    • The receiver of the interrupt simply invalidates the involved PTE if it is present in its TLB and sets a flag in shared memory on which the initiator polls
    • On completion the initiator unlocks the PTE
    • The SGI Origin uses lazy TLB shootdown, i.e., it invalidates a TLB entry only when a processor next tries to access it (will be discussed in detail)
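
A minimal sketch of the shootdown flow in C11, assuming hypothetical helpers send_ipi() and local_tlb_invalidate(), a fixed NCPUS, and one ack flag per processor in shared memory (none of these are a real kernel API):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NCPUS 8

    extern void send_ipi(int cpu);                 /* hypothetical: deliver interrupt */
    extern void local_tlb_invalidate(void *vaddr); /* hypothetical: drop local entry  */

    static atomic_bool ack[NCPUS];                  /* completion flags in shared memory  */
    static atomic_flag pte_lock = ATOMIC_FLAG_INIT; /* blocks TLB-miss refills of the PTE */

    /* Initiator: the processor that modified the PTE. */
    void shootdown(void *vaddr, int self)
    {
        while (atomic_flag_test_and_set(&pte_lock))
            ;                                       /* lock the involved PTE           */
        for (int cpu = 0; cpu < NCPUS; cpu++) {
            if (cpu == self) continue;
            atomic_store(&ack[cpu], false);
            send_ipi(cpu);                          /* interrupt potential sharers     */
        }
        local_tlb_invalidate(vaddr);
        for (int cpu = 0; cpu < NCPUS; cpu++)       /* poll until every receiver acks  */
            if (cpu != self)
                while (!atomic_load(&ack[cpu]))
                    ;
        atomic_flag_clear(&pte_lock);               /* unlock the PTE on completion    */
    }

    /* Receiver: runs in the interrupt handler on every other processor. */
    void shootdown_handler(void *vaddr, int self)
    {
        local_tlb_invalidate(vaddr);                /* harmless if the entry is absent */
        atomic_store(&ack[self], true);             /* flag the initiator polls on     */
    }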

Snooping on a ring

  • The length of the bus limits the frequency at which it can be clocked, which in turn limits the bandwidth the bus can offer, and hence the number of processors
  • A ring interconnect provides a better solution
    • Connect a processor only to its two neighbors
    • Short wires, much higher switching frequency, better bandwidth, more processors
    • Each node has private local memory (more like a distributed shared memory multiprocessor)
    • Every cache line has a home node, i.e., the node whose memory holds the line; the home can be determined from the upper few bits of the physical address (PA)
    • Transactions traverse the ring node by node
  • Snoop mechanism
    • When a transaction passes by the ring interface of a node it snoops the transaction, takes appropriate coherence actions, and forwards the transaction to its neighbor if necessary
    • The home node also receives the transaction eventually; let us assume that it has a dirty bit associated with every memory line (otherwise a two-phase protocol is needed)
    • A request transaction is removed from the ring when it comes back to the requester (serves as an acknowledgment that every node has seen the request)
    • The ring is essentially divided into time slots in which a node can insert a new request or response; if there is no free time slot, it must wait until one passes by: this is called a slotted ring (see the sketch after this list)
  • The snoop logic must be able to finish coherence actions for a transaction before the next time slot arrives
  • The main problem with a ring is the end-to-end latency, since transactions must traverse it hop by hop
  • Serialization and sequential consistency are trickier
    • Two processors may see the order of two transactions differently if the source of one transaction sits between the two processors on the ring
    • The home node can resort to NACKs if it sees conflicting outstanding requests
    • This introduces many races into the protocol
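
To make the ring mechanics concrete, here is a toy model of a slotted ring in C: home_node() extracts the upper PA bits, try_insert() models waiting for a free slot, and ring_tick() advances every slot one hop, lets each node snoop the passing transaction, and removes a request once it returns to its source. All sizes and names are invented for illustration; the actual coherence actions are left as comments:

    #include <stdbool.h>

    #define NODES     8      /* ring positions; one time slot per node   */
    #define PA_BITS   40
    #define NODE_BITS 3      /* log2(NODES); upper PA bits give the home */

    typedef struct { bool valid; int src; unsigned long addr; } Slot;
    static Slot ring[NODES]; /* slot i is currently at node i            */

    /* Home node = upper few bits of the physical address. */
    static int home_node(unsigned long pa)
    {
        return (int)((pa >> (PA_BITS - NODE_BITS)) & (NODES - 1));
    }

    /* A node can insert a request only into a free slot passing by;
       otherwise it waits for the next slot (slotted ring). */
    static bool try_insert(int node, unsigned long addr)
    {
        if (ring[node].valid)
            return false;
        ring[node] = (Slot){ true, node, addr };
        return true;
    }

    /* One slot time: slots advance one hop, each node snoops the slot
       at its interface (it must finish before the next slot arrives),
       and a request is removed when it returns to the requester, which
       serves as the acknowledgment that every node has seen it. */
    static void ring_tick(void)
    {
        Slot last = ring[NODES - 1];
        for (int i = NODES - 1; i > 0; i--)
            ring[i] = ring[i - 1];
        ring[0] = last;

        for (int node = 0; node < NODES; node++) {
            Slot *s = &ring[node];
            if (!s->valid)
                continue;
            if (s->src == node) {
                s->valid = false;              /* back at requester: remove */
            } else if (node == home_node(s->addr)) {
                /* home node: consult the per-line dirty bit; respond from
                   memory if the line is clean */
            } else {
                /* other nodes: snoop and take coherence actions if needed */
            }
        }
    }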