Module 12: "Multiprocessors on a Snoopy Bus"
  Lecture 25: "Protocols for Split-transaction Buses"
 

Invalidation acks?

  • On a BusRdX or BusUpgr, in case of a snoop hit in S state, the L2 cache sends an invalidation to the L1 caches
    • Does the snoop logic wait for an invalidation acknowledgment from L1 cache before the transaction can be marked complete?
    • Do we need a two-phase mechanism?
    • What are the issues?
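The two choices raised above can be contrasted in a small sketch. This is an illustrative model only, not a description of any real controller: the class names `L1` and `L2Snoop`, the `wait_for_ack` flag, and the invalidation queue are all assumptions made for the example. With `wait_for_ack=True` the transaction is marked complete only in a second phase, after L1 has applied the invalidation; with `wait_for_ack=False` it is marked complete as soon as the invalidation is queued, so L1 may still hold the stale line for a while.

```python
from collections import deque


class L1:
    """Hypothetical L1 cache with a queue of incoming invalidations."""

    def __init__(self):
        self.lines = set()          # addresses currently valid in L1
        self.inval_queue = deque()  # invalidations received but not yet applied

    def drain_one(self):
        """Apply the oldest queued invalidation and return its address (the ack)."""
        addr = self.inval_queue.popleft()
        self.lines.discard(addr)
        return addr


class L2Snoop:
    """Hypothetical L2 snoop logic; wait_for_ack selects the two-phase scheme."""

    def __init__(self, l1, wait_for_ack):
        self.l1 = l1
        self.wait_for_ack = wait_for_ack
        self.pending = set()   # invalidation sent, ack not yet received
        self.complete = set()  # transactions marked complete

    def snoop_invalidate(self, addr):
        """BusRdX/BusUpgr hit in S state: forward an invalidation to L1."""
        self.l1.inval_queue.append(addr)
        if self.wait_for_ack:
            self.pending.add(addr)    # phase 1: invalidation issued
        else:
            self.complete.add(addr)   # complete as soon as it is queued

    def ack(self, addr):
        """Phase 2: L1 reports that the invalidation has been applied."""
        if addr in self.pending:
            self.pending.discard(addr)
            self.complete.add(addr)
```

In the single-phase case the sketch makes the issue visible: the transaction is in `complete` while the address is still in `l1.lines`, so correctness then depends on L1 applying queued invalidations before the processor can observe anything that conflicts with the new write.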

Intervention races

  • Writebacks introduce new races in a multi-level cache hierarchy
    • Suppose L2 sends a read intervention to L1 and in the meantime L1 decides to replace that line (due to some conflicting processor access)
    • The intervention will naturally miss the up-to-date copy
    • When the writeback arrives at L2, L2 realizes that the intervention race has occurred (need extra hardware to implement this logic; what hardware?)
    • When the intervention reply arrives from L1, L2 can apply the newly received writeback and launch the line on the bus
    • Exactly the same situation may arise even in a uniprocessor if a dirty replacement from L2 misses the line in L1 because L1 has just replaced that line too
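One possible answer to the "what hardware?" question is a small table of outstanding interventions that every incoming writeback address is compared against. The sketch below models that idea under stated assumptions: `L2Controller`, its method names, and the `held_wb` buffer are hypothetical, and the writeback is assumed to reach L2 before the intervention reply, as in the sequence described above.

```python
class L2Controller:
    """Hypothetical L2 controller that detects intervention/writeback races."""

    def __init__(self, lines):
        self.lines = dict(lines)  # addr -> data held in L2 (possibly stale)
        self.outstanding = set()  # addresses with an intervention in flight
        self.held_wb = {}         # writebacks that raced with an intervention
        self.bus = []             # (addr, data) pairs launched on the bus

    def send_intervention(self, addr):
        """Record the intervention so later writebacks can be matched against it."""
        self.outstanding.add(addr)

    def on_writeback(self, addr, data):
        """An L1 writeback arrives; hold it if an intervention is outstanding."""
        if addr in self.outstanding:
            self.held_wb[addr] = data  # race detected: keep data until the reply
        else:
            self.lines[addr] = data    # ordinary writeback

    def on_intervention_reply(self, addr, l1_hit, data=None):
        """The intervention reply arrives; pick the up-to-date copy and launch it."""
        self.outstanding.discard(addr)
        held_present = addr in self.held_wb
        held = self.held_wb.pop(addr, None)
        if l1_hit:
            self.lines[addr] = data    # L1 still had the line
        elif held_present:
            self.lines[addr] = held    # apply the raced writeback
        self.bus.append((addr, self.lines[addr]))
```

For example, if L2 holds a stale copy of line 0x80, sends an intervention, and then receives the writeback followed by a miss reply, the line launched on the bus carries the writeback data rather than the stale L2 copy.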

Tag RAM design

  • A multi-level cache hierarchy reduces tag contention
    • L1 tags are mostly accessed by the processor because L2 cache acts as a filter for external requests
    • L2 tags are mostly accessed by the system because hopefully L1 cache can absorb most of the processor traffic
    • Still, some machines maintain duplicate tags, either at all levels or at the outermost level only
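The effect of duplicate tags on contention can be illustrated with a toy access counter. This is only a sketch: `TagArray` and its port counters are assumptions made for the example, and a fill is assumed to update both tag copies.

```python
class TagArray:
    """Hypothetical tag RAM that counts accesses per tag copy."""

    def __init__(self, duplicate_tags):
        self.tags = set()
        self.duplicate_tags = duplicate_tags
        self.main_port_accesses = 0  # accesses to the processor-side tag copy
        self.dup_port_accesses = 0   # accesses to the snoop-side duplicate copy

    def processor_lookup(self, tag):
        self.main_port_accesses += 1
        return tag in self.tags

    def snoop_lookup(self, tag):
        # With duplicate tags, snoops do not touch the processor-side copy.
        if self.duplicate_tags:
            self.dup_port_accesses += 1
        else:
            self.main_port_accesses += 1
        return tag in self.tags

    def fill(self, tag):
        # A fill changes the tag state, so every copy has to be updated.
        self.tags.add(tag)
        self.main_port_accesses += 1
        if self.duplicate_tags:
            self.dup_port_accesses += 1
```

With a single tag copy, every snoop lookup competes with the processor for the same port; with a duplicate copy, only state-changing operations such as fills touch both.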

Exclusive cache levels

  • AMD K7 (Athlon XP) and K8 (Athlon64, Opteron) architectures chose to have exclusive levels of caches instead of inclusive
    • Provides much better utilization of on-chip cache capacity since no line is duplicated across levels
    • But complicates many issues related to coherence
    • The uniprocessor protocol: refill requested lines directly into L1 without placing a copy in L2; put a line into L2 only when L1 evicts it
    • On an L1 miss, look up L2; in case of an L2 hit, remove the line from L2 and put it in L1
    • This may require replacing multiple L1 lines to accommodate the full L2 line (not sure what K8 does; it is possible to maintain an inclusion bit per L1 line sector in the L2 cache)
    • For multiprocessors, one solution could be to have one snoop engine per cache level and tournament logic that selects the successful snoop result
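The uniprocessor fill and eviction rules can be sketched as a minimal two-level model. Several simplifications are assumptions of this example and not claims about K7 or K8: both levels use the same line size (so the multiple-L1-line case does not arise), both are fully associative with LRU replacement, and dirty state is not modeled. The `snoop` method is a very rough stand-in for the per-level snoop engines plus selection logic; since the levels are exclusive, at most one of them can report a hit.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Hypothetical exclusive L1/L2 pair, fully associative with LRU."""

    def __init__(self, l1_capacity, l2_capacity):
        self.l1 = OrderedDict()  # addr -> True, oldest entry first
        self.l2 = OrderedDict()
        self.l1_capacity = l1_capacity
        self.l2_capacity = l2_capacity

    def _insert_l2(self, addr):
        if len(self.l2) >= self.l2_capacity:
            self.l2.popitem(last=False)  # L2 victim leaves the hierarchy
        self.l2[addr] = True

    def _insert_l1(self, addr):
        if len(self.l1) >= self.l1_capacity:
            victim, _ = self.l1.popitem(last=False)
            self._insert_l2(victim)      # only L1 evictions allocate in L2
        self.l1[addr] = True

    def access(self, addr):
        if addr in self.l1:
            self.l1.move_to_end(addr)    # refresh LRU position
            return "L1 hit"
        if addr in self.l2:
            del self.l2[addr]            # move the line: no copy stays in L2
            self._insert_l1(addr)
            return "L2 hit"
        self._insert_l1(addr)            # refill goes to L1 only
        return "miss"

    def snoop(self, addr):
        """Query both levels and return the level that hit, or None."""
        hits = [name for name, level in (("L1", self.l1), ("L2", self.l2))
                if addr in level]
        return hits[0] if hits else None
```

After any sequence of `access` calls, the key sets of `l1` and `l2` stay disjoint, which is the exclusion property the protocol is meant to maintain.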