Module 4: "Recap: Virtual Memory and Caches"
  Lecture 8: "Cache Hierarchy and Memory-level Parallelism"
 

The first instruction

  • Accessing the first instruction
    • Take the starting PC
    • Access iTLB with the VPN extracted from PC: iTLB miss
    • Invoke iTLB miss handler
    • Calculate PTE address
    • PTEs may be cached in the L1 data and L2 caches, so look them up with the PTE address: these lookups also miss (the caches are cold)
    • Access page table in main memory: PTE is invalid: page fault
    • Invoke page fault handler
    • Allocate a page frame, read the page from disk, update the PTE, load the PTE into the iTLB, restart fetch
  • Now you have the physical address
    • Access Icache: miss
    • Send refill request to higher levels: you miss everywhere
    • Send request to memory controller (north bridge)
    • Access main memory
    • Read cache line
    • Refill all levels of cache as the cache line returns to the processor
    • Extract the appropriate instruction from the cache line with the block offset
  • This is the longest possible latency in an instruction/data access

TLB access

  • For every cache access (instruction or data) the TLB must be accessed first
  • This puts the TLB on the critical path
  • Want to start indexing into the cache and reading the tags while the TLB lookup takes place
    • Virtually indexed physically tagged cache
    • Extract index from the VA, start reading tag while looking up TLB
    • Once the PA is available do tag comparison
    • Overlaps TLB reading and tag reading

Memory op latency

  • L1 hit: ~1 ns
  • L2 hit: ~5 ns
  • L3 hit: ~10-15 ns
  • Main memory: ~70 ns DRAM access time + bus transfer etc. = ~110-120 ns
  • If a load misses in all caches it will eventually come to the head of the ROB and block instruction retirement (in-order retirement is a must)
  • Gradually, the pipeline backs up and the processor runs out of resources such as ROB entries and physical registers
  • Ultimately, the fetcher stalls: severely limits ILP