Module 14: "Directory-based Cache Coherence"
  Lecture 31: "Managing Directory Overhead"
 

Latency tolerance

  • Page placement
    • Application-directed page placement is often used to minimize the number of remote misses
    • The application provides the kernel (via a system call) with the starting and ending addresses of a chunk of memory (a multiple of the page size) and the number of the node where these pages should be mapped
    • Thus, based on the observed sharing pattern, an application writer can specify which shared pages to map in a node's local memory (private pages and stack pages are mapped to local memory by default)
    • The page fault handler of a NUMA kernel is normally equipped with some default policies, e.g., round-robin mapping or first-touch mapping
    • Examples: matrix-vector multiplication, matrix transpose (a sketch using Linux's libnuma follows this list)
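As an illustration of the mechanism described above, here is a minimal sketch of application-directed page placement on Linux using the libnuma library, a user-level wrapper over the kernel's NUMA placement system calls. The per-node slicing of the vector and the use of numa_tonode_memory are assumptions chosen for illustration; the lecture does not prescribe a particular interface.

    /* Minimal sketch: map each node's slice of a shared vector into that
     * node's local memory via libnuma (compile with -lnuma). */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma: NUMA not supported\n");
            return 1;
        }
        int nodes = numa_max_node() + 1;
        size_t n = 1 << 20;                  /* vector length (illustrative) */
        size_t bytes = n * sizeof(double);
        double *x = numa_alloc(bytes);       /* placed by the default policy */

        /* Ask the kernel to place each slice on the node whose threads
         * will access it; sizes are rounded to page boundaries. */
        size_t chunk = bytes / nodes;
        for (int node = 0; node < nodes; node++)
            numa_tonode_memory((char *)x + (size_t)node * chunk, chunk, node);

        /* ... threads pinned to node i now take local misses on slice i ... */
        numa_free(x, bytes);
        return 0;
    }

A first-touch kernel achieves a similar effect without the explicit calls: if each thread initializes the slice it will later use, the page fault on first touch maps that page to the faulting thread's node.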
  • Software prefetching
    • Even after rigorous analysis of a parallel application, it may not be possible to map all pages used by a node in its local memory: the same page may be used by multiple nodes at different times (example: matrix-vector multiplication)
    • Two options are available: dynamic page migration (very costly; coming next, stay tuned) or software prefetching
    • Today, most microprocessors support prefetch and prefetch-exclusive instructions: use prefetch to initiate a cache-line read miss long before the application actually accesses the line; use prefetch exclusive if you know for sure that you will write to the line and no one else will need it before you write
    • Prefetches must be used carefully
      • Early prefetches may evict useful cache blocks and may themselves be evicted before use; in multiprocessors they may also generate extra invalidations or interventions
      • Late prefetches may not be fully effective, but are at least less harmful than early prefetches
      • Wrong prefetches are the most dangerous: they bring in cache blocks that may not be used at all in the near future; in multiprocessors this can severely hurt performance by generating extra invalidations and interventions
      • Wrong prefetches also waste bandwidth and pollute the cache
    • Software prefetching usually offers better control than hardware prefetching (a sketch using compiler prefetch intrinsics follows)
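To make the prefetch / prefetch-exclusive distinction concrete, below is a short sketch using GCC's __builtin_prefetch intrinsic, which the compiler lowers to the target's prefetch instructions where they exist. The prefetch distance of 16 iterations is an assumed tuning parameter: too small and the prefetch is late, too large and it is early, risking the evictions and extra coherence traffic noted above.

    /* Sketch of software prefetching with GCC's __builtin_prefetch.
     * Second argument: 0 = read intent, 1 = write (exclusive) intent. */
    #include <stddef.h>

    #define PREFETCH_DIST 16   /* assumed distance, in loop iterations */

    double dot(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n) {
                __builtin_prefetch(&a[i + PREFETCH_DIST], 0); /* plain prefetch */
                __builtin_prefetch(&b[i + PREFETCH_DIST], 0);
            }
            sum += a[i] * b[i];
        }
        return sum;
    }

    void scale(double *x, size_t n, double k)
    {
        for (size_t i = 0; i < n; i++) {
            /* Write intent: the analogue of prefetch exclusive, bringing
             * the line in with ownership so the store finds it writable */
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&x[i + PREFETCH_DIST], 1);
            x[i] *= k;
        }
    }

In practice one prefetch per cache line (e.g., every eighth double) is enough; the per-iteration prefetches above are shown only to keep the sketch short.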
  • Software prefetching vs. hardware prefetching
    • Software prefetching requires close analysis of the program; profile information may help
    • Hardware prefetching tries to detect patterns in the accessed addresses and uses the detected patterns to predict future addresses (a simplified model follows this list)
      • AMD Athlon has a simple next-line prefetcher (works well for the sequential access patterns common in numerical applications)
      • Intel Pentium 4 has a very sophisticated stream prefetcher
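As a rough illustration of what such hardware does, the following is a simplified software model of a per-PC stride prefetcher; the table size, the repeat-once confidence rule, and the single-line prefetch degree are assumptions for illustration, and real designs such as the Pentium 4's stream prefetcher are considerably more elaborate.

    /* Simplified model of a stride-detecting hardware prefetcher; a real
     * one lives in the cache controller, not in software. */
    #include <stdint.h>

    #define TABLE_SIZE 64   /* assumed number of tracked PCs */

    typedef struct {
        uint64_t pc;         /* tag: address of the load/store instruction */
        uint64_t last_addr;  /* last data address seen from this PC */
        int64_t  stride;     /* last observed stride */
        int      confident;  /* set when the same stride repeats */
    } StrideEntry;

    static StrideEntry table[TABLE_SIZE];

    /* Called on each memory access; returns the address to prefetch,
     * or 0 when there is no confident prediction. */
    uint64_t observe_access(uint64_t pc, uint64_t addr)
    {
        StrideEntry *e = &table[pc % TABLE_SIZE];
        if (e->pc != pc) {               /* new PC: (re)allocate the entry */
            e->pc = pc;
            e->last_addr = addr;
            e->stride = 0;
            e->confident = 0;
            return 0;
        }
        int64_t stride = (int64_t)(addr - e->last_addr);
        e->confident = (stride != 0 && stride == e->stride);
        e->stride = stride;
        e->last_addr = addr;
        return e->confident ? addr + (uint64_t)stride : 0;
    }

A next-line prefetcher like the Athlon's is the degenerate case: it always predicts addr + line_size without tracking strides at all.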