Latency tolerance
- Page placement
- Application-directed page placement is used in many cases to minimize the number of remote misses
- The application passes the kernel (via a system call) the starting and ending addresses of a chunk of memory (a multiple of the page size) along with the number of the node on which to map those pages
- Thus an application writer can specify, based on the sharing pattern, which shared pages to map in a node's local memory (private pages and stack pages are mapped to local memory by default)
- The page fault handler of a NUMA kernel is normally also equipped with some default policies, e.g., round-robin mapping or first-touch mapping
- Examples: matrix-vector multiplication, matrix transpose
- Software prefetching
- Even after rigorous analysis of a parallel application, it may not be possible to map all pages used by a node in its local memory: the same page may be used by multiple nodes at different times (example: matrix-vector multiplication)
- Two options are available: dynamic page migration (very costly; coming next, stay tuned) or software prefetching
- Today, most microprocessors support prefetch and prefetch-exclusive instructions. Use prefetch to initiate a cache-line read miss long before the application actually accesses the line; use prefetch exclusive when you know for sure that you will write to the line and no one else will need it before you write
- Prefetches must be used carefully
- Early prefetches may evict useful cache blocks and may themselves be evicted before use; in multiprocessors they may also generate extra invalidations or interventions
- Late prefetches may not be fully effective, but at least less harmful than early prefetches
- Wrong prefetches are the most dangerous: these bring in cache blocks that may not be used at all in the near future; in multiprocessors this can severely hurt performance by generating extra invalidations and interventions
- Wrong prefetches waste bandwidth and pollute cache
- Software prefetching usually offers better control than hardware prefetching
- Software prefetching vs. hardware prefetching
- Software prefetching requires close analysis of program; profile information may help
- Hardware prefetching tries to detect patterns in the accessed addresses and uses the detected patterns to predict future addresses
- AMD Athlon has a simple next line prefetcher (works perfectly for most numerical applications)
- Intel Pentium 4 has a very sophisticated stream prefetcher