Module 14: "Directory-based Cache Coherence"
  Lecture 31: "Managing Directory Overhead"
 

Latency tolerance

  • Page placement
    • Application-directed page placement is often used to minimize the number of remote misses
    • The application provides the kernel (via a system call) with the starting and ending addresses of a chunk of memory (a multiple of the page size) and the number of the node where these pages should be mapped
    • Thus, based on the observed sharing pattern, an application writer can specify which shared pages to map in a node's local memory (private pages and stack pages are mapped to local memory by default)
    • The page fault handler of a NUMA kernel is normally equipped with some default policies, e.g., round-robin mapping or first-touch mapping
    • Examples: matrix-vector multiplication, matrix transpose (a sketch using Linux's libnuma follows this list)
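As an illustration of the mechanism described above, here is a minimal sketch of application-directed page placement on Linux using the libnuma library, a user-level wrapper over the kernel's NUMA placement system calls. The per-node slicing of the vector and the use of numa_tonode_memory are assumptions chosen for illustration; the lecture does not prescribe a particular interface.

    /* Minimal sketch: map each node's slice of a shared vector into that
     * node's local memory via libnuma (compile with -lnuma). */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma: NUMA not supported\n");
            return 1;
        }
        int nodes = numa_max_node() + 1;
        size_t n = 1 << 20;                  /* vector length (illustrative) */
        size_t bytes = n * sizeof(double);
        double *x = numa_alloc(bytes);       /* placed by the default policy */

        /* Ask the kernel to place each slice on the node whose threads
         * will access it; sizes are rounded to page boundaries. */
        size_t chunk = bytes / nodes;
        for (int node = 0; node < nodes; node++)
            numa_tonode_memory((char *)x + (size_t)node * chunk, chunk, node);

        /* ... threads pinned to node i now take local misses on slice i ... */
        numa_free(x, bytes);
        return 0;
    }

A first-touch kernel achieves a similar effect without the explicit calls: if each thread initializes the slice it will later use, the page fault on first touch maps that page to the faulting thread's node.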
  • Software prefetching
    • Even after rigorous analysis of a parallel application, it may not be possible to map all pages used by a node in its local memory: the same page may be used by multiple nodes at different times (example: matrix-vector multiplication)
    • Two options are available: dynamic page migration (very costly; coming next, stay tuned) or software prefetching
    • Today, most microprocessors support prefetch and prefetch-exclusive instructions: use prefetch to initiate a cache-line read miss long before the application actually accesses the line; use prefetch exclusive if you know for sure that you will write to the line and no one else will need it before you write
    • Prefetches must be used carefully
      • Early prefetches may evict useful cache blocks and may themselves be evicted before use; in multiprocessors they may also generate extra invalidations or interventions
      • Late prefetches may not be fully effective, but are at least less harmful than early prefetches
      • Wrong prefetches are the most dangerous: they bring in cache blocks that may not be used at all in the near future; in multiprocessors this can severely hurt performance by generating extra invalidations and interventions
      • Wrong prefetches also waste bandwidth and pollute the cache
    • Software prefetching usually offers better control than hardware prefetching (a sketch using compiler prefetch intrinsics follows)
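To make the prefetch / prefetch-exclusive distinction concrete, below is a short sketch using GCC's __builtin_prefetch intrinsic, which the compiler lowers to the target's prefetch instructions where they exist. The prefetch distance of 16 iterations is an assumed tuning parameter: too small and the prefetch is late, too large and it is early, risking the evictions and extra coherence traffic noted above.

    /* Sketch of software prefetching with GCC's __builtin_prefetch.
     * Second argument: 0 = read intent, 1 = write (exclusive) intent. */
    #include <stddef.h>

    #define PREFETCH_DIST 16   /* assumed distance, in loop iterations */

    double dot(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n) {
                __builtin_prefetch(&a[i + PREFETCH_DIST], 0); /* plain prefetch */
                __builtin_prefetch(&b[i + PREFETCH_DIST], 0);
            }
            sum += a[i] * b[i];
        }
        return sum;
    }

    void scale(double *x, size_t n, double k)
    {
        for (size_t i = 0; i < n; i++) {
            /* Write intent: the analogue of prefetch exclusive, bringing
             * the line in with ownership so the store finds it writable */
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&x[i + PREFETCH_DIST], 1);
            x[i] *= k;
        }
    }

In practice one prefetch per cache line (e.g., every eighth double) is enough; the per-iteration prefetches above are shown only to keep the sketch short.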
  • Software prefetching vs. hardware prefetching
    • Software prefetching requires close analysis of the program; profile information may help
    • Hardware prefetching tries to detect patterns in the accessed addresses and uses the detected patterns to predict future addresses (a simplified model follows this list)
      • AMD Athlon has a simple next-line prefetcher (works well for the sequential access patterns common in numerical applications)
      • Intel Pentium 4 has a very sophisticated stream prefetcher
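As a rough illustration of what such hardware does, the following is a simplified software model of a per-PC stride prefetcher; the table size, the repeat-once confidence rule, and the single-line prefetch degree are assumptions for illustration, and real designs such as the Pentium 4's stream prefetcher are considerably more elaborate.

    /* Simplified model of a stride-detecting hardware prefetcher; a real
     * one lives in the cache controller, not in software. */
    #include <stdint.h>

    #define TABLE_SIZE 64   /* assumed number of tracked PCs */

    typedef struct {
        uint64_t pc;         /* tag: address of the load/store instruction */
        uint64_t last_addr;  /* last data address seen from this PC */
        int64_t  stride;     /* last observed stride */
        int      confident;  /* set when the same stride repeats */
    } StrideEntry;

    static StrideEntry table[TABLE_SIZE];

    /* Called on each memory access; returns the address to prefetch,
     * or 0 when there is no confident prediction. */
    uint64_t observe_access(uint64_t pc, uint64_t addr)
    {
        StrideEntry *e = &table[pc % TABLE_SIZE];
        if (e->pc != pc) {               /* new PC: (re)allocate the entry */
            e->pc = pc;
            e->last_addr = addr;
            e->stride = 0;
            e->confident = 0;
            return 0;
        }
        int64_t stride = (int64_t)(addr - e->last_addr);
        e->confident = (stride != 0 && stride == e->stride);
        e->stride = stride;
        e->last_addr = addr;
        return e->confident ? addr + (uint64_t)stride : 0;
    }

A next-line prefetcher like the Athlon's is the degenerate case: it always predicts addr + line_size without tracking strides at all.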