NumaMMA

Selva, Manuel; Morel, Lionel; Marquet, Kevin

doi:10.1145/3225058.3225094

Cited by 19 publications

(6 citation statements)

References 29 publications


“…The mappings that require detailed profiler/programmer support include: locality (each page is allocated in the node of the cores that will access the page the most), and balance (pages are spread across the nodes in such a way that the total amount of memory accesses to each node is approximately the same). These mappings require profiling the application's access pattern and implicitly assume that the patterns are reasonably stable across different runs and inputs, which has been shown to be a fair assumption for these benchmarks [34,40].…”

Section: Page

mentioning

confidence: 99%
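The two profile-driven mappings named in the statement above can be illustrated with a minimal Python sketch. It is not the citing paper's implementation. The input layout (`page_accesses`, a dict mapping each page to its per-node access counts), the function names, and the greedy heaviest-page-first heuristic used for balancing are all assumptions made for illustration.

```python
def locality_mapping(page_accesses):
    """Locality: place each page on the node that accesses it the most.

    page_accesses is a hypothetical profile: {page: {node: access_count}}.
    """
    return {page: max(counts, key=counts.get)
            for page, counts in page_accesses.items()}


def balance_mapping(page_accesses, num_nodes):
    """Balance: spread pages so per-node access totals are roughly equal.

    Assumed heuristic: visit pages from most to least accessed and put each
    one on the node with the lowest accumulated access count so far.
    """
    totals = {page: sum(counts.values())
              for page, counts in page_accesses.items()}
    load = [0] * num_nodes
    placement = {}
    for page in sorted(totals, key=lambda p: (-totals[p], p)):
        node = min(range(num_nodes), key=lambda n: load[n])
        placement[page] = node
        load[node] += totals[page]
    return placement, load


# Hypothetical profile for four pages on a two-node machine.
profile = {
    0: {0: 10, 1: 2},
    1: {0: 1, 1: 7},
    2: {0: 4, 1: 2},
    3: {0: 0, 1: 8},
}
print(locality_mapping(profile))
print(balance_mapping(profile, num_nodes=2))
```

Both functions consume an access profile gathered ahead of time, which is why the quoted statement notes that these mappings assume access patterns stay reasonably stable across runs and inputs.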

“…Codelets have been shown to be quite accurate for both microarchitectural evaluation [12] and NUMA configuration studies [35]. This is because parallel regions typically exhibit similar behavior [40]. For our fork-join applications, we extract codelets for instances of each important OpenMP parallel region.…”

Section: Faster Evaluation: Sampling With Codelets

mentioning

confidence: 99%

“…With the model prediction, we migrate pages and threads as needed (Migration), and then execute with the chosen configuration (Optimized execution). This approach assumes that parallel regions have similar behavior, which is quite common in our benchmark applications [34,40].…”

Section: Optimizing Applications Online 61 Online Profiling and Opti…

mentioning

confidence: 99%

“…However, this necessitates very low overhead profiling to avoid outweighing the optimization gains. Offline [2,15,40] methods avoid the need for low-overhead profiling, but require an additional execution and can be sensitive to changes in input data.…”

Section: Related Work

mentioning

confidence: 99%

“…The overhead of collecting this information can be reduced to 12% [40], but in this work we use Pin directly, which is around 10× slower.…”

mentioning

confidence: 99%


Modeling and optimizing NUMA effects and prefetching with machine learning

Barrera, Black-Schaffer, Casas, et al. 2020

Proceedings of the 34th ACM International Conference on Supercomputing


Both NUMA thread/data placement and hardware prefetcher configuration have significant impacts on HPC performance. Optimizing both together leads to a large and complex design space that has previously been impractical to explore at runtime. In this work we deliver the performance benefits of optimizing both NUMA thread/data placement and prefetcher configuration at runtime through careful modeling and online profiling. To address the large design space, we propose a prediction model that reduces the amount of input information needed and the complexity of the prediction required. We do so by selecting a subset of performance counters and application configurations that provide the richest profile information as inputs, and by limiting the output predictions to a subset of configurations that cover most of the performance. Our model is robust and can choose near-optimal NUMA+Prefetcher configurations for applications from only two profile runs. We further demonstrate how to profile online with low overhead, resulting in a technique that delivers an average of 1.68× performance improvement over a locality-optimized NUMA baseline with all prefetchers enabled.

CCS Concepts: • Computer systems organization → Multicore architectures; • General and reference → Performance; Measurement; • Software and its engineering → Memory management; • Computing methodologies → Cluster analysis; Supervised learning; Cross-validation; Model verification and validation; Model development and analysis.
