2014
DOI: 10.1142/s0129626414500066
NUMA-Aware Multicore Matrix Multiplication

Abstract: A user-level scheduling scheme, together with a specific data alignment, for matrix multiplication on cache-coherent Non-Uniform Memory Access (ccNUMA) architectures is presented. Addressing the data locality problem that can arise in such systems potentially alleviates memory bottlenecks. We show experimentally that a thread scheduler that is agnostic to data placement (e.g., OpenMP 3.1) produces a high number of cache misses on a ccNUMA machine. To overcome this memory contention problem, we show how proper memory mapp…

Cited by 3 publications (4 citation statements)
References 8 publications
“…For NPB, the largest executable input sets, "C" or "D", are used. For dgemm, two 1.6K × 1.6K matrices with random values are multiplied to fully exercise the memory system [28].…”
Section: Methods
confidence: 99%
“…In particular, Lepers et al improved memory page migration algorithms to consider memory asymmetry [18]. There is also research that investigated data placement on NUMA machines [28,52,53,54]. These techniques aim to reduce memory latency rather than reducing bandwidth usage.…”
Section: Related Work
confidence: 99%
“…Su et al. [13] proposed a hybrid-grained dynamic load-balancing method that reduces this drawback of the NUMA effect via an improved work-stealing algorithm. Wail et al. [14] proposed a simple user-level thread scheduling and a specific data alignment method on the ccNUMA architecture to solve memory bottlenecks in problems with large input datasets. Smith et al. [12] added parallelism to the BLIS framework for matrix multiplication, which appears to support high performance.…”
Section: Related Work
confidence: 99%
“…Su et al. [13] proposed a hybrid-grained dynamic load-balancing method to reduce this drawback of the NUMA effect by allowing fast threads to steal work from slow ones. Wail et al. [14] proposed a novel user-level scheduling and a specific data alignment method on the NUMA architecture to solve the data locality problem in such systems and alleviate memory bottlenecks in problems with large input datasets. Although we have common goals, our implementation methods and platforms are very different.…”
Section: Introduction
confidence: 99%