2014
DOI: 10.1142/s0129626414500066
NUMA-Aware Multicore Matrix Multiplication

Abstract: A user-level scheduling scheme, together with a specific data alignment, for matrix multiplication on cache-coherent Non-Uniform Memory Access (ccNUMA) architectures is presented. Addressing the data locality problem that can arise in such systems potentially alleviates memory bottlenecks. We show experimentally that a thread scheduler that is agnostic to data placement (e.g., OpenMP 3.1) produces a high number of cache misses on a ccNUMA machine. To overcome this memory contention problem, we show how proper memory mapp…

Cited by 3 publications (4 citation statements)
References 8 publications
“…For NPB, the largest executable input sets, "C" or "D", are used. For dgemm, two 1.6K × 1.6K matrices with random values are multiplied to fully exercise the memory system [28].…”
Section: Methods
confidence: 99%
“…In particular, Lepers et al improved memory page migration algorithms to consider memory asymmetry [18]. There is also research that investigated data placement on NUMA machines [28,52,53,54]. These techniques aim to reduce memory latency rather than reducing bandwidth usage.…”
Section: Related Work
confidence: 99%
“…Su et al. [13] proposed a hybrid-grained dynamic load-balancing method that reduces this drawback of the NUMA effect via an improved work-stealing algorithm. Wail et al. [14] proposed a simple user-level thread scheduling and a specific data alignment method on the ccNUMA architecture to solve memory bottlenecks in problems with large input datasets. Smith et al. [12] added parallelism to the BLIS framework for matrix multiplication, which appears to support high performance.…”
Section: Related Work
confidence: 99%
“…Su et al. [13] proposed a hybrid-grained dynamic load-balancing method to reduce this drawback of the NUMA effect by allowing fast threads to steal work from slow ones. Wail et al. [14] proposed a novel user-level scheduling and a specific data alignment method on the NUMA architecture to solve the data locality problem in such systems and alleviate memory bottlenecks in problems with large input datasets. Although we have common goals, our implementation methods and platforms are very different.…”
Section: Introduction
confidence: 99%