Proceedings of the Tenth International Symposium on Code Generation and Optimization 2012
DOI: 10.1145/2259016.2259046
Matching memory access patterns and data placement for NUMA systems

Abstract: Many recent multicore multiprocessors are based on a non-uniform memory architecture (NUMA). A mismatch between the data access patterns of programs and the mapping of data to memory incurs a high overhead, as remote accesses have higher latency and lower throughput than local accesses. This paper reports on a limit study showing that many scientific loop-parallel programs include multiple, mutually incompatible data access patterns; as a result, these programs encounter a high fraction of costly remote memory …
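To make the mismatch described in the abstract concrete, the sketch below (not taken from the paper) assumes Linux's default first-touch page placement and OpenMP: a row-wise parallel initialization places each page on the NUMA node of the thread that first touches it, so a later column-wise parallel traversal of the same matrix accesses mostly remote pages. All names and sizes are illustrative.

/* Minimal sketch (not from the paper): two loop-parallel phases with
 * incompatible access patterns under Linux's default first-touch policy.
 * Compile with, e.g.: gcc -O2 -fopenmp sketch.c -o sketch */
#include <stdlib.h>
#include <omp.h>

#define N 4096

int main(void) {
    double *a = malloc((size_t)N * N * sizeof *a);
    if (!a) return 1;

    /* Phase 1: row-wise parallel initialization. With first-touch, each page
     * becomes local to the node of the thread that touches it, so this
     * loop's accesses are mostly local. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[(size_t)i * N + j] = 1.0;

    /* Phase 2: column-wise parallel traversal. The same static schedule now
     * partitions work by column, so each thread touches pages placed by many
     * different threads in phase 1 -- a mismatched, largely remote pattern. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[(size_t)i * N + j];

    free(a);
    return (int)sum;  /* keep the result observable */
}

Reversing the placement (initializing column-wise) would simply flip which loop pays the remote-access penalty; these are the mutually incompatible access patterns the limit study quantifies.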

Cited by 51 publications (25 citation statements) · References 24 publications
“…Zhang et al. [40] transform programs in a cache-sharing-aware manner, and Majo et al. [28] aim to preserve program data locality on NUMA systems. Such transformations can further be made transparent to users and automated at the compiler level [39].…”
Section: Program and System-Level Optimization
Citation type: mentioning; confidence: 99%
“…Sub-optimal and unpredictable program performance due to shared on-chip resources remains a top concern, as it seriously compromises the efficiency, fairness, and Quality-of-Service (QoS) that the platform is capable of providing [41]. Existing work focuses on hardware techniques [32] and program transformations [28,39,40] to mitigate the problem. Thread scheduling, a more flexible approach, has also been studied to avoid the destructive use of shared resources [7,8,11,14,30] or to use them constructively [5,35,38].…”
Section: Introduction
Citation type: mentioning; confidence: 99%
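The scheduling work cited above operates at the policy level; as one low-level mechanism such policies rely on, the following sketch pins each worker thread to a fixed CPU using the Linux-specific pthread_attr_setaffinity_np extension. The thread-to-CPU mapping is an illustrative assumption, not the approach of any of the cited papers.

/* Minimal sketch: pinning worker threads to distinct CPUs on Linux.
 * Compile with, e.g.: gcc -O2 -pthread pin.c -o pin */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 4

static void *worker(void *arg) {
    long id = (long)arg;
    /* sched_getcpu() reports the CPU this thread is currently running on. */
    printf("worker %ld running on CPU %d\n", id, sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++) {
        /* Pin thread i to CPU i before it starts; the mapping is illustrative.
         * A real scheduler would consult the machine topology first. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof set, &set);

        pthread_create(&tid[i], &attr, worker, (void *)i);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

Combined with matching data placement (via first-touch or explicit binding), pinning keeps a thread's accesses on its local node for the thread's lifetime.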
“…NUMA provides a much larger capacity of shared memory, or even shared cache, but remote memory accesses suffer from much longer latency and lower bandwidth than local memory accesses, which, if not handled properly, may lead to a serious performance reduction [10, 28-30, 32].…”
Section: NUMA Architecture and Its Influence on BFS
Citation type: mentioning; confidence: 99%
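The local-versus-remote distinction this citation refers to can be made explicit with Linux's libnuma API. The sketch below is a minimal illustration, assuming libnuma is available (link with -lnuma) and the machine has at least two NUMA nodes; the node numbers and buffer size are arbitrary assumptions.

/* Minimal sketch: explicit data placement with Linux's libnuma.
 * Compile with, e.g.: gcc -O2 placement.c -lnuma -o placement */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    size_t bytes = 64UL << 20;   /* 64 MiB, illustrative size */

    /* Local placement: pages come from the node of the calling thread,
     * so subsequent accesses by this thread are local. */
    double *local_buf = numa_alloc_local(bytes);

    /* Remote placement: pages are forced onto node 1 (assumes >= 2 nodes);
     * accesses from a thread on node 0 then cross the interconnect and see
     * the higher latency / lower bandwidth the citation describes. */
    double *remote_buf = numa_alloc_onnode(bytes, 1);

    if (!local_buf || !remote_buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* Touch both buffers so the pages are actually materialized. */
    for (size_t i = 0; i < bytes / sizeof(double); i++) {
        local_buf[i]  = 1.0;
        remote_buf[i] = 2.0;
    }

    numa_free(local_buf, bytes);
    numa_free(remote_buf, bytes);
    return 0;
}

Binding the accessing threads to node 0 (for example with numa_run_on_node) would make the latency and bandwidth gap between the two buffers directly measurable.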
“…Though NUMA can provide a much more powerful node with more than one processor (more cores and memory), it can be a significant problem for applications due to congestion on cross-chip interconnects and the long latency and potential bandwidth saturation of remote memory accesses [10, 28-30, 32]. For example, it is shown in [32] that the performance of a multi-threaded program is almost the same when running on 4 cores (1 socket) as on 8 cores (2 sockets).…”
Section: Introduction
Citation type: mentioning; confidence: 99%
“…If their specific hardware characteristics are not taken into consideration, contention for the shared resources might cause significant performance degradation [7], [8]. Efficient utilization of these machines has been a very active field of research [9], [10], [11], [12]. However, to the best of our knowledge, there has been no research on the adaptation of actor-model REs for these platforms.…”
Section: Introduction
Citation type: mentioning; confidence: 99%