Matching memory access patterns and data placement for NUMA systems

Majó, Zoltán; Groß, Thomas

doi:10.1145/2259016.2259046

Cited by 51 publications

(25 citation statements)

References 24 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Zhang et al [40] transform programs in a cache-sharing-aware manner and Majo et al [28] aim to preserve program data locality on NUMA systems. Such transformations can be further made transparent to users and automated at compiler level [39].…”

Section: Program and System-level Optimizationmentioning

confidence: 99%

“…Sub-optimal and unpredictable program performance due to shared on-chip resources remains top concern as it seriously compromises the efficiency, fairness, and Quality-of-Service (QoS) that the platform is capable to provide [41]. There are existing work focusing on hardware techniques [32] and program transformations [28,39,40] to mitigate the problem. Thread scheduling, a more flexible approach, has been also studied to avoid the destructive use of shared resources [7,8,11,14,30] or to use them constructively [5,35,38].…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Optimizing virtual machine scheduling in NUMA multicore systems

Rao

Wang

Zhou

et al. 2013

2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)

View full text Add to dashboard Cite

An increasing number of new multicore systems use the Non-Uniform Memory Access architecture due to its scalable memory performance. However, the complex interplay among data locality, contention on shared on-chip memory resources, and cross-node data sharing overhead, makes the delivery of an optimal and predictable program performance difficult. Virtualization further complicates the scheduling problem. Due to abstract and inaccurate mappings from virtual hardware to machine hardware, program and system-level optimizations are often not effective within virtual machines.We find that the penalty to access the "uncore" memory subsystem is an effective metric to predict program performance in NUMA multicore systems. Based on this metric, we add NUMA awareness to the virtual machine scheduling. We propose a Bias Random vCPU Migration (BRM) algorithm that dynamically migrates vCPUs to minimize the system-wide uncore penalty. We have implemented the scheme in the Xen virtual machine monitor. Experiment results on a two-way Intel NUMA multicore system with various workloads show that BRM is able to improve application performance by up to 31.7% compared with the default Xen credit scheduler. Moreover, BRM achieves predictable performance with, on average, no more than 2% runtime variations.

show abstract

Section: Program and System-level Optimizationmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Optimizing virtual machine scheduling in NUMA multicore systems

Rao

Wang

Zhou

et al. 2013

2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)

View full text Add to dashboard Cite

show abstract

“…NUMA has provided much larger capacity of shared memory or even shared cache, but remote memory accesses suffer much longer latency and lower bandwidth than local memory accesses, which, if not properly handled, may lead to serious performance reduction [10,[28][29][30]32].…”

Section: Numa Architecture and Its Influence On Bfsmentioning

confidence: 99%

“…Though NUMA can provide a much powerful node with more than one processor -more cores and memory, it can be a significant problem for applications due to the congestion on cross-chip interconnects, long latency and potential bandwidth saturation of remote memory accesses [10,[28][29][30]32]. For example, it is shown in [32] that the performance of a multi-threaded program when running on 4 cores (1 socket) and 8 cores (2 sockets) are almost the same.…”

Section: Introductionmentioning

confidence: 99%

Evaluation and Optimization of Breadth-First Search on NUMA Cluster

Cui

Chen

et al. 2012

2012 IEEE International Conference on Cluster Computing

View full text Add to dashboard Cite

Graph is widely used in many areas. Breadth-First Search (BFS), a key subroutine for many graph analysis algorithms, has become the primary benchmark for Graph500 ranking. Due to the high communication cost of BFS, multisocket nodes with large memory capacity (NUMA) are supposed to reduce network pressure. However, the longer latency to remote memory may cause problem if not treated well. In this work, we first demonstrate that simply spawning and binding one MPI process for each socket can achieve the best performance for MPI/OpenMP hybrid programmed BFS algorithm, resulting in 1.53X of performance on 16 nodes. Nevertheless, we notice that one MPI process per socket may exacerbate the communication cost. We propose to share some communication data structure among the processes inside the same node, to eliminate most of the intra-node communication.To fully utilize the network bandwidth, we make all the processes in a node to perform communication simultaneously. We further adjust the granularity of a key bitmap for better cache locality to speed up the computation. With all the optimizations for NUMA, communication and computation together, 2.44X of performance is achieved on 16 nodes, which is 39.2 Billion Traversed Edges per Second for an R-MAT graph of scale 32 (4 billion vertices and 64 billion edges).

show abstract

“…If their specific hardware characteristics are not taken into consideration, concurrency for the shared resources might cause important performance degradation [7], [8]. Efficient utilization of these machines has been a very active field of research [9], [10], [11], [12]. However, to the best of our knowledge, there has been no research on the adaptation of actor model REs for these platforms.…”

Section: Introductionmentioning

confidence: 99%

A NUMA-Aware Runtime Environment for the Actor Model

Francesquini¹,

Goldman

Méhaut³

2013

2013 42nd International Conference on Parallel Processing

View full text Add to dashboard Cite

The actor model is present in several missioncritical systems, such as those supporting WhatsApp and Twitter. These systems serve thousands of clients simultaneously, therefore demanding substantial computing resources usually provided by multiprocessor and multicore platforms. Non-Uniform Memory Access (NUMA) architectures account for an important share of these platforms. Yet, little or no research has been done on the suitability of the current actor runtime environments for these machines. Current runtime environments assume a flat memory space, thus not performing as well as they could. The NUMA environment presents challenges to the actor model runtime environment in fields varying from memory management to scheduling and loadbalancing. In this document we analyze and characterize actor based applications to, in light of the above, propose improvements to actor runtime environments. As a proof of concept, we have applied our ideas in a real actor runtime environment, the Erlang virtual machine. This modified virtual machine uses the NUMA characteristics and the application knowledge to take better memory management, scheduling and load-balancing decisions. We have evaluated this modified runtime environment using standard benchmarks and, taking the default virtual machine as a baseline, we improved the performance of the tested applications by a factor of 2.50 on the best case while limiting our slowdown on the worst case by a factor of 1.09.

show abstract

Matching memory access patterns and data placement for NUMA systems

Cited by 51 publications

References 24 publications

Optimizing virtual machine scheduling in NUMA multicore systems

Optimizing virtual machine scheduling in NUMA multicore systems

Evaluation and Optimization of Breadth-First Search on NUMA Cluster

A NUMA-Aware Runtime Environment for the Actor Model

Contact Info

Product

Resources

About