Proceedings of the 4th Annual International Conference on Systems and Storage 2011
DOI: 10.1145/1987816.1987832
Memory system performance in a NUMA multicore multiprocessor

Abstract: Modern multicore processors with an on-chip memory controller form the base for NUMA (non-uniform memory architecture) multiprocessors. Each processor accesses part of the physical memory directly and has access to the other parts via the memory controller of other processors. These other processors are reached via the cross-processor interconnect. As a consequence a processor's memory controller must satisfy two kinds of requests: those that are generated by the local cores and those that arrive via the inter…
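The local/remote distinction described in the abstract can be made concrete with a small probe. The sketch below is not from the paper; it assumes a Linux machine with libnuma, and the node numbers, buffer size, and stride are illustrative. It times a walk over a buffer bound to the probe's own node and over a buffer bound to another node.

/* Minimal sketch (assumption: Linux with libnuma, not code from the paper):
 * contrasts local vs. remote memory access on a NUMA machine.
 * Build with: gcc -O2 numa_probe.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (256UL * 1024 * 1024)   /* 256 MiB working set, illustrative */

/* Walk the buffer one cache line at a time and return elapsed seconds;
 * a crude stand-in for a memory-bandwidth/latency probe. */
static double touch(char *buf, size_t size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < size; i += 64)
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int last = numa_max_node();

    /* Run the probe on node 0 ... */
    numa_run_on_node(0);

    /* ... but bind one buffer to node 0 (local) and one to the
     * highest-numbered node (remote, if the machine has more than one node). */
    char *local  = numa_alloc_onnode(BUF_SIZE, 0);
    char *remote = numa_alloc_onnode(BUF_SIZE, last);
    if (!local || !remote) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(local, 1, BUF_SIZE);    /* fault pages in before timing */
    memset(remote, 1, BUF_SIZE);

    printf("local  (node 0) : %.3f s\n", touch(local, BUF_SIZE));
    printf("remote (node %d): %.3f s\n", last, touch(remote, BUF_SIZE));

    numa_free(local, BUF_SIZE);
    numa_free(remote, BUF_SIZE);
    return 0;
}

On a typical two-socket system the remote walk takes noticeably longer; this asymmetry, and the load placed on the memory controller by both local and interconnect-forwarded requests, is what the paper's measurements examine.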

Cited by 70 publications (43 citation statements). References 25 publications.
“…In [15] the authors evaluate the memory performance of NUMA machines. One of the main findings is how guaranteeing data locality to sockets need not be optimal always, due to increased pressure on local memory bandwidth.…”
Section: Related Work (mentioning)
confidence: 99%
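The finding summarized in the snippet above, that forcing all data onto the local node can overload that node's memory controller, is commonly countered by interleaving pages across nodes. A minimal sketch, again assuming Linux with libnuma and an arbitrary buffer size, not code from the paper:

/* Sketch (assumption: Linux with libnuma): instead of binding a large buffer
 * to the local node, interleave its pages round-robin over all nodes so that
 * no single memory controller carries the whole load.
 * Build with: gcc -O2 interleave.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    size_t size = 1UL << 30;                    /* 1 GiB, illustrative */
    char *buf = numa_alloc_interleaved(size);   /* pages spread over all nodes */
    if (!buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(buf, 0, size);   /* faulting the pages in places them per the interleave policy */
    printf("interleaved 1 GiB across %d nodes\n", numa_num_configured_nodes());
    numa_free(buf, size);
    return 0;
}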
“…Majo and Gross investigated the NUMA-memory contention problem and developed a model to characterize the sharing of local and remote memory bandwidth [15]. Fedorova et al designed a contention-aware algorithm Carrefour to manage memory traffic congestion in the Linux OS [11].…”
Section: Related Work (mentioning)
confidence: 99%
“…Recent work shows that contentions on the hardware prefetcher [25], the memory controller [27,30] and the DRAM bus [11] can also cause significant performance slowdown in both UMA and NUMA systems. Last-level cache miss rate has been widely used as a proxy for the contention on shared resources [7,8,9,14,26] and the similarity in thread address spaces has been used to quantify the inter-thread sharing activity [5,35,38].…”
Section: Optimization Via Scheduling (mentioning)
confidence: 99%
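The last snippet above notes that last-level cache miss rate is widely used as a proxy for contention on shared resources. As an illustration only (not code from any of the cited works), the sketch below counts LLC read misses for the calling process through the Linux perf_event_open interface; the arithmetic loop merely stands in for a real workload.

/* Sketch (assumption: Linux with perf events enabled): count last-level
 * cache read misses for the current process over a region of interest. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    /* last-level cache, read accesses, misses */
    attr.config = PERF_COUNT_HW_CACHE_LL |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* region of interest: stand-in for the workload being profiled */
    volatile long sum = 0;
    for (long i = 0; i < 100000000L; i++) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) != sizeof(misses)) {
        perror("read");
        return 1;
    }
    printf("LLC read misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}

Dividing the counted misses by elapsed time (or by instructions retired) yields the miss rate that the contention-aware scheduling work cited above uses as its signal.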