1996
DOI: 10.1006/jpdc.1996.0112

The Performance Implications of Locality Information Usage in Shared-Memory Multiprocessors

Cited by 29 publications (16 citation statements)
References 12 publications
“…Memory affinity is the guarantee that memory access costs are reduced by either latency optimization or bandwidth increasing [1], [2]. In the last two decades, many researches have been carried out in the context of memory affinity, resulting in several proposals.…”
Section: Introduction
confidence: 99%
“…Thus, it is imperative to carefully consider which parts of the shared data should be attributed to which physical memory bank based on the data access pattern or on other considerations. Such an attribution of data to physical main memory is often called memory affinity Bellosa and Steckermeier (1996); Kleen (2005). This notion goes hand in hand with the CPU affinity, as noted in Grant and Afsahi (2007), such that the threads are being bound to specific cores for the application start and their context switches are disabled.…”
Section: Results
confidence: 99%
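The excerpt above pairs memory affinity with CPU affinity: threads are pinned to specific cores so that data placement relative to those cores stays predictable. As a minimal illustration (not taken from the cited work), the thread-to-core binding half of this can be sketched with the Linux-only `os.sched_setaffinity` call from the Python standard library; the choice of core 0 here is arbitrary:

```python
import os

# Pin the current process to CPU 0 (CPU affinity). On a NUMA system,
# binding to a core helps keep the process's memory accesses on the
# memory node local to that core.
# Note: os.sched_setaffinity is Linux-only (absent on macOS/Windows).
original_mask = os.sched_getaffinity(0)  # set of CPUs we may currently run on

os.sched_setaffinity(0, {0})             # bind this process to core 0
print(os.sched_getaffinity(0))           # the mask is now just {0}

os.sched_setaffinity(0, original_mask)   # restore the original mask
```

Binding the *memory* side (attributing data to a specific physical memory bank) additionally requires a NUMA allocation policy, e.g. via `libnuma`/`numactl`, which is outside the standard library.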
“…Recent work shows that contentions on the hardware prefetcher [25], the memory controller [27,30] and the DRAM bus [11] can also cause significant performance slowdown in both UMA and NUMA systems. Last-level cache miss rate has been widely used as a proxy for the contention on shared resources [7,8,9,14,26] and the similarity in thread address spaces has been used to quantify the inter-thread sharing activity [5,35,38].…”
Section: Optimization Via Scheduling
confidence: 99%
“…There are existing work focusing on hardware techniques [32] and program transformations [28,39,40] to mitigate the problem. Thread scheduling, a more flexible approach, has been also studied to avoid the destructive use of shared resources [7,8,11,14,30] or to use them constructively [5,35,38].…”
Section: Introduction
confidence: 99%