Memory Affinity for Hierarchical Shared Memory Multiprocessors

Ribeiro, Christiane Pousa; Méhaut, Jean‐François; Carissimi, Alícia; Castro, Márcio; Fernandes, Luiz Gustavo

doi:10.1109/sbac-pad.2009.16

Cited by 45 publications

(39 citation statements)

References 8 publications

“…This organization, which can be easily implemented with specialized memory allocator routines (libnuma library for instance) still contains remote memory accesses which might become really hindering with larger images. To solve this, we chose to implement a block-cyclic allocation [43] using a memory binding routine, which can perform a remapping of a given memory chunk onto a specific node. The core of the corresponding C is provided in figure 15.…”

Section: Numa-aware Adaptation (mentioning)

confidence: 99%

Harris corner detection on a NUMA manycore

Haggui, Tadonki, Lacassagne, et al. 2018

Future Generation Computer Systems

Corner detection is a key kernel for many image processing procedures including pattern recognition and motion detection. The latter, for instance, mainly relies on the corner points for which spatial analyses are performed, typically on (probably live) videos or temporal flows of images. Thus, highly efficient corner detection is essential to meet the real-time requirement of associated applications. In this paper, we consider the corner detection algorithm proposed by Harris, whose main workflow is a composition of basic operators represented by their approximations using 3 × 3 matrices. The corresponding data access patterns follow a stencil model, which is known to require careful memory organization and management. Cache misses and other hindering factors on NUMA architectures need to be skillfully addressed in order to reach an efficient scalable implementation. In addition, with increasingly wide vector registers, an efficient SIMD version should be designed and explicitly implemented. In this paper, we study a direct and explicit implementation of common and novel optimization strategies, and provide a NUMA-aware parallelization. Experimental results on a dual-socket Intel Broadwell-E/EP show noticeably good scalability.
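The block-cyclic allocation mentioned in the citation statement above can be illustrated with a small sketch. It only computes which NUMA node each page of a buffer would be assigned to; it does not bind any memory. The function name `block_cyclic_nodes` and its parameters are hypothetical and do not come from the cited paper, which does not name its binding routine in the quoted text. In a C program, each block might then be bound with a routine such as libnuma's `numa_tonode_memory`, but that pairing is an assumption here.

```python
def block_cyclic_nodes(num_pages, block_pages, num_nodes):
    """Return, for each page index, the NUMA node chosen by a block-cyclic layout.

    Pages are grouped into consecutive blocks of `block_pages` pages, and
    the blocks are dealt out to the nodes in round-robin order.
    All names here are illustrative, not taken from the cited paper.
    """
    if block_pages <= 0 or num_nodes <= 0:
        raise ValueError("block_pages and num_nodes must be positive")
    return [(page // block_pages) % num_nodes for page in range(num_pages)]


if __name__ == "__main__":
    # 8 pages, blocks of 2 pages, 2 nodes
    print(block_cyclic_nodes(8, 2, 2))  # prints [0, 0, 1, 1, 0, 0, 1, 1]
```

With a block size of one page this reduces to a plain cyclic (interleaved) layout; larger blocks keep neighbouring pages on the same node, which may suit stencil-style accesses better.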

“…Memphis [11] evaluated its effectiveness by applying the NPB (NAS Parallel Benchmarks), HYCOM (a production ocean modeling application), XGC1 (a production Fortran90 particle-in-cell code that models several aspects of plasmas in a tokamak thermonuclear fusion reactor) and CAM (the Community Atmosphere Model). MAi [7] used two kernels (FFT and CG) from NPB and ICTM [15]. SPLASH2, PARSEC and Advention (a part of the Brazilian Regional Atmosphere Modeling System) were used in [13].…”

Section: Related Work (mentioning)

confidence: 99%

“…It means that multithreaded codes on a NUMA platform should sustain sufficient locality of memory access and minimize access to remote data to obtain high performance. The importance of data locality is well documented [1][2][3][4] and there are some OS-provided NUMA APIs to control it [5][6][7][8]. Linux traditionally had ways to bind threads to specific CPUs/cores, and the NUMA API extends that to allow programs to specify on which node memory should be allocated.…”

Section: Introduction (mentioning)

confidence: 99%

“…Linux traditionally had ways to bind threads to specific CPUs/cores, and the NUMA API extends that to allow programs to specify on which node memory should be allocated. Some more complicated APIs are based on these basic policies, such as MAi [7] and MaMI [9]. It is not easy to apply these APIs, because the communication pattern is much harder to find on a shared-memory platform than on a message-passing platform: it is implicit and occurs through memory accesses. Recently, some tools have become available to guide a program developer on where to judiciously apply these APIs within a large parallel code [10][11][12].…”

Section: Introduction (mentioning)

confidence: 99%

MAP-numa: Access Patterns Used to Characterize the NUMA Memory Access Optimization Techniques and Algorithms

Luo, Liu, Kong, et al. 2012

Lecture Notes in Computer Science

Abstract. Some typical memory access patterns are provided and programmed in C, which can be used as a benchmark to characterize the various techniques and algorithms that aim to improve the performance of NUMA memory access. These access patterns, called MAP-numa (Memory Access Patterns for NUMA), currently include three classes, whose working data sets correspond to a 1-dimensional array, a 2-dimensional matrix and a 3-dimensional cube. It is dedicated to NUMA memory access optimization rather than to measuring memory bandwidth and latency. MAP-numa is an alternative to existing benchmarks such as STREAM, pChase, etc. It is used to verify the capabilities of optimizations (made automatically or manually, to source code or executable binaries) by investigating what locality leakage can be remedied. Some experimental results are shown, which give an example of using MAP-numa to evaluate some optimizations based on OProfile sampling.
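The Introduction statements above note that Linux lets a program bind threads to specific CPUs. A minimal sketch of that first half, using only the Python standard library, is given here. The helper names `first_allowed_cpu` and `run_pinned` are hypothetical. The memory-placement half (choosing the node on which memory is allocated) is not covered: it relies on interfaces such as libnuma or the `set_mempolicy` and `mbind` system calls, which the standard library does not expose.

```python
import os


def first_allowed_cpu():
    """Return the lowest CPU id the caller may run on (0 if unknown)."""
    if hasattr(os, "sched_getaffinity"):
        return min(os.sched_getaffinity(0))
    return 0  # platforms without the affinity API


def run_pinned(cpu, fn):
    """Run fn() with the caller restricted to one CPU, then restore the old mask.

    On platforms without os.sched_setaffinity (non-Linux), fn() simply
    runs unpinned. Both helper names are illustrative.
    """
    if not hasattr(os, "sched_setaffinity"):
        return fn()
    previous = os.sched_getaffinity(0)   # remember the current CPU set
    os.sched_setaffinity(0, {cpu})       # restrict to a single CPU
    try:
        return fn()
    finally:
        os.sched_setaffinity(0, previous)  # always restore the original set


if __name__ == "__main__":
    cpu = first_allowed_cpu()
    run_pinned(cpu, lambda: print("running on CPU", cpu))
```

Restoring the previous mask in a `finally` clause keeps the pinning local to the call, so the rest of the program is not silently confined to one core.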

“…Threads that access a large amount of shared data should be mapped to cores that are close to each other in the memory hierarchy, while data should be mapped to the same NUMA node that the threads that access it are executing on [22]. In this way, the locality of the memory accesses is improved, which leads to an increase of performance and energy efficiency.…”

Section: Introduction (mentioning)

confidence: 99%

A Sharing-Aware Memory Management Unit for Online Mapping in Multi-core Architectures

Cruz, Diener, Pilla, et al. 2016

Euro-Par 2016: Parallel Processing

Abstract. In modern shared-memory architectures, it is important to map threads and data in a way that increases the locality of their memory accesses, thereby improving performance and energy efficiency. Threads that access shared data should be mapped close to each other in the memory hierarchy, while the data they access should be mapped to their NUMA node, which is called sharing-aware mapping. In this paper, we propose SAMMU, which adds sharing-awareness to the memory management unit in current architectures. SAMMU analyzes the memory access behavior in hardware and provides information to the operating system so it can perform an online mapping of threads and data. In the evaluation with a wide range of parallel applications, performance was improved by up to 35.7% (13.1% on average).
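The data half of the sharing-aware mapping described above (place each page on the NUMA node of the threads that access it) can be sketched in software as follows. This is not the SAMMU hardware mechanism. It is a simple majority rule over access counts that are assumed to be already known, and the names `choose_page_nodes`, `access_counts` and `thread_node` are hypothetical.

```python
from collections import Counter


def choose_page_nodes(access_counts, thread_node):
    """Pick, for each page, the node whose threads access it most.

    access_counts: {page: {thread: number_of_accesses}}
    thread_node:   {thread: node the thread runs on}
    Ties go to the lowest node id; pages with no accesses are skipped.
    """
    placement = {}
    for page, per_thread in access_counts.items():
        if not per_thread:
            continue  # nothing known about this page
        per_node = Counter()
        for thread, count in per_thread.items():
            per_node[thread_node[thread]] += count
        # Highest access count first, then lowest node id as tie-breaker.
        placement[page] = min(per_node, key=lambda node: (-per_node[node], node))
    return placement


if __name__ == "__main__":
    accesses = {0: {"t0": 10, "t1": 2}, 1: {"t0": 1, "t2": 5, "t3": 5}}
    nodes = {"t0": 0, "t1": 0, "t2": 1, "t3": 1}
    print(choose_page_nodes(accesses, nodes))  # prints {0: 0, 1: 1}
```

In the example, page 0 is accessed only by threads on node 0, while page 1 receives 10 of its 11 accesses from node 1, so each page is placed on the node that dominates its accesses.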
