Application Performance Tuning for Clusters with ccNUMA Nodes

Kayi, Abdullah; Kornkven, E.; El‐Ghazawi, Tarek; Newby, Gregory B.

doi:10.1109/cse.2008.46

Cited by 9 publications

(4 citation statements)

References 11 publications


“…Therefore, all benchmarks have been executed bound to a subset of the available CPU sockets, utilizing all cores on these sockets to simulate systems with different numbers of sockets. For LAMA we have done this with numactl and for PETSc we have bound the mpd daemon of MPICH to a socket with taskset to enforce the PETSc processes to only utilize these specific sockets [15,16]. When using taskset, it has to be taken into account that the numbering of CPU cores is not always as expected [20].…”

Section: Execution (mentioning)

confidence: 99%

Scalable parallel AMG on ccNUMA machines with OpenMP

Förster

Kraus

2011

Comput Sci Res Dev


In many numerical simulation codes the backbone of the application is the solution of linear systems of equations. Often created via a discretization of differential equations, the corresponding matrices are very sparse. One popular way to solve these sparse linear systems is multigrid methods, in particular AMG, because of their numerical scalability. But on modern multi-core architectures, parallel scalability also has to be taken into account. With memory bandwidth usually being the bottleneck of sparse matrix operations, these linear solvers can't always benefit from increasing numbers of cores. To exploit the available aggregated memory bandwidth on larger-scale NUMA machines, evenly distributed data is often more of an issue than load balancing. Additionally, when using a threading model like OpenMP, one has to ensure data locality manually by explicit placement of memory pages. On non-uniform data it is always a tradeoff between these three principles, while the ideal strategy is strongly machine- and application-dependent. In this paper we want to present some benchmarks of an AMG implementation based on a new performance library. The main focus is on the comparability to state-of-the-art solver packages regarding sequential performance as well as parallel scalability on common NUMA machines. To maximize throughput on standard model problems, several thread and memory configurations have been evaluated. We will show that even on large-scale multi-core architectures easy parallel programming models, like OpenMP, can achieve a competitive performance compared to more complex programming models.
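
The citation statement above warns that, when binding with taskset, CPU core numbering is not always as expected. A minimal sketch of how one might inspect the core-to-node mapping before binding is given here, assuming a Linux system that exposes NUMA topology under `/sys/devices/system/node`. The helper names `parse_cpulist` and `numa_topology` are hypothetical and are not taken from any of the cited papers.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_topology(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} read from sysfs (assumed layout); {} if absent."""
    topology = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*", "cpulist")):
        # The parent directory is named "node<N>"; strip the "node" prefix.
        node_id = int(os.path.basename(os.path.dirname(path))[4:])
        with open(path) as f:
            topology[node_id] = parse_cpulist(f.read())
    return topology


if __name__ == "__main__":
    for node, cpus in sorted(numa_topology().items()):
        print(f"node {node}: cpus {cpus}")
```

On some machines the cores of one socket are numbered in an interleaved fashion rather than contiguously, which is why checking the mapping first can help. On Linux, a list obtained this way could then be passed to `os.sched_setaffinity(0, cpus)` to bind the current process.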


“…In [10] the authors present a study of the performance obtained (in relation to the ccNUMA memory) on a Sun Fire Server. The paper also proposes some performance tunings that improve the application performance by up to 30%.…”

Section: Related Work (mentioning)

confidence: 99%

Impact of the Memory Hierarchy on Shared Memory Architectures in Multicore Programming Models

Badía

Perez

Ayguadé

et al. 2009

2009 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing


Many- and multicore architectures put great pressure on parallel programming but give a unique opportunity to propose new programming models that automatically exploit the parallelism of these architectures. OpenMP is a very well-known standard that exploits parallelism in shared memory architectures. SMPSs has recently been proposed as a task-based programming model that exploits parallelism at the task level and takes into account data dependencies between tasks. However, besides parallelism in the programming, the impact of the memory hierarchy in many/multicore architectures is of great importance. This paper presents an evaluation of these two programming models with regard to the impact of different levels of the memory hierarchy on the duration of the application. The evaluation is based on tracefiles with hardware counters from the execution of a memory-intensive benchmark in both programming models.


“…Multi- and many-core processors exhibit even lower latencies for shared data due to on-chip cache space utilization. Earlier studies showed significant performance issues that arise from mis-handling of cache hierarchies in multi-core based systems [1]. Thus, efficient handling of address translation becomes even more crucial as this overhead may easily become the dominant factor in the overall access time for such architectures.…”

Section: Introduction (mentioning)

confidence: 99%

Address Translation Optimization for Unified Parallel C Multi-dimensional Arrays

Serres

Anbar

Merchant

et al. 2011

2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PHD Forum

Self Cite


Partitioned Global Address Space (PGAS) languages offer significant programmability advantages with their global memory view abstraction, one-sided communication constructs and data locality awareness. These attributes place PGAS languages at the forefront of possible solutions to the exploding programming complexity in many-core architectures. To enable the shared address space abstraction, PGAS languages use an address translation mechanism while accessing shared memory to convert shared addresses to physical addresses. This mechanism is already expensive in terms of performance in distributed memory environments, but it becomes a major bottleneck in machines with shared memory support where the access latencies are significantly lower. Multi- and many-core processors exhibit even lower latencies for shared data due to on-chip cache space utilization. Thus, efficient handling of address translation becomes even more crucial as this overhead may easily become the dominant factor in the overall data access time for such architectures. To alleviate address translation overhead, this paper introduces a new mechanism targeting multi-dimensional arrays used in most scientific and image processing applications. Relative costs and the implementation details for UPC are evaluated with different workloads (matrix multiplication, Random Access benchmark and Sobel edge detection) on two different platforms: a many-core system, the TILE64 (a 64-core processor), and a dual-socket, quad-core Intel Nehalem system (up to 16 threads). Our optimization provides substantial performance improvements, up to 40x. In addition, the proposed mechanism can easily be integrated into compilers, abstracting it from programmers. Accordingly, this improves UPC productivity as it will reduce the manual optimization effort required to minimize the address translation overhead.
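
To illustrate the kind of work a shared-address translation involves, the following sketch computes where an element of a one-dimensional block-cyclic UPC shared array lives. It assumes the standard UPC layout rule for `shared [B]` arrays and is not the optimization proposed in the paper. The function name `upc_locate` is hypothetical.

```python
def upc_locate(i, block_size, threads):
    """Map global index i of a block-cyclic shared array to (thread, local_offset).

    Assumes the usual UPC layout: consecutive blocks of `block_size` elements
    are dealt out to threads in round-robin order.
    """
    block = i // block_size          # which block the element falls in
    thread = block % threads         # thread with affinity to that block
    local_block = block // threads   # index of the block within that thread
    phase = i % block_size           # position inside the block
    return thread, local_block * block_size + phase


if __name__ == "__main__":
    # Block size 2 distributed over 4 threads.
    for i in (0, 3, 8, 9):
        print(i, upc_locate(i, block_size=2, threads=4))
```

Each access performs several integer divisions and modulo operations, which hints at why such translation can dominate access time when the underlying memory latency is low.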
