Comparative Performance and Optimization of Chapel in Modern Manycore Architectures

Kayraklioglu, Engin; Chang, Wo L.; El‐Ghazawi, Tarek

doi:10.1109/ipdpsw.2017.126

Cited by 8 publications

(8 citation statements)

References 21 publications


“…It supports a similar array of languages and single-node parallel models for nstream, but also supports distributed-memory parallelism (e.g. MPI and PGAS [27]- [29]).…”

Section: B Related Work (mentioning)

confidence: 99%

Benchmarking Fortran DO CONCURRENT on CPUs and GPUs Using BabelStream

Hammond

Deakin

Cownie

et al. 2022

2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems


Fortran DO CONCURRENT has emerged as a new way to achieve parallel execution of loops on CPUs and GPUs. This paper studies the performance portability of this construct on a range of processors and compares it with the incumbent models: OpenMP, OpenACC and CUDA. To do this study fairly, we implemented the BabelStream memory bandwidth benchmark from scratch, entirely in modern Fortran, for all of the models considered, which include Fortran DO CONCURRENT, as well as two variants of OpenACC, four variants of OpenMP (2 CPU and 2 GPU), CUDA Fortran, and both loop- and array-based references. BabelStream Fortran matches the C++ implementation as closely as possible, and can be used to make language-based comparisons. This paper represents one of the first detailed studies of the performance of Fortran support on heterogeneous architectures; we include results for AArch64 and x86_64 CPUs as well as AMD, Intel and NVIDIA GPU platforms.



“…Even if some memory kinds can be configured as a cache level to enable automatic hardware-driven management (e.g. MCDRAM in Intel KNL [18,21,9]), fine-grain data allocation can lead to better performance. Thus, some research papers set up this MC-DRAM as flat mode meaning that a specific action is required to put data into this target memory.…”

Section: Related Work (mentioning)

confidence: 99%

Preliminary Experience with OpenMP Memory Management Implementation

Roussel

Carribault

Jaeger

2020

OpenMP: Portable Multi-Level Parallelism on Modern Systems


Because of the evolution of compute units, memory heterogeneity is becoming popular in HPC systems. But dealing with such various memory levels often requires different approaches and interfaces. For this purpose, OpenMP 5.0 defines memory-management constructs to offer application developers the ability to tackle the issue of exploiting multiple memory spaces in a portable way. This paper proposes an overview of memory-management from applications to runtimes. Thus, we describe a convenient way to tune an application to include memory management constructs. We also detail a methodology to integrate them into an OpenMP runtime supporting multiple memory types (DDR, MC-DRAM and NVDIMM). We implement our design into the MPC framework, while presenting some results on a realistic benchmark.


“…where R is the rank of the decomposition and typically small (35 in our case), and the computation is fairly light. As described in a Chapel GitHub issue 4 , array slicing can be expensive due to computing and creating the domain of the resulting array view and creating and setting up the array descriptor for the view. Our first approach was to eliminate slicing by using direct 2D indexing for matrices, even though it deviated from the reference implementation of SPLATT.…”

Section: Initial (mentioning)

confidence: 99%

“…Each slice only consists of R elements, where R is the rank of the decomposition and typically small (35 in our case), and the computation is fairly light. As described in a Chapel GitHub issue 4 , array slicing can be expensive due to computing and creating the domain of the resulting array view and creating and setting up the array descriptor for the view.…”

Section: MTTKRP Optimizations (mentioning)

confidence: 99%


Parallel Sparse Tensor Decomposition in Chapel

Rolinger

Simon

Krieger

2018

2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)


In big-data analytics, using tensor decomposition to extract patterns from large, sparse multivariate data is a popular technique. Many challenges exist for designing parallel, high performance tensor decomposition algorithms due to irregular data accesses and the growing size of tensors that are processed. There have been many efforts at implementing shared-memory algorithms for tensor decomposition, most of which have focused on the traditional C/C++ with OpenMP framework. However, Chapel is becoming an increasingly popular programming language due to its expressiveness and simplicity for writing scalable parallel programs. In this work, we port a state-of-the-art C/OpenMP parallel sparse tensor decomposition tool, SPLATT, to Chapel. We present a performance study that investigates bottlenecks in our Chapel code and discusses approaches for improving its performance. Also, we discuss features in Chapel that would have been beneficial to our porting effort. We demonstrate that our Chapel code is competitive with the C/OpenMP code for both runtime and scalability, achieving 83%-96% performance of the original code and near linear scalability up to 32 cores.
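
The citation statements quoted for this paper describe replacing per-row array slices with direct 2D indexing when each row holds only R elements. The sketch below is a rough analogy in Python, not Chapel code and not the authors' implementation: a Python list slice copies data, whereas the quoted statement attributes the Chapel cost to building the domain and array descriptor of the view. The function names, the flat row-major layout, and the column-sum workload are all assumptions made for illustration; only R = 35 comes from the quoted statement.

```python
# Illustrative analogy only: plain Python lists, not Chapel arrays or SPLATT code.
R = 35  # rank of the decomposition, the value given in the quoted statement


def make_matrix(n_rows, rank):
    """Build a flat row-major matrix with deterministic values (hypothetical helper)."""
    return [float(i * rank + r) for i in range(n_rows) for r in range(rank)]


def column_sums_with_slices(flat, n_rows, rank):
    """Accumulate over rows, creating one slice per row before touching its elements."""
    acc = [0.0] * rank
    for i in range(n_rows):
        # Per-row setup cost: a new list object is built on every iteration.
        row = flat[i * rank:(i + 1) * rank]
        for r in range(rank):
            acc[r] += row[r]
    return acc


def column_sums_direct(flat, n_rows, rank):
    """Same accumulation with direct 2D indexing and no per-row slice."""
    acc = [0.0] * rank
    for i in range(n_rows):
        base = i * rank  # offset of row i in the flat layout
        for r in range(rank):
            acc[r] += flat[base + r]
    return acc
```

Both functions perform the same additions in the same order, so they return identical results. The second simply avoids the per-row setup work, which matters most when rows are short and the arithmetic per row is light, as in the case the statement describes.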

