Optimizing Chapel for Single-Node Environments

Johnson, R. Burke; Hollingsworth, Jeffrey K.

doi:10.1109/ipdpsw.2016.181

Cited by 8 publications

(7 citation statements)

References 8 publications


“…Acquiring a lock is equivalent to reading the sync variable and releasing a lock is equivalent to writing to the sync variable. Similar approaches have been taken elsewhere to create arrays of OpenMP locks in Chapel [5]. This technique was functionally correct, but resulted in a significant loss of performance for our application, as discussed in Section V-D.…”

Section: A Mutex Pool (mentioning)

confidence: 88%

“…There has been a significant effort to evaluate and analyze the performance of Chapel programs for both single- and multi-node environments. Johnson and Hollingsworth ported and optimized several C/OpenMP based benchmarks to single-node Chapel including LULESH, MiniMD, and CLOMP [5]. Haque and Richards implemented an optimized multi-node version of CoMD in Chapel as well as identified key limitations of Chapel in regards to scope-based code locality [6].…”

Section: Related Work (mentioning)

confidence: 99%

“…With a long spin-wait period, a significant spin-wait overlap occurs. Following the suggestion in a Chapel GitHub issue [footnote 5], we shortened the amount of spin-waiting by setting QT_SPINCOUNT=300. Doing so further improved the performance of the inverse procedure on 32 threads by 2.3x.…”

Section: E Conflicts Between Qthreads and OpenMP (mentioning)

confidence: 99%

“…With the expressiveness of high-level language constructs in Chapel, users can focus more on the algorithm they are implementing rather than low-level parallelization details. In recent work, Chapel has been shown to have competitive, and in some cases, higher performance than traditional languages and parallel libraries [4], [5], [6].…”

Section: Introduction (mentioning)

confidence: 99%


Parallel Sparse Tensor Decomposition in Chapel

Rolinger; Simon; Krieger (2018)

2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)


In big-data analytics, using tensor decomposition to extract patterns from large, sparse multivariate data is a popular technique. Many challenges exist for designing parallel, high performance tensor decomposition algorithms due to irregular data accesses and the growing size of tensors that are processed. There have been many efforts at implementing shared-memory algorithms for tensor decomposition, most of which have focused on the traditional C/C++ with OpenMP framework. However, Chapel is becoming an increasingly popular programming language due to its expressiveness and simplicity for writing scalable parallel programs. In this work, we port a state-of-the-art C/OpenMP parallel sparse tensor decomposition tool, SPLATT, to Chapel. We present a performance study that investigates bottlenecks in our Chapel code and discusses approaches for improving its performance. Also, we discuss features in Chapel that would have been beneficial to our porting effort. We demonstrate that our Chapel code is competitive with the C/OpenMP code for both runtime and scalability, achieving 83%-96% performance of the original code and near linear scalability up to 32 cores.


“…In addition, each of these tasks is going to execute on a remote locale using the on clause. Then, locale-specific variables are created, such as termination detection flags and state vector (lines 10-12). A second coforall loop-based tasking construct is then used to exploit the intra-node parallel level, creating as many tasks as threads per locale (line 13).…”

Section: Parallel Distributed DFS in Chapel (mentioning)

confidence: 99%

Parallel distributed productivity‐aware tree‐search using Chapel

Helbecque; Gmys; Melab; et al. (2023)

Concurrency and Computation


With the recent arrival of the exascale era, modern supercomputers are increasingly large, making their programming much more complex. In addition to performance, software productivity is a major concern when choosing a programming language, such as Chapel, designed for exascale computing. In this paper, we investigate the design of a parallel distributed tree‐search algorithm, namely P3D‐DFS, and its implementation using Chapel. The design is based on Chapel's DistBag data structure, revisited by: (1) redefining the data structure for Depth‐First tree‐Search (DFS), henceforth renamed DistBag‐DFS; (2) redesigning the underlying load balancing mechanism. In addition, we propose two instantiations of P3D‐DFS considering the Branch‐and‐Bound (B&B) and Unbalanced Tree Search (UTS) algorithms. In order to evaluate how much performance is traded for productivity, we compare the Chapel‐based implementations of B&B and UTS to their best‐known counterparts based on traditional OpenMP (intra‐node) and MPI+X (inter‐node). For experimental validation using 4096 processing cores, we consider the permutation flow‐shop scheduling problem for B&B and synthetic literature benchmarks for UTS. The reported results show that P3D‐DFS competes with its OpenMP baselines for coarser‐grained shared‐memory scenarios, and with its MPI+X counterparts for distributed‐memory settings, considering both performance and productivity‐awareness. In the context of this work, this makes Chapel an alternative to OpenMP/MPI+X for exascale programming.


Data Centric Performance Measurement Techniques for Chapel Programs

Zhang; Hollingsworth (2017)

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
