Composable Parallel Patterns with Intel Cilk Plus

Robison, Arch D.

doi:10.1109/mcse.2013.21

Cited by 48 publications

(25 citation statements)

References 2 publications


“…OpenMP [10], which we consider a language extension, integrates tasks into its programming interface since version 3.0. In addition to languages and extensions, industry-standard and well-supported parallel libraries based on task parallelism have emerged, such as Intel Cilk Plus [23] or Intel TBB [28]. There are also runtimes specifically designed to improve shared memory performance of existing language extensions, such as Qthreads [27] or Argobots [24]; this topic is of significant importance, considering the increase in many-core processors in recent years and, consequently, the importance of efficient lightweight runtimes.…”

Section: Stefano Markidis Markidis@kthse | mentioning

confidence: 99%

A taxonomy of task-based parallel programming technologies for high-performance computing

et al. 2018


Task-based programming models for shared memory, such as Cilk Plus and OpenMP 3, are well established and documented. However, with the increase in parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We


“…Cilk Plus offers several powerful extensions to C/C++ that allow to express both task and data parallelism [5,8,13,14]. The most important constructs are useful to specify and handle possible parallel execution of tasks.…”

Section: Tbb (Threading Building Blocks) | mentioning

confidence: 99%

Language-based vectorization and parallelization using intrinsics, OpenMP, TBB and Cilk Plus

Stpiczyński

2018

J Supercomput


The aim of this paper is to evaluate OpenMP, TBB and Cilk Plus as basic language-based tools for simple and efficient parallelization of recursively defined computational problems and other problems that need both task and data parallelization techniques. We show how to use these models of parallel programming to transform the source code of Adaptive Simpson's Integration into programs that can utilize multiple cores of modern processors. Using the example of the Bellman-Ford algorithm for solving single-source shortest path problems, we advise how to improve performance of data parallel algorithms by tuning data structures for better utilization of vector extensions of modern processors. Manual vectorization techniques based on Cilk array notation and intrinsics are presented. We also show how to simplify such optimization using Intel SIMD Data Layout Template containers.


“…The combination of MPI with such a model is referred to as an MPI+X approach. Example "+X" software environments include OpenMP, Pthreads, CUDA, OpenCL, CilkPlus, Threading Building Blocks, and Microsoft's Task Parallel Library [8][9][10][11][12][13][14]. Examples of hardware that supports these environments include conventional multicore CPU chips, Intel's Xeon Phi, and GPUs or FPGAs used as coprocessors.…”

Section: Kokkos For Portability | mentioning

confidence: 99%

An MPI+X implementation of contact global search using Kokkos

Hansen

Xavier

Mish

et al. 2015

Engineering with Computers


This paper describes an approach that seeks to parallelize the spatial search associated with computational contact mechanics. In contact mechanics, the purpose of the spatial search is to find "nearest neighbors," which is the prelude to an imprinting search that resolves the interactions between the external surfaces of contacting bodies. In particular, we are interested in the contact global search portion of the spatial search associated with this operation on domain-decomposition-based meshes. Specifically, we describe an implementation that combines standard domain-decomposition-based MPI-parallel spatial search with thread-level parallelism (MPI-X) available on advanced computer architectures (those with GPU coprocessors). Our goal is to demonstrate the efficacy of the MPI-X paradigm in the overall contact search. Standard MPI-parallel implementations typically use a domain decomposition of the external surfaces of bodies within the domain in an attempt to efficiently distribute computational work. This decomposition may or may not be the same as the volume decomposition associated with the host physics. The parallel contact global search phase is then employed to find and distribute surface entities (nodes and faces) that are needed to compute contact constraints between entities owned by different MPI ranks without further inter-rank communication. Key steps of the contact global search include computing bounding boxes, building surface entity (node and face) search trees and finding and distributing entities required to complete on-rank (local) spatial searches. To enable source-code portability and performance across a variety of different computer architectures, we implemented the algorithm using the Kokkos hardware abstraction library. While we targeted development towards machines with a GPU accelerator per MPI rank, we also report performance results for OpenMP with a conventional multi-core compute node per rank. Results here demonstrate a 47% decrease in the time spent within the global search algorithm, comparing the reference ACME algorithm with the GPU implementation, on an 18M face problem using four MPI ranks. While further work remains to maximize performance on the GPU, this result illustrates the potential of the proposed implementation.
