The objective of the PULSAR project was to design a programming model suitable for large-scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing, and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.

Keywords: runtime scheduling, dataflow scheduling, distributed computing, massively parallel computing, multicore processors, hardware accelerators, virtualization, systolic arrays.
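To make the model concrete before the details, the toy sketch below shows what point-to-point channels connecting virtual processors can look like. Every name in it (channel_t, vdp_fire, the bounded ring buffer) is an illustrative invention for this sketch, not PULSAR's actual API:

    #include <stdio.h>

    #define CAPACITY 8

    /* A bounded point-to-point channel carrying double-precision packets. */
    typedef struct {
        double data[CAPACITY];
        int head, tail, count;
    } channel_t;

    /* Push returns 0 when the channel is full; the producer retries later. */
    static int channel_push(channel_t *ch, double v) {
        if (ch->count == CAPACITY) return 0;
        ch->data[ch->tail] = v;
        ch->tail = (ch->tail + 1) % CAPACITY;
        ch->count++;
        return 1;
    }

    /* Pop returns 0 when the channel is empty; the consumer retries later. */
    static int channel_pop(channel_t *ch, double *v) {
        if (ch->count == 0) return 0;
        *v = ch->data[ch->head];
        ch->head = (ch->head + 1) % CAPACITY;
        ch->count--;
        return 1;
    }

    /* One virtual processor: consume a packet from its input channel,
     * apply a local update, and forward the result downstream. */
    static void vdp_fire(channel_t *in, channel_t *out, double weight) {
        double v;
        if (channel_pop(in, &v))
            channel_push(out, v * weight);
    }

    int main(void) {
        channel_t c0 = {{0}, 0, 0, 0};   /* source  -> stage A */
        channel_t c1 = {{0}, 0, 0, 0};   /* stage A -> stage B */
        channel_t c2 = {{0}, 0, 0, 0};   /* stage B -> sink    */

        for (int i = 1; i <= 4; i++)     /* inject four packets */
            channel_push(&c0, (double)i);

        for (int step = 0; step < 8; step++) {  /* systolic "clock" ticks */
            vdp_fire(&c1, &c2, 0.5);            /* downstream stage fires first */
            vdp_fire(&c0, &c1, 2.0);
        }

        double v;
        while (channel_pop(&c2, &v))     /* drain the sink: prints 1 2 3 4 */
            printf("%g\n", v);
        return 0;
    }

In a real distributed runtime the channel operations would span nodes and devices; the fixed-capacity ring buffer here is only a stand-in for whatever flow control such a runtime provides.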
Introduction

Motivation

High-end supercomputers are on a steady path of growth in size and complexity. One can get a fairly reasonable picture of the road ahead by examining the platforms that will be brought online under the DOE's CORAL initiative. In 2018, the DOE aims to deploy three CORAL platforms, each exceeding the 150-petaflop peak performance level. Two systems, named Summit and Sierra, based on the IBM OpenPOWER platform with NVIDIA GPU accelerators, were selected for Oak Ridge National Laboratory and Lawrence Livermore National Laboratory; an Intel system, based on the Xeon Phi platform and named Aurora, was selected for Argonne National Laboratory.

Summit and Sierra will follow the hybrid computing model, coupling powerful latency-optimized processors with highly parallel throughput-optimized accelerators. They will rely on IBM POWER9 CPUs, NVIDIA Volta GPUs, the NVIDIA NVLink interconnect to connect the hybrid devices within each node, and a dual-rail Mellanox EDR InfiniBand interconnect to connect the nodes. The Aurora system, by contrast, will offer a more homogeneous model by utilizing the Knights Hill Xeon Phi architecture, which, unlike the current Knights Corner model, will be a stand-alone processor rather than a slot-in coprocessor, and will also include an integrated Omni-Path communication fabric. All platforms will benefit from recent advances in 3D-stacked memory technology.

Overall, both types of systems promise major performance improvements: CPU memory bandwidth is expected to be between 200 GB/s and 300 GB/s using HMC; GPU memory bandwidth is expected to approach 1 TB/s using HBM; GPU memory capacity is expected to reach 60 GB (NVIDIA Volta); NVLink is expected to deliver no less than 80 GB/s, and possibly as high as 200 GB/s, of CPU-to-GPU bandwidth. In terms of computing power, Knights Hill is expected to deliver between 3.6 and 9 teraFLOPS, while NVIDIA Volta is expected to reach around 10 teraFLOPS.

And yet, taking a wider perspective, the challenges are severe for the software developers who have to extract performance from these systems. The hybrid computing model seems to be here to stay, and me...
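To put those projections in perspective, a back-of-the-envelope calculation (ours, derived from the figures quoted above, not from vendor roadmaps) shows how much computation a kernel must perform per byte of data movement just to keep a 10-teraFLOPS accelerator busy:

    10 teraFLOPS / 1 TB/s  = 10 flop/byte    (data resident in GPU memory)
    10 teraFLOPS / 80 GB/s = 125 flop/byte   (data crossing NVLink)

Kernels with lower arithmetic intensity will be bound by bandwidth rather than by compute, which puts a premium on software that keeps data local and overlaps communication with computation.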