2016
DOI: 10.1145/2898348
Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems

Abstract: To face the advent of multicore processors and the ever-increasing complexity of hardware architectures, programming models based on DAG parallelism have regained popularity in the high-performance scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This paper ev…

Cited by 41 publications (58 citation statements)
References 32 publications
“…They show the scalability of qrm parsec using both the 1D and 2D front factorization algorithms; the speedups are computed with respect to the sequential running time reported in Table 1. These are compared to the results obtained with an equivalent implementation based on the Sequential Task Flow model and the StarPU runtime system [2]. The results show that qrm parsec achieves a satisfactory performance on all the tested matrices, including the smallest ones (on the left side of the plot) with speedups close to 20 (out of 24) for the largest size ones.…”
Section: Early Experimental Results
confidence: 97%
“…The runs were performed on the Dude system which is a shared-memory machine equipped with four AMD Opteron(tm) Processor 8431 (six cores) and 72 GB of memory. As a reference, we also report on the performance of the STF implementation of the solver from [2], which is supported with StarPU and named qrm starpu below. The experimental results are presented in Figure 4.…”
Section: Early Experimental Results
confidence: 99%
“…This approach is complex and usually requires completely rewriting an application. The second method is the sequential task flow (STF) (Agullo et al, 2016b). Here, a single thread creates the tasks by informing the RS about the access of each of them on the data.…”
Section: Task-based Parallelization
confidence: 99%
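The STF mechanism quoted above — a single thread submits tasks in program order, declaring each task's data accesses, while the runtime infers the dependencies — can be sketched as follows. This is an illustrative toy in Python, not the StarPU or PaRSEC API: the class and method names (`STFRuntime`, `submit`) are invented for the example. It derives the classic read-after-write, write-after-read, and write-after-write dependencies from the declared access modes.

```python
# Toy sketch of the Sequential Task Flow (STF) model: one thread submits
# tasks sequentially, declaring which data each task reads and writes; the
# "runtime" infers dependencies (RAW, WAR, WAW) so that independent tasks
# could be scheduled in parallel. Names are illustrative, not a real API.

from collections import defaultdict

class STFRuntime:
    def __init__(self):
        self.last_writer = {}             # data -> last task that wrote it
        self.readers = defaultdict(list)  # data -> readers since last write
        self.deps = defaultdict(set)      # task -> tasks it must wait for

    def submit(self, task, reads=(), writes=()):
        """Called by the single submitting thread, in program order."""
        for d in reads:
            if d in self.last_writer:          # read-after-write dependency
                self.deps[task].add(self.last_writer[d])
            self.readers[d].append(task)
        for d in writes:
            if d in self.last_writer:          # write-after-write dependency
                self.deps[task].add(self.last_writer[d])
            for r in self.readers[d]:          # write-after-read dependency
                if r != task:
                    self.deps[task].add(r)
            self.last_writer[d] = task
            self.readers[d] = []
        return self.deps[task]

rt = STFRuntime()
rt.submit("factor_A", writes=["A"])
rt.submit("update_B", reads=["A"], writes=["B"])  # waits on factor_A
rt.submit("update_C", reads=["A"], writes=["C"])  # waits on factor_A only
```

Because `update_B` and `update_C` only read `A`, they depend on `factor_A` but not on each other, so a scheduler is free to run them concurrently; this is the property that lets an STF runtime extract DAG parallelism from sequentially submitted code.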
“…The StarPU runtime has been configured with the PRIO scheduler (with a central queue on each node, sorting tasks by priorities given by the developer) and dedicates, on each node, one core for task submission (using the Sequential Task Flow paradigm 45 ) and another core to handle MPI operations. As an introductory illustration, we consider the Chameleon/Cholesky decomposition of an input matrix of dimension 72,000, divided in 75 × 75 tiles of size 960 (ie, with 75 dpotrf tasks), executed on two nodes comprising five CPU and two GPU workers each, and interconnected through a 10 Gb/s Ethernet network.…”
Section: Visualization Panels
confidence: 99%
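The arithmetic in the Chameleon/Cholesky example above can be checked directly: a matrix of dimension 72,000 split into tiles of size 960 gives 75 × 75 tiles, hence 75 dpotrf tasks (one per diagonal tile). The sketch below also counts the other kernels of a standard tiled right-looking Cholesky; these counts are textbook formulas, not output from Chameleon, and the function name is invented for the example.

```python
# Task counts for a tiled right-looking Cholesky factorization of an n x n
# matrix with square tiles of size `tile` (assuming tile divides n evenly).
# At step k (of t steps): 1 dpotrf, (t-1-k) dtrsm, (t-1-k) dsyrk, and
# C(t-1-k, 2) dgemm tasks; summing over k gives the closed forms below.

def tiled_cholesky_task_counts(n, tile):
    t = n // tile                              # tiles per dimension
    return {
        "tiles_per_dim": t,
        "dpotrf": t,                           # one per diagonal tile
        "dtrsm": t * (t - 1) // 2,             # panel solves
        "dsyrk": t * (t - 1) // 2,             # diagonal-tile updates
        "dgemm": t * (t - 1) * (t - 2) // 6,   # off-diagonal updates
    }

counts = tiled_cholesky_task_counts(72_000, 960)
```

With these parameters the count of dpotrf tasks is 75, matching the "75 × 75 tiles of size 960 (ie, with 75 dpotrf tasks)" stated in the quotation.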