Parallel Two-Sided Matrix Reduction to Band Bidiagonal Form on Multicore Architectures

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

et al. 2009

The emergence and continuing use of multi-core architectures require changes in the existing software and sometimes even a redesign of the established algorithms in order to take advantage of now prevailing parallelism. The Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) is a project that aims to achieve both high performance and portability across a wide range of multi-core architectures. We present in this paper a comparative study of PLASMA's performance against established linear algebra packages (LAPACK and ScaLAPACK), against new approaches at parallel execution (Task Based Linear Algebra Subroutines, TBLAS), and against equivalent commercial software offerings (MKL, ESSL and PESSL). Our experiments were conducted on one-sided linear algebra factorizations (LU, QR and Cholesky) and used multi-core architectures (based on Intel Xeon EMT64 and IBM Power6). The performance results show improvements brought by new algorithms on up to 32 cores, the largest multi-core system we could access.

Section: Discussion

mentioning

confidence: 99%

Comparative study of one-sided factorizations with multiple software packages on multi-core hardware

Agullo

Hadri

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

et al. 2009

“…The reduction to symmetric band tridiagonal form can be easily derived for the upper case. All the operations will then be based on the LQ factorization numerical kernels, as described in Ltaief et al. [18]. Most of the kernels from the first stage are compute-intensive and rely on Level 3 BLAS operations (i.e., matrix-matrix multiplication) to achieve high performance.…”

Section: High Performance Fine-grained and Memory-aware Kernels

mentioning

confidence: 99%

Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels

Haidar

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Dongarra

2011

This paper introduces a novel implementation in reducing a symmetric dense matrix to tridiagonal form, which is the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a two-stage approach, where the tile matrix is first reduced to symmetric band form prior to the final condensed structure. The challenging trade-off between algorithmic performance and task granularity has been tackled through a grouping technique, which consists of aggregating fine-grained and memory-aware computational tasks during both stages, while sustaining the application's overall high performance. A dynamic runtime environment system then schedules the different tasks in an out-of-order fashion. The performance for the tridiagonal reduction reported in this paper is unprecedented. Our implementation results in up to 50-fold and 12-fold improvement (130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight-socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 × 24000.

“…The development of high performance DLA algorithms for homogeneous multicores has been successful in some cases, like the one-sided factorizations [4], and difficult for others, like the two-sided factorizations [5]. The situation is similar for GPUs: some algorithms map well, others are more challenging.…”

Section: Hybrid DLA Algorithms

mentioning

confidence: 99%

Dense linear algebra solvers for multicore with GPU accelerators

Tomov

Nath

2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW)

et al. 2010

Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems represented in terms of unknown variables and relations between them often lead to linear systems of equations that must be solved as fast as possible. We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore with GPU accelerators. We describe how to code/develop solvers to effectively use the high computing power available in these new and emerging hybrid architectures. The approach taken is based on hybridization techniques in the context of Cholesky, LU, and QR factorizations. We use a high-level parallel programming model and leverage existing software infrastructure, e.g., optimized BLAS for CPU and GPU, and LAPACK for sequential CPU processing. Included also are architecture and algorithm-specific optimizations for standard solvers as well as mixed-precision iterative refinement solvers. The new algorithms, depending on the hardware configuration and routine parameters, can lead to orders of magnitude acceleration when compared to the same algorithms on standard multicore architectures that do not contain GPU accelerators. The newly developed DLA solvers are integrated and freely available through the MAGMA library.
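The mixed-precision iterative refinement idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the MAGMA implementation and uses no MAGMA or LAPACK calls: it is plain Python, single precision is only emulated by rounding through `struct`, and the function names (`to_f32`, `lu_factor_f32`, `lu_solve_f32`, `refine_solve`) are hypothetical. The assumed scheme is the textbook one: factor once in low precision, then repeatedly compute the residual in double precision and correct the solution using the cheap low-precision factors.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_f32(a):
    """LU with partial pivoting; every stored value is rounded to float32."""
    n = len(a)
    lu = [[to_f32(v) for v in row] for row in a]
    perm = list(range(n))  # perm[i] = original row now sitting at position i
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = to_f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = to_f32(lu[i][j] - to_f32(lu[i][k] * lu[k][j]))
    return lu, perm


def lu_solve_f32(lu, perm, b):
    """Forward and backward substitution in emulated float32."""
    n = len(lu)
    y = [to_f32(b[perm[i]]) for i in range(n)]
    for i in range(n):  # solve L y = P b (unit lower triangular)
        for j in range(i):
            y[i] = to_f32(y[i] - to_f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):  # solve U x = y
        for j in range(i + 1, n):
            y[i] = to_f32(y[i] - to_f32(lu[i][j] * y[j]))
        y[i] = to_f32(y[i] / lu[i][i])
    return y


def refine_solve(a, b, tol=1e-12, max_iter=20):
    """Solve a x = b: float32 factorization, float64 residual correction."""
    n = len(a)
    lu, perm = lu_factor_f32(a)          # expensive step, low precision
    x = lu_solve_f32(lu, perm, b)        # initial low-precision solution
    b_norm = max(abs(v) for v in b)
    for _ in range(max_iter):
        # Residual in double precision (native Python floats).
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * b_norm:
            break
        d = lu_solve_f32(lu, perm, r)    # cheap correction, low precision
        x = [x[i] + d[i] for i in range(n)]
    return x


if __name__ == "__main__":
    a = [[4.0, 1.0, 2.0], [1.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
    b = [8.0, 0.0, 14.0]  # chosen so the exact solution is [1, -2, 3]
    print(refine_solve(a, b))
```

The design point is that the cubic-cost factorization runs only once, in the faster precision, while each refinement step costs only a matrix-vector product and two triangular solves. The loop is only expected to converge when the matrix is reasonably well conditioned relative to single precision.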