2006
DOI: 10.1007/11558958_30

A Family of High-Performance Matrix Multiplication Algorithms

Abstract: During the last half-decade, a number of research efforts have centered around developing software for generating automatically tuned matrix multiplication kernels. These include the PHiPAC project and the ATLAS project. The software end-products of both projects employ brute force to search a parameter space for blockings that accommodate multiple levels of memory hierarchy. We take a different approach: using a simple model of hierarchical memories we employ mathematics to determine a locally-optima…
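The blocked approach the abstract alludes to can be illustrated with a short sketch. The following C routine is not the paper's kernel; the block sizes MC, KC, and NC are illustrative placeholders rather than the model-derived values the authors compute. It partitions the operands so that each block is intended to fit in a level of the memory hierarchy, with a naive triple loop standing in for a tuned inner kernel.

#include <stddef.h>

/* Illustrative cache-blocking parameters; placeholders only. In the paper's
 * approach these would be derived from a model of the memory hierarchy
 * rather than found by exhaustive search. */
enum { MC = 128, KC = 256, NC = 512 };

/* C := C + A*B, column-major storage with leading dimensions lda, ldb, ldc.
 * The three outer loops partition the operands into cache-sized blocks;
 * the inner loops form a simple reference "kernel". */
static void gemm_blocked(size_t m, size_t n, size_t k,
                         const double *A, size_t lda,
                         const double *B, size_t ldb,
                         double *C, size_t ldc)
{
    for (size_t jc = 0; jc < n; jc += NC) {
        size_t nb = (n - jc < NC) ? n - jc : NC;
        for (size_t pc = 0; pc < k; pc += KC) {
            size_t kb = (k - pc < KC) ? k - pc : KC;
            for (size_t ic = 0; ic < m; ic += MC) {
                size_t mb = (m - ic < MC) ? m - ic : MC;
                /* "Inner kernel": multiply an mb x kb block of A by a
                 * kb x nb block of B, accumulating into C. */
                for (size_t j = 0; j < nb; ++j)
                    for (size_t p = 0; p < kb; ++p) {
                        double bpj = B[(pc + p) + (jc + j) * ldb];
                        for (size_t i = 0; i < mb; ++i)
                            C[(ic + i) + (jc + j) * ldc] +=
                                A[(ic + i) + (pc + p) * lda] * bpj;
                    }
            }
        }
    }
}

In the PHiPAC/ATLAS approach the block sizes are found by searching a parameter space; the contrast drawn in the abstract is that they can instead be chosen analytically from a model of the caches.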

Cited by 22 publications (9 citation statements). References 10 publications.
“…It is well-known that high performance can be achieved in a portable fashion by casting algorithms in terms of matrix-matrix multiplication [13,10,14,8]. In Figure 2 we show LINPACK(-like) and LAPACK blocked algorithms, LU_lin_blk and LU_lap_blk respectively, both built upon an LAPACK unblocked algorithm.…”
Section: Blocked Right-Looking LU Factorization (mentioning)
confidence: 99%
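As a concrete illustration of casting an algorithm in terms of matrix-matrix multiplication, here is a minimal sketch of a blocked right-looking LU factorization in C. It omits pivoting and is not the LAPACK or LINPACK routine the excerpt refers to; it only shows that, after each narrow panel is factored, the bulk of the flops land in the trailing-matrix update, which is a matrix-matrix multiply.

#include <stddef.h>

#define A(i, j) a[(i) + (j) * lda]   /* column-major element access */

/* Unblocked right-looking LU (no pivoting) of the m x nb panel whose
 * top-left element is A(k,k); overwrites it with its L and U factors. */
static void lu_unblocked(double *a, size_t lda,
                         size_t k, size_t m, size_t nb)
{
    for (size_t j = k; j < k + nb; ++j)
        for (size_t i = j + 1; i < k + m; ++i) {
            A(i, j) /= A(j, j);                   /* multiplier */
            for (size_t p = j + 1; p < k + nb; ++p)
                A(i, p) -= A(i, j) * A(j, p);     /* update within panel */
        }
}

/* Blocked right-looking LU (no pivoting) of the n x n matrix a. */
void lu_blocked(double *a, size_t lda, size_t n, size_t nb)
{
    for (size_t kk = 0; kk < n; kk += nb) {
        size_t b = (n - kk < nb) ? n - kk : nb;

        /* 1. Factor the current panel A[kk:n, kk:kk+b] unblocked. */
        lu_unblocked(a, lda, kk, n - kk, b);

        /* 2. A12 := L11^{-1} * A12: unit lower triangular solve applied
         *    to the block row A[kk:kk+b, kk+b:n]. */
        for (size_t j = kk + b; j < n; ++j)
            for (size_t p = kk; p < kk + b; ++p)
                for (size_t i = p + 1; i < kk + b; ++i)
                    A(i, j) -= A(i, p) * A(p, j);

        /* 3. Trailing update A22 := A22 - A21 * A12: a matrix-matrix
         *    multiply, where nearly all of the work is performed. */
        for (size_t j = kk + b; j < n; ++j)
            for (size_t p = kk; p < kk + b; ++p) {
                double apj = A(p, j);
                for (size_t i = kk + b; i < n; ++i)
                    A(i, j) -= A(i, p) * apj;
            }
    }
}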
“…Thus, it reveals that autotuning is unnecessary for the operation that has been touted by the autotuning community as the example of the success of autotuning. The problem with that work ([Yotov et al. 2005]) is that the ATLAS approach to optimizing gemm had been previously shown to be suboptimal, first in theory [Gunnels et al. 2001] and then in practice [Goto and van de Geijn 2008b]. Furthermore, ATLAS leverages an inner kernel optimized by a human expert, which still involves a substantial manual encoding.…”
Section: Introduction (mentioning)
confidence: 99%
“…-As mentioned previously, given that BLIS isolates performance-sensitive code to a few simple kernels, the framework may aid those who wish to automate the generation of high-performance linear algebra libraries from domain and hardware specifications [Püschel et al. 2005; Marker et al. 2012; Siek et al. 2008; Belter et al. 2009]. -As computing systems become less reliable, whether because of quantum physical effects, power consumption restrictions, or outright power failures, the community may become increasingly interested in adding algorithmic fault-tolerance to the BLAS (or BLAS-equivalent) layer of the dense linear algebra software stack [Gunnels et al. 2001b; Huang and Abraham 1984]. We plan to investigate the suitability of BLIS as a vehicle to provide such fault-tolerance.…”
Section: Discussion (mentioning)
confidence: 99%
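The algorithmic fault-tolerance mentioned in the excerpt (Huang and Abraham 1984) exploits checksum relations that matrix multiplication preserves. A minimal sketch, assuming column-major storage with leading dimensions equal to the row counts and a hypothetical tolerance parameter tol, checks that the column sums of a computed C = A*B agree with the sums predicted from A and B; a mismatch beyond roundoff signals a fault.

#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Checksum-style verification of C = A*B: because e^T C = (e^T A) B,
 * each column sum of C can be predicted from A and B and compared
 * against the computed result. `tol` is a hypothetical tolerance. */
bool gemm_checksum_ok(size_t m, size_t n, size_t k,
                      const double *A, const double *B, const double *C,
                      double tol)
{
    for (size_t j = 0; j < n; ++j) {
        /* Column sum of the computed C(:,j). */
        double c_sum = 0.0;
        for (size_t i = 0; i < m; ++i)
            c_sum += C[i + j * m];

        /* Predicted sum: sum_p (sum_i A(i,p)) * B(p,j). */
        double predicted = 0.0;
        for (size_t p = 0; p < k; ++p) {
            double a_col_sum = 0.0;
            for (size_t i = 0; i < m; ++i)
                a_col_sum += A[i + p * m];
            predicted += a_col_sum * B[p + j * k];
        }

        if (fabs(c_sum - predicted) > tol)
            return false;   /* fault (or excessive roundoff) detected */
    }
    return true;
}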
“…This idea is not new [Goto and van de Geijn 2008a, 2008b; Gunnels et al. 2001b; Whaley and Dongarra 1998]. Section 5 discusses how these level-3 operations are implemented in the BLIS framework so that flexibility (i.e., generality), portability, and high performance are simultaneously achieved.…”
Section: Level-3: Matrix-Matrix Operations (mentioning)
confidence: 99%