2008
DOI: 10.1145/1356052.1356055

Cache efficient bidiagonalization using BLAS 2.5 operators

Abstract: On cache-based computer architectures using current standard algorithms, Householder bidiagonalization requires a significant portion of the execution time for computing matrix singular values and vectors. In this paper we reorganize the sequence of operations for Householder bidiagonalization of a general m × n matrix, so that two (GEMV) vector-matrix multiplications can be done with one pass of the unreduced trailing part of the matrix through cache. Two new BLAS 2.5 operations approximately cut in half the…
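
For intuition, the sketch below shows how such a fused kernel can make one pass over the matrix. It follows the GEMVT convention x := beta*A^T*y + z, w := alpha*A*x; the function name, column-major layout, and plain unblocked loops are illustrative assumptions rather than the paper's tuned implementation.

```c
/* Minimal unblocked sketch of a GEMVT-style BLAS 2.5 kernel:
 *   x := beta * A^T * y + z
 *   w := alpha * A * x
 * Because x[j] depends only on column j of A, both products can be
 * formed while that column is cache-resident, so A streams from main
 * memory once instead of twice. */
#include <stddef.h>

void gemvt_fused(size_t m, size_t n, double alpha, double beta,
                 const double *A, size_t lda,
                 const double *y, const double *z,
                 double *x, double *w)
{
    for (size_t i = 0; i < m; ++i)
        w[i] = 0.0;

    for (size_t j = 0; j < n; ++j) {
        const double *Aj = A + j * lda;     /* column j of A */

        double dot = 0.0;                   /* (A^T y)[j] */
        for (size_t i = 0; i < m; ++i)
            dot += Aj[i] * y[i];
        x[j] = beta * dot + z[j];

        /* reuse column j while it is still in cache */
        for (size_t i = 0; i < m; ++i)
            w[i] += alpha * x[j] * Aj[i];
    }
}
```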

Cited by 24 publications (39 citation statements) · References 21 publications

“…However, there is potential for speedup via this operation, too. The authors of [Van Zee et al 2012], building on the efforts of [Howell et al 2008], report on an implementation of reduction to bidiagonal form that is 60% faster, asymptotically, than the reference implementation provided by netlib LAPACK. For cases where m = n, we found the bidiagonal reduction to constitute anywhere from 40 to 60% of the total SVD run time when using the restructured QR algorithm.…”
Section: General Singular Value Decomposition
confidence: 99%
“…Computer scientists apply tuning techniques to improve data locality and create highly efficient implementations of the Basic Linear Algebra Subprograms (BLAS) [5,18,23,28,49] and LAPACK [6], enabling scientists to build high-performance software at reduced cost. While tuned libraries for the level 3 BLAS and LAPACK routines perform at or near machine peak, level 1 and 2 BLAS routines, in which there is less data reuse, achieve only a fraction of peak [27]. However, sequences of level 1 and 2 BLAS routines appear in many scientific applications and these sequences represent further opportunities for tuning.…”
Section: Introduction
confidence: 99%
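
To make such a sequence concrete, the unfused baseline below issues the two matrix-vector products of a bidiagonalization step as back-to-back level 2 BLAS calls, so the trailing matrix streams through cache twice; the wrapper name and the use of independent input vectors are illustrative assumptions.

```c
/* Unfused baseline: two consecutive level 2 BLAS calls, each of which
 * reads all of the m x n matrix A from memory.  A fused BLAS 2.5
 * kernel performs the same work in a single pass over A. */
#include <cblas.h>

void two_gemv_passes(int m, int n, const double *A, int lda,
                     const double *u, double *x,    /* x := A^T u */
                     const double *v, double *w)    /* w := A v   */
{
    cblas_dgemv(CblasColMajor, CblasTrans,   m, n, 1.0, A, lda, u, 1, 0.0, x, 1);
    cblas_dgemv(CblasColMajor, CblasNoTrans, m, n, 1.0, A, lda, v, 1, 0.0, w, 1);
}
```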
“…The main optimization technique they use is blocking to improve the reuse of data in caches, registers, and the TLB (Goto and van de Geijn, 2008). However, for the BLAS level 1 and 2 operations, which have a lower ratio of floating-point operations to memory accesses, performance is a fraction of peak due to bandwidth limitations (Howell et al, 2008).…”
Section: Introduction
confidence: 99%
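
As a toy illustration of that blocking technique (not Goto's actual kernel), the sketch below tiles a matrix multiply so each block is reused while cache-resident; the block size is an arbitrary placeholder, and a production BLAS would add packing, register blocking, and architecture-specific inner kernels on top.

```c
/* Cache-blocked C := C + A*B for column-major matrices.  Each NB-sized
 * block triple is multiplied while its operands stay in cache, so
 * every loaded element is reused roughly NB times. */
#include <stddef.h>

enum { NB = 64 };  /* illustrative block size, not a tuned value */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, size_t lda,
                  const double *B, size_t ldb,
                  double *C, size_t ldc)
{
    for (size_t jj = 0; jj < n; jj += NB)
        for (size_t pp = 0; pp < k; pp += NB)
            for (size_t ii = 0; ii < m; ii += NB)
                /* multiply one block triple */
                for (size_t j = jj; j < min_sz(jj + NB, n); ++j)
                    for (size_t p = pp; p < min_sz(pp + NB, k); ++p) {
                        double b = B[p + j * ldb];
                        for (size_t i = ii; i < min_sz(ii + NB, m); ++i)
                            C[i + j * ldc] += A[i + p * lda] * b;
                    }
}
```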
“…Scientific applications often require sequences of BLAS level 1 and 2 operations and many researchers have observed that such sequences, when implemented as a single specialized routine, can be optimized to reduce memory traffic (Baker et al, 2006; Howell et al, 2008; Vuduc et al, 2003). This phenomenon motivated the recent addition of kernels such as GEMVER and GEMVT to the BLAS (Blackford et al, 2002) and their use in Householder bidiagonalization in LAPACK (Howell et al, 2008).…”
Section: Introduction
confidence: 99%
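
For reference, GEMVER, following its usual statement in the updated BLAS, fuses a rank-2 update with the two matrix-vector products: A := A + u1*v1^T + u2*v2^T, then x := beta*A^T*y + z and w := alpha*A*x. The unblocked sketch below (function name illustrative) shows how all three steps can share one sweep over the columns, replacing a sequence of two GER and two GEMV calls.

```c
/* Unblocked sketch of a GEMVER-style fused kernel:
 *   A := A + u1*v1^T + u2*v2^T   (rank-2 update, in place)
 *   x := beta * A^T * y + z      (using the updated A)
 *   w := alpha * A * x
 * Each column of A is updated and then reused for both products while
 * it is cache-resident, instead of being streamed four separate times
 * by GER, GER, GEMV(transposed), and GEMV. */
#include <stddef.h>

void gemver_fused(size_t m, size_t n, double alpha, double beta,
                  double *A, size_t lda,
                  const double *u1, const double *v1,
                  const double *u2, const double *v2,
                  const double *y, const double *z,
                  double *x, double *w)
{
    for (size_t i = 0; i < m; ++i)
        w[i] = 0.0;

    for (size_t j = 0; j < n; ++j) {
        double *Aj = A + j * lda;                    /* column j of A */

        double dot = 0.0;
        for (size_t i = 0; i < m; ++i) {
            Aj[i] += u1[i] * v1[j] + u2[i] * v2[j];  /* rank-2 update */
            dot   += Aj[i] * y[i];                   /* (A^T y)[j]    */
        }
        x[j] = beta * dot + z[j];

        for (size_t i = 0; i < m; ++i)
            w[i] += alpha * x[j] * Aj[i];            /* accumulate A*x */
    }
}
```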