2008
DOI: 10.1145/1356052.1356055

Cache efficient bidiagonalization using BLAS 2.5 operators

Abstract: On cache-based computer architectures using current standard algorithms, Householder bidiagonalization requires a significant portion of the execution time for computing matrix singular values and vectors. In this paper we reorganize the sequence of operations for Householder bidiagonalization of a general m × n matrix, so that two (GEMV) vector-matrix multiplications can be done with one pass of the unreduced trailing part of the matrix through cache. Two new BLAS 2.5 operations approximately cut in half the…
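
For intuition, the sketch below shows how such a fused kernel can make one pass over the matrix. It follows the GEMVT convention x := beta*A^T*y + z, w := alpha*A*x; the function name, column-major layout, and plain unblocked loops are illustrative assumptions rather than the paper's tuned implementation.

```c
/* Minimal unblocked sketch of a GEMVT-style BLAS 2.5 kernel:
 *   x := beta * A^T * y + z
 *   w := alpha * A * x
 * Because x[j] depends only on column j of A, both products can be
 * formed while that column is cache-resident, so A streams from main
 * memory once instead of twice. */
#include <stddef.h>

void gemvt_fused(size_t m, size_t n, double alpha, double beta,
                 const double *A, size_t lda,
                 const double *y, const double *z,
                 double *x, double *w)
{
    for (size_t i = 0; i < m; ++i)
        w[i] = 0.0;

    for (size_t j = 0; j < n; ++j) {
        const double *Aj = A + j * lda;     /* column j of A */

        double dot = 0.0;                   /* (A^T y)[j] */
        for (size_t i = 0; i < m; ++i)
            dot += Aj[i] * y[i];
        x[j] = beta * dot + z[j];

        /* reuse column j while it is still in cache */
        for (size_t i = 0; i < m; ++i)
            w[i] += alpha * x[j] * Aj[i];
    }
}
```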

Cited by 24 publications (39 citation statements) · References 21 publications

“…However, there is potential for speedup via this operation, too. The authors of [Van Zee et al 2012], building on the efforts of [Howell et al 2008], report on an implementation of reduction to bidiagonal form that is 60% faster, asymptotically, than the reference implementation provided by netlib LAPACK. For cases where m = n, we found the bidiagonal reduction to constitute anywhere from 40 to 60% of the total SVD run time when using the restructured QR algorithm.…”
Section: General Singular Value Decomposition
confidence: 99%
“…Computer scientists apply tuning techniques to improve data locality and create highly efficient implementations of the Basic Linear Algebra Subprograms (BLAS) [5,18,23,28,49] and LAPACK [6], enabling scientists to build high-performance software at reduced cost. While tuned libraries for the level 3 BLAS and LAPACK routines perform at or near machine peak, level 1 and 2 BLAS routines, in which there is less data reuse, achieve only a fraction of peak [27]. However, sequences of level 1 and 2 BLAS routines appear in many scientific applications and these sequences represent further opportunities for tuning.…”
Section: Introduction
confidence: 99%
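
To make such a sequence concrete, the unfused baseline below issues the two matrix-vector products of a bidiagonalization step as back-to-back level 2 BLAS calls, so the trailing matrix streams through cache twice; the wrapper name and the use of independent input vectors are illustrative assumptions.

```c
/* Unfused baseline: two consecutive level 2 BLAS calls, each of which
 * reads all of the m x n matrix A from memory.  A fused BLAS 2.5
 * kernel performs the same work in a single pass over A. */
#include <cblas.h>

void two_gemv_passes(int m, int n, const double *A, int lda,
                     const double *u, double *x,    /* x := A^T u */
                     const double *v, double *w)    /* w := A v   */
{
    cblas_dgemv(CblasColMajor, CblasTrans,   m, n, 1.0, A, lda, u, 1, 0.0, x, 1);
    cblas_dgemv(CblasColMajor, CblasNoTrans, m, n, 1.0, A, lda, v, 1, 0.0, w, 1);
}
```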
“…The main optimization technique they use is blocking to improve the reuse of data in caches, registers, and the TLB (Goto and van de Geijn, 2008). However, for the BLAS level 1 and 2 operations, which have a lower ratio of floating-point operations to memory accesses, performance is a fraction of peak due to bandwidth limitations (Howell et al, 2008).…”
Section: Introduction
confidence: 99%
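
As a toy illustration of that blocking technique (not Goto's actual kernel), the sketch below tiles a matrix multiply so each block is reused while cache-resident; the block size is an arbitrary placeholder, and a production BLAS would add packing, register blocking, and architecture-specific inner kernels on top.

```c
/* Cache-blocked C := C + A*B for column-major matrices.  Each NB-sized
 * block triple is multiplied while its operands stay in cache, so
 * every loaded element is reused roughly NB times. */
#include <stddef.h>

enum { NB = 64 };  /* illustrative block size, not a tuned value */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, size_t lda,
                  const double *B, size_t ldb,
                  double *C, size_t ldc)
{
    for (size_t jj = 0; jj < n; jj += NB)
        for (size_t pp = 0; pp < k; pp += NB)
            for (size_t ii = 0; ii < m; ii += NB)
                /* multiply one block triple */
                for (size_t j = jj; j < min_sz(jj + NB, n); ++j)
                    for (size_t p = pp; p < min_sz(pp + NB, k); ++p) {
                        double b = B[p + j * ldb];
                        for (size_t i = ii; i < min_sz(ii + NB, m); ++i)
                            C[i + j * ldc] += A[i + p * lda] * b;
                    }
}
```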
“…Scientific applications often require sequences of BLAS level 1 and 2 operations and many researchers have observed that such sequences, when implemented as a single specialized routine, can be optimized to reduce memory traffic (Baker et al, 2006; Howell et al, 2008; Vuduc et al, 2003). This phenomenon motivated the recent addition of kernels such as GEMVER and GEMVT to the BLAS (Blackford et al, 2002) and their use in Householder bidiagonalization in LAPACK (Howell et al, 2008).…”
Section: Introduction
confidence: 99%
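
For reference, GEMVER, following its usual statement in the updated BLAS, fuses a rank-2 update with the two matrix-vector products: A := A + u1*v1^T + u2*v2^T, then x := beta*A^T*y + z and w := alpha*A*x. The unblocked sketch below (function name illustrative) shows how all three steps can share one sweep over the columns, replacing a sequence of two GER and two GEMV calls.

```c
/* Unblocked sketch of a GEMVER-style fused kernel:
 *   A := A + u1*v1^T + u2*v2^T   (rank-2 update, in place)
 *   x := beta * A^T * y + z      (using the updated A)
 *   w := alpha * A * x
 * Each column of A is updated and then reused for both products while
 * it is cache-resident, instead of being streamed four separate times
 * by GER, GER, GEMV(transposed), and GEMV. */
#include <stddef.h>

void gemver_fused(size_t m, size_t n, double alpha, double beta,
                  double *A, size_t lda,
                  const double *u1, const double *v1,
                  const double *u2, const double *v2,
                  const double *y, const double *z,
                  double *x, double *w)
{
    for (size_t i = 0; i < m; ++i)
        w[i] = 0.0;

    for (size_t j = 0; j < n; ++j) {
        double *Aj = A + j * lda;                    /* column j of A */

        double dot = 0.0;
        for (size_t i = 0; i < m; ++i) {
            Aj[i] += u1[i] * v1[j] + u2[i] * v2[j];  /* rank-2 update */
            dot   += Aj[i] * y[i];                   /* (A^T y)[j]    */
        }
        x[j] = beta * dot + z[j];

        for (size_t i = 0; i < m; ++i)
            w[i] += alpha * x[j] * Aj[i];            /* accumulate A*x */
    }
}
```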