Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators

Dongarra, Jack; Gates, Mark; Kurzak, Jakub; Łuszczek, Piotr; Tsai, Yaohung M.

doi:10.1109/jproc.2018.2868961

Cited by 13 publications

(5 citation statements)

References 29 publications


“…Another promising direction is automatically auto-tuning the algorithmic parameters of a program based upon the downstream choice of hardware. This facilitates easier deployment by tailoring the program to achieve good performance and load balancing on a variety of hardware (Dongarra et al, 2018; Clint Whaley et al, 2001; Asanović et al, 2006; Ansel et al, 2014).…”

Section: A Software Revolution (mentioning)

confidence: 99%

The Hardware Lottery

Hooker

2020

Preprint


Hardware, systems and algorithms research communities have historically had different incentive structures and fluctuating motivation to engage with each other explicitly. This historical treatment is odd given that hardware and software have frequently determined which research ideas succeed (and fail). This essay introduces the term hardware lottery to describe when a research idea wins because it is suited to the available software and hardware and not because the idea is superior to alternative research directions. Examples from early computer science history illustrate how hardware lotteries can delay research progress by casting successful ideas as failures. These lessons are particularly salient given the advent of domain specialized hardware which make it increasingly costly to stray off of the beaten path of research ideas. This essay posits that the gains from progress in computing are likely to become even more uneven, with certain research directions moving into the fast-lane while progress on others is further obstructed.


“…Given the diverse, evolving, and possibly heterogeneous architectures on which software must run, automatic ways to select the various algorithmic parameters will be increasingly needed in order to achieve good performance, energy efficiency, load balancing, and so on. Autotuning is already routinely used for core numerical linear algebra algorithms; see, e.g., [100], [101], and the references therein.…”

Section: Towards HPC's Next Scale (mentioning)

confidence: 99%

“…Autotuning is already routinely used for core numerical linear algebra algorithms, see, e.g. [99,100], and references therein.…”

Section: (A) Asynchronous Algorithms (mentioning)

confidence: 99%

Numerical algorithms for high-performance computational science

Dongarra

Grigori

Higham

2020

Phil. Trans. R. Soc. A.

Self Cite


A number of features of today’s high-performance computers make it challenging to exploit these machines fully for computational science. These include increasing core counts but stagnant clock frequencies; the high cost of data movement; use of accelerators (GPUs, FPGAs, coprocessors), making architectures increasingly heterogeneous; and multiple precisions of floating-point arithmetic, including half-precision. Moreover, as well as maximizing speed and accuracy, minimizing energy consumption is an important criterion. New generations of algorithms are needed to tackle these challenges. We discuss some approaches that we can take to develop numerical algorithms for high-performance computational science, with a view to exploiting the next generation of supercomputers. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
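
The citation statements for this entry refer to autotuning as an automatic way to select algorithmic parameters. As a rough illustration only, the sketch below shows one common form of it: time a kernel for each candidate parameter value and keep the fastest. The kernel `blocked_sum_of_squares`, the helper `autotune`, and the candidate block sizes are hypothetical names and values chosen for this example; they are not taken from the cited papers, which target GPU linear algebra kernels and far larger search spaces.

```python
import time


def blocked_sum_of_squares(data, block):
    """Toy kernel (hypothetical): sum of squares computed in chunks of `block` elements."""
    total = 0.0
    for start in range(0, len(data), block):
        total += sum(x * x for x in data[start:start + block])
    return total


def autotune(kernel, data, candidates, repeats=3):
    """Time `kernel` for every candidate parameter and return (best, timings).

    Each candidate is run `repeats` times and its fastest run is kept,
    which reduces the influence of timing noise.
    """
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, block)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best
    return min(timings, key=timings.get), timings


data = [float(i) for i in range(20000)]
best_block, timings = autotune(blocked_sum_of_squares, data, [16, 64, 256, 1024])
print("selected block size:", best_block)
```

The selected value depends on the machine the search runs on, which is the point of the technique: the same program can settle on different parameters on different hardware.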


“…In numerical linear algebra the term "batched computation" is well-established, signifying a simultaneous processing of a large quantity of relatively small problems, e.g., the LU and the Cholesky factorizations [15] and the corresponding linear system solving [16] on the GPUs, with appropriate data layouts. It is therefore both justifiable and convenient to reuse the term in the present context.…”

Section: Introduction (mentioning)

confidence: 99%

Batched computation of the singular value decompositions of order two by the AVX-512 vectorization

Novaković

2020

Preprint


In this paper a vectorized algorithm for simultaneously computing up to eight singular value decompositions (SVDs, each of the form A = UΣV*) of real or complex matrices of order two is proposed. The algorithm extends to a batch of matrices of an arbitrary length n, that arises, for example, in the annihilation part of the parallel Kogbetliantz algorithm for the SVD of a square matrix of order 2n. The SVD algorithm for a single matrix of order two is derived first. It scales, in most instances error-free, the input matrix A such that its singular values Σ_ii cannot overflow whenever its elements are finite, and then computes the URV factorization of the scaled matrix, followed by the SVD of a non-negative upper-triangular middle factor. A vector-friendly data layout for the batch is then introduced, where the same-indexed elements of each of the input and the output matrices form vectors, and the algorithm's steps over such vectors are described. The vectorized approach is then shown to be about three times faster than processing each matrix in isolation, while slightly improving accuracy over the straightforward method for the 2 × 2 SVD.
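
The data layout described in this abstract, where same-indexed elements across the batch form vectors, can be sketched in a few lines. The example below is an assumption-laden illustration, not the paper's algorithm: it handles real matrices only, computes singular values but not U and V, skips the scaling and URV steps, and uses the textbook closed form based on the Frobenius norm and determinant, so it can overflow for large inputs. The function names `to_soa` and `batched_singular_values` are hypothetical.

```python
import math


def to_soa(batch):
    """Split a list of 2x2 matrices [[a, b], [c, d]] into four same-indexed vectors."""
    a11 = [m[0][0] for m in batch]
    a12 = [m[0][1] for m in batch]
    a21 = [m[1][0] for m in batch]
    a22 = [m[1][1] for m in batch]
    return a11, a12, a21, a22


def batched_singular_values(a11, a12, a21, a22):
    """Singular values of every real 2x2 matrix in the batch (textbook closed form).

    Uses s1^2 + s2^2 = ||A||_F^2 and s1 * s2 = |det A|. No scaling is applied,
    so very large or very small entries may overflow or underflow.
    """
    s_max, s_min = [], []
    for a, b, c, d in zip(a11, a12, a21, a22):
        f = a * a + b * b + c * c + d * d          # squared Frobenius norm
        det = a * d - b * c
        disc = math.sqrt(max(f * f - 4.0 * det * det, 0.0))
        big = math.sqrt((f + disc) / 2.0)          # larger singular value
        small = abs(det) / big if big > 0.0 else 0.0
        s_max.append(big)
        s_min.append(small)
    return s_max, s_min


batch = [[[3.0, 0.0], [0.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
s_max, s_min = batched_singular_values(*to_soa(batch))
print(s_max, s_min)
```

In pure Python the loop still visits one matrix at a time; the layout matters because each of the four vectors is contiguous, which is what lets a SIMD unit or GPU apply the same arithmetic step to many matrices at once.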
