2008
DOI: 10.1145/1356052.1356053

Anatomy of high-performance matrix multiplication

Abstract: We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance.
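To make the layered blocking idea described in the abstract concrete, the sketch below shows a cache-blocked C += A*B in C. The block sizes MC/KC/NC, the column-major storage, and the plain triple-loop inner kernel are illustrative assumptions for this sketch; they are not the packed, architecture-tuned kernels of GotoBLAS itself.

/* Minimal sketch of cache-blocked matrix multiplication in the spirit of the
 * paper's layered approach: C += A * B is decomposed so that a block of A
 * stays resident in cache while it is reused against a panel of B.  The
 * block sizes below are illustrative placeholders, not tuned values. */
#include <stdio.h>
#include <stdlib.h>

#define MC 64   /* rows of the A block kept in cache (assumed value) */
#define KC 64   /* shared dimension of the A block / B panel (assumed value) */
#define NC 128  /* columns of the B panel (assumed value) */

/* Column-major storage: element (i,j) of a matrix is at a[i + j*ld]. */
static void gemm_blocked(int m, int n, int k,
                         const double *a, int lda,
                         const double *b, int ldb,
                         double *c, int ldc)
{
    for (int jc = 0; jc < n; jc += NC) {
        int nb = (n - jc < NC) ? n - jc : NC;
        for (int pc = 0; pc < k; pc += KC) {
            int kb = (k - pc < KC) ? k - pc : KC;
            for (int ic = 0; ic < m; ic += MC) {
                int mb = (m - ic < MC) ? m - ic : MC;
                /* "Inner kernel": multiply an mb-by-kb block of A by a
                 * kb-by-nb panel of B, accumulating into C. */
                for (int j = 0; j < nb; ++j)
                    for (int p = 0; p < kb; ++p) {
                        double bpj = b[(pc + p) + (jc + j) * ldb];
                        for (int i = 0; i < mb; ++i)
                            c[(ic + i) + (jc + j) * ldc] +=
                                a[(ic + i) + (pc + p) * lda] * bpj;
                    }
            }
        }
    }
}

int main(void)
{
    int n = 256;
    double *a = calloc((size_t)n * n, sizeof *a);
    double *b = calloc((size_t)n * n, sizeof *b);
    double *c = calloc((size_t)n * n, sizeof *c);
    for (int i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 2.0; }
    gemm_blocked(n, n, n, a, n, b, n, c, n);
    printf("c[0] = %f (expected %f)\n", c[0], 2.0 * n);
    free(a); free(b); free(c);
    return 0;
}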

Cited by 602 publications (433 citation statements). References 13 publications.
“…Following the characterization of the matrix multiplication in [5], we next analyze the performance of this operation when one of the matrix dimensions (m, n, or k) is small with respect to the other two. This gives us three different kernels: SGEPM (m is small), SGEMP (n is small), and SGEPP (k is small).…”
Section: Evaluation of SGEMM
confidence: 99%
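A small sketch of how the three shape cases named in this excerpt might be distinguished follows; the threshold SMALL and the dispatch function classify are hypothetical, and simply follow the excerpt's mapping of SGEPM (m small), SGEMP (n small), and SGEPP (k small).

/* Illustrative classification of C(m x n) += A(m x k) * B(k x n) by which
 * dimension is small.  The cutoff and dispatch logic are assumptions, not
 * taken from the cited papers. */
#include <stdio.h>

#define SMALL 8  /* hypothetical threshold for a "small" dimension */

typedef enum { GEPM, GEMP, GEPP, GEMM_GENERAL } gemm_shape;

static gemm_shape classify(int m, int n, int k)
{
    if (m <= SMALL && n > SMALL && k > SMALL) return GEPM; /* m small */
    if (n <= SMALL && m > SMALL && k > SMALL) return GEMP; /* n small */
    if (k <= SMALL && m > SMALL && n > SMALL) return GEPP; /* k small */
    return GEMM_GENERAL;
}

int main(void)
{
    printf("%d\n", classify(4, 1024, 1024));   /* prints 0: GEPM */
    printf("%d\n", classify(1024, 4, 1024));   /* prints 1: GEMP */
    printf("%d\n", classify(1024, 1024, 4));   /* prints 2: GEPP */
    return 0;
}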
“…The FFT routines used to evaluate the non-linear terms are those of the FFTW3 library [25]. The action of the linear operators required by the linear solvers, the computation of the non-linear terms, and the Legendre transforms have been implemented as matrix-matrix products using the GotoBLAS library [26] to increase the efficiency of the codes. The block structure of some of the matrices is used to minimize the number of operations.…”
Section: Results
confidence: 99%
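The point of this excerpt, that a transform applied to many vectors at once becomes a single matrix-matrix product which an optimized GEMM can handle, can be illustrated as follows. The matrix sizes and the use of the CBLAS interface are assumptions for this sketch, not details of the cited codes.

/* Applying one transform matrix T to many coefficient vectors (the columns
 * of X) is a single GEMM, Y := T * X, which can be handed to an optimized
 * BLAS such as GotoBLAS.  Sizes here are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    int nmodes = 128;   /* rows of the transform matrix (assumed) */
    int npts   = 200;   /* evaluation points (assumed) */
    int nvec   = 64;    /* number of vectors transformed together (assumed) */

    /* Column-major: T is npts x nmodes, X is nmodes x nvec, Y is npts x nvec. */
    double *t = calloc((size_t)npts * nmodes, sizeof *t);
    double *x = calloc((size_t)nmodes * nvec, sizeof *x);
    double *y = calloc((size_t)npts * nvec, sizeof *y);
    for (int i = 0; i < npts * nmodes; ++i) t[i] = 1.0;
    for (int i = 0; i < nmodes * nvec; ++i) x[i] = 1.0;

    /* One GEMM replaces nvec separate matrix-vector products. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                npts, nvec, nmodes, 1.0, t, npts, x, nmodes, 0.0, y, npts);

    printf("y[0] = %f (expected %d)\n", y[0], nmodes);
    free(t); free(x); free(y);
    return 0;
}

Link against any BLAS that provides the CBLAS interface (GotoBLAS ships one) to build this sketch.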
“…At first sight it could seem from (26) and (27), and the values of the factors given above, that the computational cost of the Q-implicit method is much higher than for the others, and that it does not depend on the order. However, as the order increases, the predictions of the solution at the end of each step, based on extrapolation using the order of the integrator, are better, and then the number of iterations N_GMR to solve the linear system during the corrections is lower.…”
Section: Results
confidence: 99%
“…The commands in the body of the loop map are implemented as calls to Basic Linear Algebra Subprograms (BLAS) [LHKK79,DDCHH88,DDCHD90], an interface to commonly encountered linear algebra operations, as well as other routines supported by libflame, which themselves call BLAS operations. As part of our project, we have derived a full library of these operations, but for this experiment we are depending on optimized implementations provided by the GotoBLAS2 implementation [GvdG08b,GvdG08a]. For the blocked algorithms, a block size of 128 was used.…”
Section: Performance
confidence: 99%
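As an illustration of a blocked algorithm whose loop-body commands map to BLAS calls, here is a sketch of blocked forward substitution with block size 128. The routine trsm_blocked and the sizes in main are hypothetical, and the sketch assumes a CBLAS interface such as the one shipped with GotoBLAS2; it is not the libflame code referred to in the excerpt.

/* Blocked forward substitution, solving L*X = B in place (L lower
 * triangular, column-major).  Each iteration issues two BLAS calls: a small
 * triangular solve on the diagonal block and a GEMM update of the trailing
 * rows. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

#define NB 128  /* block size, matching the value quoted in the excerpt */

static void trsm_blocked(int n, int nrhs,
                         const double *l, int ldl, double *b, int ldb)
{
    for (int k = 0; k < n; k += NB) {
        int kb = (n - k < NB) ? n - k : NB;
        /* Loop-body command 1: triangular solve on the diagonal block. */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasNonUnit, kb, nrhs, 1.0,
                    &l[k + (size_t)k * ldl], ldl, &b[k], ldb);
        /* Loop-body command 2: GEMM update of the trailing rows, where an
         * optimized matrix-matrix kernel does most of the work. */
        if (k + kb < n)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n - k - kb, nrhs, kb, -1.0,
                        &l[(k + kb) + (size_t)k * ldl], ldl,
                        &b[k], ldb, 1.0, &b[k + kb], ldb);
    }
}

int main(void)
{
    int n = 512, nrhs = 32;
    double *l = calloc((size_t)n * n, sizeof *l);
    double *b = malloc((size_t)n * nrhs * sizeof *b);
    for (int i = 0; i < n; ++i) l[i + (size_t)i * n] = 2.0;  /* L = 2*I */
    for (int i = 0; i < n * nrhs; ++i) b[i] = 1.0;
    trsm_blocked(n, nrhs, l, n, b, n);
    printf("x[0] = %f (expected 0.5)\n", b[0]);
    free(l); free(b);
    return 0;
}

The GEMM update on the trailing rows accounts for most of the floating-point work, which is why blocked algorithms of this kind inherit their performance from the underlying matrix-matrix multiplication.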