2008
DOI: 10.1145/1356052.1356053

Anatomy of high-performance matrix multiplication

Abstract: We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance.
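To make the layered blocking idea described in the abstract concrete, the sketch below shows a cache-blocked C += A*B in C. The block sizes MC/KC/NC, the column-major storage, and the plain triple-loop inner kernel are illustrative assumptions for this sketch; they are not the packed, architecture-tuned kernels of GotoBLAS itself.

/* Minimal sketch of cache-blocked matrix multiplication in the spirit of the
 * paper's layered approach: C += A * B is decomposed so that a block of A
 * stays resident in cache while it is reused against a panel of B.  The
 * block sizes below are illustrative placeholders, not tuned values. */
#include <stdio.h>
#include <stdlib.h>

#define MC 64   /* rows of the A block kept in cache (assumed value) */
#define KC 64   /* shared dimension of the A block / B panel (assumed value) */
#define NC 128  /* columns of the B panel (assumed value) */

/* Column-major storage: element (i,j) of a matrix is at a[i + j*ld]. */
static void gemm_blocked(int m, int n, int k,
                         const double *a, int lda,
                         const double *b, int ldb,
                         double *c, int ldc)
{
    for (int jc = 0; jc < n; jc += NC) {
        int nb = (n - jc < NC) ? n - jc : NC;
        for (int pc = 0; pc < k; pc += KC) {
            int kb = (k - pc < KC) ? k - pc : KC;
            for (int ic = 0; ic < m; ic += MC) {
                int mb = (m - ic < MC) ? m - ic : MC;
                /* "Inner kernel": multiply an mb-by-kb block of A by a
                 * kb-by-nb panel of B, accumulating into C. */
                for (int j = 0; j < nb; ++j)
                    for (int p = 0; p < kb; ++p) {
                        double bpj = b[(pc + p) + (jc + j) * ldb];
                        for (int i = 0; i < mb; ++i)
                            c[(ic + i) + (jc + j) * ldc] +=
                                a[(ic + i) + (pc + p) * lda] * bpj;
                    }
            }
        }
    }
}

int main(void)
{
    int n = 256;
    double *a = calloc((size_t)n * n, sizeof *a);
    double *b = calloc((size_t)n * n, sizeof *b);
    double *c = calloc((size_t)n * n, sizeof *c);
    for (int i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 2.0; }
    gemm_blocked(n, n, n, a, n, b, n, c, n);
    printf("c[0] = %f (expected %f)\n", c[0], 2.0 * n);
    free(a); free(b); free(c);
    return 0;
}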

Cited by 602 publications (433 citation statements). References 13 publications.
“…Following the characterization of the matrix multiplication in [5], we next analyze the performance of this operation when one of the matrix dimensions (m, n, or k) is small with respect to the other two. This gives us three different kernels: SGEPM (m is small), SGEMP (n is small), and SGEPP (k is small).…”
Section: Evaluation of SGEMM
confidence: 99%
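A small sketch of how the three shape cases named in this excerpt might be distinguished follows; the threshold SMALL and the dispatch function classify are hypothetical, and simply follow the excerpt's mapping of SGEPM (m small), SGEMP (n small), and SGEPP (k small).

/* Illustrative classification of C(m x n) += A(m x k) * B(k x n) by which
 * dimension is small.  The cutoff and dispatch logic are assumptions, not
 * taken from the cited papers. */
#include <stdio.h>

#define SMALL 8  /* hypothetical threshold for a "small" dimension */

typedef enum { GEPM, GEMP, GEPP, GEMM_GENERAL } gemm_shape;

static gemm_shape classify(int m, int n, int k)
{
    if (m <= SMALL && n > SMALL && k > SMALL) return GEPM; /* m small */
    if (n <= SMALL && m > SMALL && k > SMALL) return GEMP; /* n small */
    if (k <= SMALL && m > SMALL && n > SMALL) return GEPP; /* k small */
    return GEMM_GENERAL;
}

int main(void)
{
    printf("%d\n", classify(4, 1024, 1024));   /* prints 0: GEPM */
    printf("%d\n", classify(1024, 4, 1024));   /* prints 1: GEMP */
    printf("%d\n", classify(1024, 1024, 4));   /* prints 2: GEPP */
    return 0;
}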
“…The FFT routines used to evaluate the non-linear terms are those of the FFTW3 library [25]. The action of the linear operators required by the linear solvers, the computation of the non-linear terms, and the Legendre transforms have been implemented as matrix-matrix products using the GotoBLAS library [26] to increase the efficiency of the codes. The block structure of some of the matrices is used to minimize the number of operations.…”
Section: Results
confidence: 99%
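The point of this excerpt, that a transform applied to many vectors at once becomes a single matrix-matrix product which an optimized GEMM can handle, can be illustrated as follows. The matrix sizes and the use of the CBLAS interface are assumptions for this sketch, not details of the cited codes.

/* Applying one transform matrix T to many coefficient vectors (the columns
 * of X) is a single GEMM, Y := T * X, which can be handed to an optimized
 * BLAS such as GotoBLAS.  Sizes here are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    int nmodes = 128;   /* rows of the transform matrix (assumed) */
    int npts   = 200;   /* evaluation points (assumed) */
    int nvec   = 64;    /* number of vectors transformed together (assumed) */

    /* Column-major: T is npts x nmodes, X is nmodes x nvec, Y is npts x nvec. */
    double *t = calloc((size_t)npts * nmodes, sizeof *t);
    double *x = calloc((size_t)nmodes * nvec, sizeof *x);
    double *y = calloc((size_t)npts * nvec, sizeof *y);
    for (int i = 0; i < npts * nmodes; ++i) t[i] = 1.0;
    for (int i = 0; i < nmodes * nvec; ++i) x[i] = 1.0;

    /* One GEMM replaces nvec separate matrix-vector products. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                npts, nvec, nmodes, 1.0, t, npts, x, nmodes, 0.0, y, npts);

    printf("y[0] = %f (expected %d)\n", y[0], nmodes);
    free(t); free(x); free(y);
    return 0;
}

Link against any BLAS that provides the CBLAS interface (GotoBLAS ships one) to build this sketch.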
“…At first sight it could seem from (26) and (27), and the values of the factors given above, that the computational cost of the Q-implicit method is much higher than for the others, and that it does not depend on the order. However, as the order increases, the predictions of the solution at the end of each step, based on extrapolation using the order of the integrator, are better, and then the number of iterations N_GMR to solve the linear system during the corrections is lower.…”
Section: Results
confidence: 99%
“…The commands in the body of the loop map are implemented as calls to Basic Linear Algebra Subprograms (BLAS) [LHKK79,DDCHH88,DDCHD90], an interface to commonly encountered linear algebra operations, as well as other routines supported by libflame, which themselves call BLAS operations. As part of our project, we have derived a full library of these operations, but for this experiment we are depending on optimized implementations provided by the GotoBLAS2 implementation [GvdG08b,GvdG08a]. For the blocked algorithms, a block size of 128 was used.…”
Section: Performance
confidence: 99%
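As an illustration of a blocked algorithm whose loop-body commands map to BLAS calls, here is a sketch of blocked forward substitution with block size 128. The routine trsm_blocked and the sizes in main are hypothetical, and the sketch assumes a CBLAS interface such as the one shipped with GotoBLAS2; it is not the libflame code referred to in the excerpt.

/* Blocked forward substitution, solving L*X = B in place (L lower
 * triangular, column-major).  Each iteration issues two BLAS calls: a small
 * triangular solve on the diagonal block and a GEMM update of the trailing
 * rows. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

#define NB 128  /* block size, matching the value quoted in the excerpt */

static void trsm_blocked(int n, int nrhs,
                         const double *l, int ldl, double *b, int ldb)
{
    for (int k = 0; k < n; k += NB) {
        int kb = (n - k < NB) ? n - k : NB;
        /* Loop-body command 1: triangular solve on the diagonal block. */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasNonUnit, kb, nrhs, 1.0,
                    &l[k + (size_t)k * ldl], ldl, &b[k], ldb);
        /* Loop-body command 2: GEMM update of the trailing rows, where an
         * optimized matrix-matrix kernel does most of the work. */
        if (k + kb < n)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n - k - kb, nrhs, kb, -1.0,
                        &l[(k + kb) + (size_t)k * ldl], ldl,
                        &b[k], ldb, 1.0, &b[k + kb], ldb);
    }
}

int main(void)
{
    int n = 512, nrhs = 32;
    double *l = calloc((size_t)n * n, sizeof *l);
    double *b = malloc((size_t)n * nrhs * sizeof *b);
    for (int i = 0; i < n; ++i) l[i + (size_t)i * n] = 2.0;  /* L = 2*I */
    for (int i = 0; i < n * nrhs; ++i) b[i] = 1.0;
    trsm_blocked(n, nrhs, l, n, b, n);
    printf("x[0] = %f (expected 0.5)\n", b[0]);
    free(l); free(b);
    return 0;
}

The GEMM update on the trailing rows accounts for most of the floating-point work, which is why blocked algorithms of this kind inherit their performance from the underlying matrix-matrix multiplication.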