2001
DOI: 10.1007/3-540-44688-5_3

Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

Cited by 13 publications (14 citation statements) · References 28 publications
“…Bilardi et al [7] have pointed out that it is possible to optimize memory hierarchy performance by using a Gray code order to schedule the eight sub-problems so that there is always one sub-matrix in common between successive subproblems. One such order can be found in Figure 3(a) if the sub-problems are executed in left-to-right, top-to-bottom order.…”
Section: Naïve Codes (mentioning)
confidence: 99%
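To make the scheduling idea in this excerpt concrete, here is a minimal C sketch, under stated assumptions: square power-of-two matrices in row-major order with leading dimension `ld`, and an illustrative cutoff `BASE` and base kernel `mm_base` that are not the paper's code. Each of the eight subproblems is C_ij += A_ik * B_kj for i, j, k in {0, 1}; consecutive 3-bit Gray codes differ in one bit, so successive recursive calls keep two of the three indices and therefore reuse one sub-matrix.

```c
#include <stddef.h>

#define BASE 64  /* illustrative cutoff; tune per machine */

/* base case: C += A*B on n-by-n row-major blocks, leading dimension ld */
static void mm_base(double *C, const double *A, const double *B,
                    int n, int ld)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
}

/* 3-bit reflected Gray code: consecutive entries differ in one bit */
static const int gray3[8] = {0, 1, 3, 2, 6, 7, 5, 4};

/* C += A*B, n a power of two; the bits of gray3[g] select (i, k, j),
 * so each step flips one index and keeps one operand block in common:
 * flipping i shares B_kj, flipping k shares C_ij, flipping j shares A_ik */
void mm_gray(double *C, const double *A, const double *B, int n, int ld)
{
    if (n <= BASE) {
        mm_base(C, A, B, n, ld);
        return;
    }
    int h = n / 2;
    for (int g = 0; g < 8; g++) {
        int i = (gray3[g] >> 2) & 1;
        int k = (gray3[g] >> 1) & 1;
        int j =  gray3[g]       & 1;
        mm_gray(C + (size_t)(i * h) * ld + j * h,
                A + (size_t)(i * h) * ld + k * h,
                B + (size_t)(k * h) * ld + j * h, h, ld);
    }
}
```

A left-to-right, top-to-bottom reading of the paper's Figure 3(a) gives one such order; the `gray3` table above is the standard reflected Gray code and may differ from the figure's exact sequence.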
“…For example, the HASA dynamic algorithm is bound to have fewer floating point operations than Balanced, because the former applies Strassen's division more times than the latter, especially for non-square matrices; however, the Balanced algorithm achieves on average a 1.3% execution-time reduction whereas HASA dynamic achieves on average 0.5% [footnote 3]. In general, Balanced presents very predictable performance with often better peak performance than HASA dynamic.…”
Section: HP Zv6000 Athlon-64 2GHz Using ATLAS, Data Locality Vs Ope… (mentioning)
confidence: 99%
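For readers puzzled by the flop-count claim in that excerpt: each application of Strassen's division trades the eight half-size multiplications of the classical recursion for seven (at the cost of O(n²) extra additions), so applying it k times before switching to a classical kernel leaves a multiplication count of roughly

$$7^k \cdot 2\left(\frac{n}{2^k}\right)^3 = \left(\frac{7}{8}\right)^k \cdot 2n^3 .$$

This is why an algorithm that applies the division more often performs fewer floating point operations overall; the standard accounting here is our gloss, not a formula quoted from the cited paper.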
“…GotoBLAS peak performance is 4.5 GFLOPS; Balanced and HASA dynamic peak performance is normalized to 5.4 GFLOPS (as comparison). For this architecture (with a faster memory hierarchy and processor than in Section 4.1), the recursion point is empirically found at n1 = 900 and we stop the recursion when a matrix size is smaller than n1. [Footnote 3 reads: The input set has mostly small problems, thus the average time reduction is biased towards small values.]…”
Section: GotoBLAS Strassen Vs Faster MM (mentioning)
confidence: 99%
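The recursion-point idea in this excerpt — apply Strassen's division while the problem is larger than an empirically tuned size n1, and hand anything smaller to a highly tuned kernel — can be sketched as below. This is our illustrative C rendering, not the cited authors' code: matrices are square with power-of-two n for brevity, `kernel_mm` is a naive stand-in for a tuned kernel such as GotoBLAS dgemm, and `N1` plays the role of the empirically found recursion point (900 on the machine quoted above).

```c
#include <stdlib.h>

#define N1 900  /* recursion point; found empirically per machine */

/* stand-in for a tuned kernel (e.g., a vendor dgemm); naive on purpose */
static void kernel_mm(double *C, int ldc, const double *A, int lda,
                      const double *B, int ldb, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] = s;
        }
}

/* T = X + sign*Y into a contiguous h-by-h buffer */
static void addsub(double *T, const double *X, int ldx,
                   const double *Y, int ldy, int h, double sign)
{
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            T[i * h + j] = X[i * ldx + j] + sign * Y[i * ldy + j];
}

/* C = A*B, n a power of two: Strassen's division while n > N1 */
void strassen(double *C, int ldc, const double *A, int lda,
              const double *B, int ldb, int n)
{
    if (n <= N1) {
        kernel_mm(C, ldc, A, lda, B, ldb, n);
        return;
    }
    int h = n / 2;
    size_t s = (size_t)h * h;
    double *buf = malloc(9 * s * sizeof *buf);  /* M1..M7, T1, T2;
                                                   error handling omitted */
    double *M[8], *T1 = buf + 7 * s, *T2 = buf + 8 * s;
    for (int m = 1; m <= 7; m++) M[m] = buf + (size_t)(m - 1) * s;

    const double *A11 = A, *A12 = A + h, *A21 = A + (size_t)h * lda,
                 *A22 = A21 + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + (size_t)h * ldb,
                 *B22 = B21 + h;

    addsub(T1, A11, lda, A22, lda, h, +1);
    addsub(T2, B11, ldb, B22, ldb, h, +1);
    strassen(M[1], h, T1, h, T2, h, h);        /* (A11+A22)(B11+B22) */
    addsub(T1, A21, lda, A22, lda, h, +1);
    strassen(M[2], h, T1, h, B11, ldb, h);     /* (A21+A22) B11 */
    addsub(T2, B12, ldb, B22, ldb, h, -1);
    strassen(M[3], h, A11, lda, T2, h, h);     /* A11 (B12-B22) */
    addsub(T2, B21, ldb, B11, ldb, h, -1);
    strassen(M[4], h, A22, lda, T2, h, h);     /* A22 (B21-B11) */
    addsub(T1, A11, lda, A12, lda, h, +1);
    strassen(M[5], h, T1, h, B22, ldb, h);     /* (A11+A12) B22 */
    addsub(T1, A21, lda, A11, lda, h, -1);
    addsub(T2, B11, ldb, B12, ldb, h, +1);
    strassen(M[6], h, T1, h, T2, h, h);        /* (A21-A11)(B11+B12) */
    addsub(T1, A12, lda, A22, lda, h, -1);
    addsub(T2, B21, ldb, B22, ldb, h, +1);
    strassen(M[7], h, T1, h, T2, h, h);        /* (A12-A22)(B21+B22) */

    /* combine the seven products into the four quadrants of C */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            size_t t = (size_t)i * h + j;
            C[i * ldc + j]           = M[1][t] + M[4][t] - M[5][t] + M[7][t];
            C[i * ldc + j + h]       = M[3][t] + M[5][t];
            C[(i + h) * ldc + j]     = M[2][t] + M[4][t];
            C[(i + h) * ldc + j + h] = M[1][t] - M[2][t] + M[3][t] + M[6][t];
        }
    free(buf);
}
```

With N1 = 900, a power-of-two input such as n = 3600 would not hit the cutoff exactly, which is why the sketch restricts itself to power-of-two n; handling arbitrary and non-square sizes (as the cited algorithms do) requires padding or peeling strategies beyond this illustration.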
“…In this paper, we discuss a single but fundamental algorithm in dense linear algebra: matrix multiply (MM). We propose an algorithm that automatically adapts to any architecture and applies to matrices of any size and shape, stored in double precision in either row-major or column-major layout (i.e., our algorithm is suitable for both C and FORTRAN: algorithms using row-major order [Frens and Wise 1997; Eiron et al. 1998; Whaley and Dongarra 1998; Bilardi et al. 2001], and using column-major order [Higham 1990; Whaley and Dongarra 1998; Goto and van de Geijn 2008]).…”
Section: Introduction (mentioning)
confidence: 99%
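The row-major/column-major claim in this excerpt is easy to illustrate: if every element access goes through explicit row and column strides, a single kernel serves both the C and the FORTRAN convention. A minimal sketch follows; it is our illustration, and the cited paper's actual mechanism may differ.

```c
#include <stddef.h>

typedef struct {
    double *p;          /* base pointer */
    ptrdiff_t rs, cs;   /* row stride, column stride */
} mat;

/* address of element (i, j), independent of storage order */
static inline double *at(mat m, int i, int j)
{
    return m.p + i * m.rs + j * m.cs;
}

/* C += A*B for n-by-n matrices in either layout */
void mm_any_layout(mat C, mat A, mat B, int n)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = *at(A, i, k);
            for (int j = 0; j < n; j++)
                *at(C, i, j) += a * *at(B, k, j);
        }
}

/* row-major (C):      mat m = { p, n, 1 };
   column-major (F77): mat m = { p, 1, n }; */
```

In practice the loop order should be chosen so the innermost loop walks the unit stride; the point here is only that storage layout becomes a runtime parameter rather than two separate code paths.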