2001
DOI: 10.1007/3-540-44688-5_3

Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

Cited by 13 publications (14 citation statements) · References 28 publications
“…Bilardi et al [7] have pointed out that it is possible to optimize memory hierarchy performance by using a Gray code order to schedule the eight sub-problems so that there is always one sub-matrix in common between successive subproblems. One such order can be found in Figure 3(a) if the sub-problems are executed in left-to-right, top-to-bottom order.…”
Section: Naïve Codes (mentioning)
confidence: 99%
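To make the scheduling idea in this excerpt concrete, here is a minimal C sketch, under stated assumptions: square power-of-two matrices in row-major order with leading dimension `ld`, and an illustrative cutoff `BASE` and base kernel `mm_base` that are not the paper's code. Each of the eight subproblems is C_ij += A_ik * B_kj for i, j, k in {0, 1}; consecutive 3-bit Gray codes differ in one bit, so successive recursive calls keep two of the three indices and therefore reuse one sub-matrix.

```c
#include <stddef.h>

#define BASE 64  /* illustrative cutoff; tune per machine */

/* base case: C += A*B on n-by-n row-major blocks, leading dimension ld */
static void mm_base(double *C, const double *A, const double *B,
                    int n, int ld)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
}

/* 3-bit reflected Gray code: consecutive entries differ in one bit */
static const int gray3[8] = {0, 1, 3, 2, 6, 7, 5, 4};

/* C += A*B, n a power of two; the bits of gray3[g] select (i, k, j),
 * so each step flips one index and keeps one operand block in common:
 * flipping i shares B_kj, flipping k shares C_ij, flipping j shares A_ik */
void mm_gray(double *C, const double *A, const double *B, int n, int ld)
{
    if (n <= BASE) {
        mm_base(C, A, B, n, ld);
        return;
    }
    int h = n / 2;
    for (int g = 0; g < 8; g++) {
        int i = (gray3[g] >> 2) & 1;
        int k = (gray3[g] >> 1) & 1;
        int j =  gray3[g]       & 1;
        mm_gray(C + (size_t)(i * h) * ld + j * h,
                A + (size_t)(i * h) * ld + k * h,
                B + (size_t)(k * h) * ld + j * h, h, ld);
    }
}
```

A left-to-right, top-to-bottom reading of the paper's Figure 3(a) gives one such order; the `gray3` table above is the standard reflected Gray code and may differ from the figure's exact sequence.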
“…For example, the HASA dynamic algorithm is bound to have fewer floating point operations than Balanced, because the former applies Strassen's division more times than the latter, especially for non-square matrices; however, the Balanced algorithm achieves on average a 1.3% execution-time reduction whereas HASA dynamic achieves on average 0.5% [footnote 3]. In general, Balanced presents very predictable performance with often better peak performance than HASA dynamic.…”
Section: HP Zv6000 Athlon-64 2GHz Using ATLAS, Data Locality Vs Ope… (mentioning)
confidence: 99%
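For readers puzzled by the flop-count claim in that excerpt: each application of Strassen's division trades the eight half-size multiplications of the classical recursion for seven (at the cost of O(n²) extra additions), so applying it k times before switching to a classical kernel leaves a multiplication count of roughly

$$7^k \cdot 2\left(\frac{n}{2^k}\right)^3 = \left(\frac{7}{8}\right)^k \cdot 2n^3 .$$

This is why an algorithm that applies the division more often performs fewer floating point operations overall; the standard accounting here is our gloss, not a formula quoted from the cited paper.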
“…GotoBLAS peak performance is 4.5 GFLOPS; Balanced and HASA dynamic peak performance is normalized to 5.4 GFLOPS (as comparison). For this architecture (with a faster memory hierarchy and processor than in Section 4.1), the recursion point is empirically found at n1 = 900 and we stop the recursion when a matrix size is smaller than n1. [Footnote 3 reads: The input set has mostly small problems, thus the average time reduction is biased towards small values.]…”
Section: GotoBLAS Strassen Vs Faster MM (mentioning)
confidence: 99%
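The recursion-point idea in this excerpt — apply Strassen's division while the problem is larger than an empirically tuned size n1, and hand anything smaller to a highly tuned kernel — can be sketched as below. This is our illustrative C rendering, not the cited authors' code: matrices are square with power-of-two n for brevity, `kernel_mm` is a naive stand-in for a tuned kernel such as GotoBLAS dgemm, and `N1` plays the role of the empirically found recursion point (900 on the machine quoted above).

```c
#include <stdlib.h>

#define N1 900  /* recursion point; found empirically per machine */

/* stand-in for a tuned kernel (e.g., a vendor dgemm); naive on purpose */
static void kernel_mm(double *C, int ldc, const double *A, int lda,
                      const double *B, int ldb, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] = s;
        }
}

/* T = X + sign*Y into a contiguous h-by-h buffer */
static void addsub(double *T, const double *X, int ldx,
                   const double *Y, int ldy, int h, double sign)
{
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            T[i * h + j] = X[i * ldx + j] + sign * Y[i * ldy + j];
}

/* C = A*B, n a power of two: Strassen's division while n > N1 */
void strassen(double *C, int ldc, const double *A, int lda,
              const double *B, int ldb, int n)
{
    if (n <= N1) {
        kernel_mm(C, ldc, A, lda, B, ldb, n);
        return;
    }
    int h = n / 2;
    size_t s = (size_t)h * h;
    double *buf = malloc(9 * s * sizeof *buf);  /* M1..M7, T1, T2;
                                                   error handling omitted */
    double *M[8], *T1 = buf + 7 * s, *T2 = buf + 8 * s;
    for (int m = 1; m <= 7; m++) M[m] = buf + (size_t)(m - 1) * s;

    const double *A11 = A, *A12 = A + h, *A21 = A + (size_t)h * lda,
                 *A22 = A21 + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + (size_t)h * ldb,
                 *B22 = B21 + h;

    addsub(T1, A11, lda, A22, lda, h, +1);
    addsub(T2, B11, ldb, B22, ldb, h, +1);
    strassen(M[1], h, T1, h, T2, h, h);        /* (A11+A22)(B11+B22) */
    addsub(T1, A21, lda, A22, lda, h, +1);
    strassen(M[2], h, T1, h, B11, ldb, h);     /* (A21+A22) B11 */
    addsub(T2, B12, ldb, B22, ldb, h, -1);
    strassen(M[3], h, A11, lda, T2, h, h);     /* A11 (B12-B22) */
    addsub(T2, B21, ldb, B11, ldb, h, -1);
    strassen(M[4], h, A22, lda, T2, h, h);     /* A22 (B21-B11) */
    addsub(T1, A11, lda, A12, lda, h, +1);
    strassen(M[5], h, T1, h, B22, ldb, h);     /* (A11+A12) B22 */
    addsub(T1, A21, lda, A11, lda, h, -1);
    addsub(T2, B11, ldb, B12, ldb, h, +1);
    strassen(M[6], h, T1, h, T2, h, h);        /* (A21-A11)(B11+B12) */
    addsub(T1, A12, lda, A22, lda, h, -1);
    addsub(T2, B21, ldb, B22, ldb, h, +1);
    strassen(M[7], h, T1, h, T2, h, h);        /* (A12-A22)(B21+B22) */

    /* combine the seven products into the four quadrants of C */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            size_t t = (size_t)i * h + j;
            C[i * ldc + j]           = M[1][t] + M[4][t] - M[5][t] + M[7][t];
            C[i * ldc + j + h]       = M[3][t] + M[5][t];
            C[(i + h) * ldc + j]     = M[2][t] + M[4][t];
            C[(i + h) * ldc + j + h] = M[1][t] - M[2][t] + M[3][t] + M[6][t];
        }
    free(buf);
}
```

With N1 = 900, a power-of-two input such as n = 3600 would not hit the cutoff exactly, which is why the sketch restricts itself to power-of-two n; handling arbitrary and non-square sizes (as the cited algorithms do) requires padding or peeling strategies beyond this illustration.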
“…In this paper, we discuss a single but fundamental algorithm in dense linear algebra: matrix multiply (MM). We propose an algorithm that automatically adapts to any architecture and applies to matrices of any size and shape, stored in double precision in either row-major or column-major layout (i.e., our algorithm is suitable for both C and FORTRAN: algorithms using row-major order [Frens and Wise 1997; Eiron et al. 1998; Whaley and Dongarra 1998; Bilardi et al. 2001], and using column-major order [Higham 1990; Whaley and Dongarra 1998; Goto and van de Geijn 2008]).…”
Section: Introduction (mentioning)
confidence: 99%
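The row-major/column-major claim in this excerpt is easy to illustrate: if every element access goes through explicit row and column strides, a single kernel serves both the C and the FORTRAN convention. A minimal sketch follows; it is our illustration, and the cited paper's actual mechanism may differ.

```c
#include <stddef.h>

typedef struct {
    double *p;          /* base pointer */
    ptrdiff_t rs, cs;   /* row stride, column stride */
} mat;

/* address of element (i, j), independent of storage order */
static inline double *at(mat m, int i, int j)
{
    return m.p + i * m.rs + j * m.cs;
}

/* C += A*B for n-by-n matrices in either layout */
void mm_any_layout(mat C, mat A, mat B, int n)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = *at(A, i, k);
            for (int j = 0; j < n; j++)
                *at(C, i, j) += a * *at(B, k, j);
        }
}

/* row-major (C):      mat m = { p, n, 1 };
   column-major (F77): mat m = { p, 1, n }; */
```

In practice the loop order should be chosen so the innermost loop walks the unit stride; the point here is only that storage layout becomes a runtime parameter rather than two separate code paths.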