Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 1997
DOI: 10.1145/263764.263789

Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

Abstract: An elementary, machine-independent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimized code, tracking hand-coded BLAS3 routines. Proof of concept is demonstrated by racing the in-place algorithm against manufacturer's hand-tuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent block-oriented processes, that each writes to a disjoint and contiguous…
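The recursion the abstract describes can be illustrated with a minimal sketch. This is not the paper's code: it assumes square matrices whose dimension is a power of two, stored as row-major lists of lists, and the names `matmul_add` and `leaf` are hypothetical. Each call splits the operands into quadrants and recurses, so the working set shrinks at every level without any machine-specific block size.

```python
def matmul_add(C, A, B, n, ci=0, cj=0, ai=0, aj=0, bi=0, bj=0, leaf=2):
    """Accumulate C += A * B in place on n-by-n sub-blocks.

    (ci, cj), (ai, aj), (bi, bj) are the top-left corners of the sub-blocks
    of C, A and B. Assumption: n is a power of two.
    """
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two size"
    if n <= leaf:
        # Base case: plain triple loop on a small block.
        for i in range(n):
            for k in range(n):
                a = A[ai + i][aj + k]
                for j in range(n):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    h = n // 2
    # Eight recursive quadrant products: C[di][dj] += A[di][dk] * B[dk][dj].
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                matmul_add(C, A, B, h,
                           ci + di, cj + dj,
                           ai + di, aj + dk,
                           bi + dk, bj + dj,
                           leaf)
```

The four (di, dj) combinations write to disjoint quadrants of C, which is the property the abstract points to when it says the code splits into independent block-oriented processes at the top level.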

Cited by 70 publications (60 citation statements)
References 20 publications
“…We have built a prototype compiler to translate C programs using row-major matrices and cartesian indices to Morton-order using dilated indices. We had already demonstrated the ease of tree-wise scheduling parallel processors in [7], and we continue to search for similar quadtree algorithms [17,6].…”
Section: Results (mentioning)
confidence: 99%
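The translation from cartesian to Morton-order indices mentioned in this statement can be sketched as follows. This is an illustration under assumptions, not the cited compiler's output: indices are limited to 16 bits, the column occupies the even bit positions and the row the odd ones, and all function names are hypothetical.

```python
EVEN_BITS = 0x55555555  # bit positions 0, 2, 4, ...
ODD_BITS = 0xAAAAAAAA   # bit positions 1, 3, 5, ...

def dilate(x):
    """Spread the low 16 bits of x so that bit k moves to bit 2k."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def undilate(d):
    """Inverse of dilate: gather the even-position bits back together."""
    d &= 0x55555555
    d = (d | (d >> 1)) & 0x33333333
    d = (d | (d >> 2)) & 0x0F0F0F0F
    d = (d | (d >> 4)) & 0x00FF00FF
    d = (d | (d >> 8)) & 0x0000FFFF
    return d

def morton_index(row, col):
    """Interleave row and col bits (assumed convention: row in odd bits)."""
    return (dilate(row) << 1) | dilate(col)

def dilated_add(a, b):
    """Add two dilated integers without undilating them first.

    Filling the gaps of `a` with ones lets carries ripple across them;
    the final mask clears the gaps again.
    """
    return ((a | ODD_BITS) + b) & EVEN_BITS
```

Keeping loop indices in dilated form and stepping them with `dilated_add` is one way a generated loop can avoid redoing the full bit interleave on every array access.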
“…Fortunately, as the next section shows, most conversions can be elided. It is remarkable how often these basic properties of Morton ordering have been reintroduced in different contexts [3,7,9,12,16]. Samet gives an excellent history [13].…”
Section: Theorem (mentioning)
confidence: 99%
“…The nonlinear layout function we use has been variously described as being based either on quadtrees [16] or on space-filling curves [22,32,34]. This layout is known in parallel computing as the Morton ordering and has been used for load balancing purposes [7,25,26,33,36,40].…”
Section: Algorithm 6: Non-linear Array Layout (mentioning)
confidence: 99%
“…Like traditional tiling techniques [41,75], cache oblivious algorithms for matrix multiply and LU factorization have been shown to asymptotically minimize data movement among various levels of the memory hierarchy, under certain cache modeling assumptions [83,33,1,30]. Unlike tiling, cache-oblivious algorithms do not make explicit reference to a "tile size" tuning parameter, and thus appear to eliminate the need to search for optimal cache tile sizes either by modeling or by empirical search.…”
Section: Dense and Sparse Linear Algebra (mentioning)
confidence: 99%
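For contrast with the recursive approach, the explicit "tile size" parameter this statement refers to can be seen in a minimal tiled multiply. This is a sketch only, assuming square row-major lists of lists; `tiled_matmul_add` and `tile` are hypothetical names, and a tuned library would choose `tile` per cache level by modeling or empirical search.

```python
def tiled_matmul_add(C, A, B, n, tile):
    """Accumulate C += A * B for n-by-n matrices using explicit square tiles."""
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                # Multiply one tile of A by one tile of B into one tile of C.
                # min() handles edge tiles when tile does not divide n.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + tile, n)):
                            C[i][j] += a * B[k][j]
```

The result is the same for any positive `tile`; only the memory access order changes, which is why the parameter is purely a tuning knob.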