Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

Frens, Jeremy D.; Wise, David S.

doi:10.1145/263764.263789

Cited by 70 publications

(60 citation statements)

References 20 publications


“…We have built a prototype compiler to translate C programs using row-major matrices and cartesian indices to Morton-order using dilated indices. We had already demonstrated the ease of tree-wise scheduling parallel processors in [7], and we continue to search for similar quadtree algorithms [17,6].…”

Section: Results (mentioning)

confidence: 99%


Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free

Wise

2000

Euro-Par 2000 Parallel Processing

Self Cite


Abstract. Definitions for the uniform representation of d-dimensional matrices serially in Morton-order (or Z-order) support both their use with cartesian indices, and their divide-and-conquer manipulation as quaternary trees. In the latter case, d-dimensional arrays are accessed as 2^d-ary trees. This data structure is important because, at once, it relaxes serious problems of locality and latency, and the tree helps schedule multiprocessing. It enables algorithms that avoid cache misses and page faults at all levels in hierarchical memory, independently of a specific runtime environment. This paper gathers the properties of Morton order and its mappings to other indexings, and outlines compiler support for it. Statistics elsewhere show that the new ordering and block algorithms achieve high flop rates and, indirectly, parallelism without any low-level tuning.
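The mapping from cartesian indices to Morton order described in this abstract can be sketched in a few lines. The following is a minimal illustration, not code from the paper: the function names `dilate`, `morton_index`, and `morton_layout` are hypothetical, the 16-bit index limit is an assumption, and placing row bits in the odd positions and column bits in the even positions is one common convention among several.

```python
def dilate(x: int) -> int:
    """Spread the low 16 bits of x so that bit k moves to bit 2k."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x


def morton_index(row: int, col: int) -> int:
    """Interleave the bits of row and col (assumed convention:
    row bits land in odd positions, column bits in even positions)."""
    return (dilate(row) << 1) | dilate(col)


def morton_layout(n: int) -> list:
    """Return an n-by-n table whose (i, j) entry is the Morton offset of (i, j)."""
    return [[morton_index(i, j) for j in range(n)] for i in range(n)]
```

For a 4-by-4 matrix, `morton_layout(4)` places each 2-by-2 quadrant in a contiguous run of four offsets, which is the property that lets the same array be traversed as a quadtree.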


Section: Results (mentioning)

confidence: 99%

“…Fortunately, as the next section shows, most conversions can be elided. It is remarkable how often these basic properties of Morton ordering have been reintroduced in different contexts [3,7,9,12,16]. Samet gives an excellent history [13].…”

Section: Theorem (mentioning)

confidence: 99%

Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free

Wise

2000

Euro-Par 2000 Parallel Processing

Self Cite


“…The nonlinear layout function we use has been variously described as being based either on quadtrees [16] or on space-filling curves [22,32,34]. This layout is known in parallel computing as the Morton ordering and has been used for load balancing purposes [7,25,26,33,36,40].…”

Section: Algorithm 6: Non-linear Array Layout (mentioning)

confidence: 99%

Cache-efficient matrix transposition

Chatterjee

Sen

Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550)


We investigate the memory system performance of several algorithms for transposing an N × N matrix in-place, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effect of various facets of cache memory architecture that guide the choice of a particular algorithm, and attempt to experimentally validate the predictions of the model. Our major conclusions are as follows: limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high; low-level performance tuning "hacks", such as register tiling and array alignment, can significantly distort the effects of improved algorithms; and hierarchical nonlinear layouts are inherently superior to the standard canonical layouts (such as row- or column-major) for this problem.

This work is supported in part by DARPA Grant DABT63-98-1-0001, NSF Grants CDA-97-2637 and CDA-95-12356, The University of North Carolina at Chapel Hill, Duke University, and an equipment donation through Intel Corporation's Technology for Education 2000 Program. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.


“…Like traditional tiling techniques [41,75], cache oblivious algorithms for matrix multiply and LU factorization have been shown to asymptotically minimize data movement among various levels of the memory hierarchy, under certain cache modeling assumptions [83,33,1,30]. Unlike tiling, cache-oblivious algorithms do not make explicit reference to a "tile size" tuning parameter, and thus appear to eliminate the need to search for optimal cache tile sizes either by modeling or by empirical search.…”

Section: Dense and Sparse Linear Algebra (mentioning)

confidence: 99%
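The contrast drawn in this statement, between tiling with an explicit tile size and recursion with none, can be made concrete with a small sketch. The code below is an illustrative assumption rather than the algorithm of any cited paper: the names `matmul_recursive` and `matmul` are hypothetical, the matrices are square with a power-of-two order, and the recursion runs down to 1-by-1 blocks purely for clarity (a practical code would stop at a larger base block).

```python
def matmul_recursive(A, B, C, ai, aj, bi, bj, ci, cj, n):
    """Accumulate the n-by-n block product into C:
    C[ci.., cj..] += A[ai.., aj..] * B[bi.., bj..], with n a power of two."""
    if n == 1:
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    h = n // 2
    # Eight half-size products: one per (i, j, k) choice of quadrant offsets.
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                matmul_recursive(A, B, C,
                                 ai + i, aj + k,
                                 bi + k, bj + j,
                                 ci + i, cj + j, h)


def matmul(A, B):
    """Multiply two square matrices (lists of lists) of power-of-two order."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    matmul_recursive(A, B, C, 0, 0, 0, 0, 0, 0, n)
    return C
```

No tile-size parameter appears anywhere in the sketch: the quadrant recursion itself produces blocks of every size, so some level of the recursion fits each level of the memory hierarchy.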

Statistical Models for Empirical Search-Based Performance Tuning

Vuduc

Demmel

Bilmes

2004

The International Journal of High Performance Computing Applica


Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e., actually running the code). This paper presents quantitative data that motivates the development of such a search-based system, using dense matrix multiply as a case study. The statistical distributions of performance within spaces of reasonable implementations, when observed on a variety of hardware platforms, lead us to pose and address two general problems which arise during the search process. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Second, we show how to construct run-time decision rules, based on run-time inputs, for selecting from among a subset of the best implementations when the space of inputs can be described by continuously varying features. We address both problems by using statistical modeling techniques that exploit the large amount of performance data collected during the search. We demonstrate these methods on actual performance data collected by the PHiPAC tuning system for dense matrix multiply. We close with a survey of recent projects that use or otherwise advocate an empirical search-based approach to code generation and algorithm selection, whether at the level of computational kernels, compiler and run-time systems, or problem-solving environments. Collectively, these efforts suggest a number of possible software architectures for constructing platform-adapted libraries and applications.

