A case study in mechanically deriving dense linear algebra code

Marker, Bryan; Batory, Don; Geijn, Robert van de

doi:10.1177/1094342013492178

Cited by 9 publications

(18 citation statements)

References 16 publications

“…Well designed interfaces enable a variety of algorithms to be implemented quickly for prototyping and deployment. Automation takes this a step forward by automating the implementation process and even the exploration of algorithmic options [8]. Such automation is not possible without good interface design.…”

Section: Lessons Learned (mentioning)

confidence: 99%

Interfaces are key

Marker

Geijn

Batory

2013

Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science A

Self Cite

“…For distributed-memory DLA, first-order cost estimates are sufficient [17,18,19] to enable an expert to judge trade offs between the cost of communicating data over a network and increasing parallelism that is enabled by that communication. Just as an expert estimates efficiency when manually coding, DxTer does so automatically by summing the estimated runtime of all nodes on a graph [19].…”

Section: Design By Transformation (mentioning)

confidence: 99%

“…We have automated the exploration of these spaces (by generating all implementations using a methodical process) and we evaluate the efficiency of each implementation via cost estimation. This is how we find the best-performing algorithm that experts would intuitively select [17,18,19]. In all tests, generated code is the same or better than experts' hand-produced implementations.…”

Section: Introduction (mentioning)

confidence: 99%

Understanding performance stairs

Marker

Batory

Geijn

2014

Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering

Self Cite

How do experts navigate the huge space of implementations for a given specification to find an efficient choice with minimal searching? Answer: They use "heuristics", rules of thumb that are more street wisdom than scientific fact. We provide a scientific justification for Dense Linear Algebra (DLA) heuristics by showing that only a few decisions (out of many possible) are critical to performance; once these decisions are made, the die is cast and only relatively minor performance improvements are possible. The (implementation × performance) space of DLA is stair-stepped. Each stair is a set of implementations that have very similar performance and (surprisingly) share key design decision(s). High-performance stairs align with heuristics that prescribe certain decisions in a particular context. Stairs also tell us how to tailor the search engine of a DLA code generator to reduce the time it needs to find implementations that are as good as or better than those crafted by experts.
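The citation statements above describe estimating an implementation's efficiency by summing first-order cost estimates over the nodes of a graph and then choosing the cheapest candidate. A minimal Python sketch of that idea follows. It is not DxTer's code: the `Node` fields, the `GAMMA` and `BETA` machine constants, and all function names are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One operation in an implementation graph (hypothetical structure)."""
    name: str
    flops: float        # floating-point operations performed by the node
    words_moved: float  # words communicated over the network by the node


# Assumed machine parameters, chosen only for this example.
GAMMA = 1e-9  # seconds per floating-point operation
BETA = 1e-8   # seconds per word communicated


def node_cost(node):
    """First-order runtime estimate: computation time plus communication time."""
    return GAMMA * node.flops + BETA * node.words_moved


def graph_cost(nodes):
    """Estimated runtime of an implementation: the sum over all its nodes."""
    return sum(node_cost(n) for n in nodes)


def best_implementation(candidates):
    """Return the name of the candidate with the lowest estimated runtime."""
    return min(candidates, key=lambda name: graph_cost(candidates[name]))


# Two hypothetical implementations of the same operation: one pays for a
# redistribution to enable more parallelism, the other avoids communication.
candidates = {
    "redistribute_then_parallel": [
        Node("redistribute", flops=0.0, words_moved=1e6),
        Node("parallel_gemm", flops=2e9, words_moved=0.0),
    ],
    "no_communication": [
        Node("local_gemm", flops=8e9, words_moved=0.0),
    ],
}
chosen = best_implementation(candidates)
```

With these assumed constants the extra communication is cheap relative to the parallelism it enables, so the first candidate has the lower estimate; different constants could reverse the choice, which is the trade-off the statements describe.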

“…The DxT project [Marker et al 2013] uses a cost model based on operation count and communication costs to estimate the performance of many possible implementations of distributed-memory dense linear algebra, by composing each algorithm mostly out of Level 3 BLAS subroutines. They use a similar style of search heuristics to narrow the space, focusing on transformations likely to be helpful.…”

Section: Related Work (mentioning)

confidence: 99%

Reliable Generation of High-Performance Matrix Algebra

Nelson

Belter

Siek

et al. 2015

ACM Trans. Math. Softw.

Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) to obtain portable high performance. However, many numerical algorithms require several BLAS calls in sequence, and those successive calls do not achieve optimal performance. The entire sequence needs to be optimized in concert. Instead of vendor-tuned BLAS, a programmer could start with source code in Fortran or C (e.g., based on the Netlib BLAS) and use a state-of-the-art optimizing compiler. However, our experiments show that optimizing compilers often attain only one-quarter of the performance of hand-optimized code. In this article, we present a domain-specific compiler for matrix kernels, the Build to Order BLAS (BTO), that reliably achieves high performance using a scalable search algorithm for choosing the best combination of loop fusion, array contraction, and multithreading for data parallelism. The BTO compiler generates code that is between 16% slower and 39% faster than hand-optimized code.
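To illustrate why a sequence of BLAS-style calls can benefit from being optimized together, the following Python sketch contrasts two separate vector passes with a single fused loop that also removes the temporary array (loop fusion with array contraction). It is a hand-written illustration, not output of the BTO compiler; the function names and the chosen operation sequence are assumptions.

```python
def unfused(alpha, x, y, z):
    """Two separate passes, as with two successive library calls."""
    # Pass 1 (axpy-like): materialize the temporary vector t = alpha*x + y.
    t = [alpha * xi + yi for xi, yi in zip(x, y)]
    # Pass 2 (dot-like): read t back from memory to compute t . z.
    return sum(ti * zi for ti, zi in zip(t, z))


def fused(alpha, x, y, z):
    """One pass: the loops are fused and t is contracted to a scalar."""
    acc = 0.0
    for xi, yi, zi in zip(x, y, z):
        # Each element of the former temporary lives only in this expression.
        acc += (alpha * xi + yi) * zi
    return acc
```

Both functions compute the same value; the fused form traverses the data once and never stores the intermediate vector, which is the kind of saving that per-call tuning cannot capture.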
