An experimental comparison of cache-oblivious and cache-conscious programs

Yotov, Kamen; Roeder, Tom; Pingali, Keshav; Gunnels, John A.; Gustavson, Fred G.

doi:10.1145/1248377.1248394

Cited by 66 publications

(44 citation statements)

References 27 publications

“…The experimental results in [34] report performance level of only about 35% of peak for Intel P4 Xeon which is significantly lower than what we obtain for the same machine (50-58%). We conjecture that our improved performance is partly due to our use of SSE2 instructions, especially since [34] obtained performance levels of 60-75% for SUN UltraSPARC IIIi, IBM Power 5 and Intel Itanium 2 using FMA instructions. These latter results nicely complement our results for Intel P4 Xeon and AMD Opteron and further suggest that reasonable performance levels can be reached for square matrix multiplication on different architectures using relatively simple code that does not directly depend on cache parameters.…”

Section: Comparison of I-GEP and BLAS Routines (contrasting)

confidence: 73%

“…Recursive square matrix multiplication using an iterative base case similar to our implementations is studied in [34]. The experimental results in [34] report performance level of only about 35% of peak for Intel P4 Xeon which is significantly lower than what we obtain for the same machine (50-58%).…”

Section: Comparison of I-GEP and BLAS Routines (mentioning)

confidence: 76%

The Cache-Oblivious Gaussian Elimination Paradigm: Theoretical Framework, Parallelization and Experimental Evaluation

Chowdhury

Ramachandran

2010

Theory Comput Syst

We consider triply-nested loops of the type that occur in the standard Gaussian elimination algorithm, which we denote by GEP (or the Gaussian Elimination Paradigm). We present two related cache-oblivious methods I-GEP and C-GEP, both of which reduce the number of cache misses incurred (or I/Os performed) by the computation over that performed by standard GEP by a factor of √M, where M is the size of the cache. Cache-oblivious I-GEP computes in-place and solves most of the known applications of GEP including Gaussian elimination and LU-decomposition without pivoting and Floyd-Warshall all-pairs shortest paths. Cache-oblivious C-GEP uses a modest amount of additional space, but is completely general and applies to any code in GEP form. Both I-GEP and C-GEP produce system-independent cache-efficient code, and are potentially applicable to being used by optimizing compilers for loop transformation. We present parallel I-GEP and C-GEP that achieve good speed-up and match the sequential caching performance cache-obliviously for both shared and distributed caches for sufficiently large inputs. We present extensive experimental results for both in-core and out-of-core performance of our algorithms. We consider both sequential and parallel implementations, and compare them with finely-tuned cache-aware BLAS code for matrix multiplication and Gaussian elimination without pivoting. Our results indicate that cache-oblivious GEP offers an attractive trade-off between efficiency and portability. This work was supported in part by NSF Grant CCF-0514876 and NSF CISE Research Infrastructure Grant EIA-0303609. This journal submission incorporates results on the cache-oblivious paradigm that were presented in preliminary form in [8] and [9].
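The triply-nested loop shape that the abstract calls GEP can be sketched in Python as follows. This is only an illustration under stated assumptions: the update form `c[i][j] = f(c[i][j], c[i][k], c[k][j], c[k][k])`, the predicate `in_update_set`, and the function names `gep` and `floyd_warshall` are our own reading and naming, not taken from the abstract. The sketch shows the standard iterative loop (not the cache-oblivious I-GEP or C-GEP variants), instantiated for Floyd-Warshall all-pairs shortest paths, one of the applications the abstract lists.

```python
from typing import Callable, List

Matrix = List[List[float]]
INF = float("inf")


def gep(c: Matrix,
        f: Callable[[float, float, float, float], float],
        in_update_set: Callable[[int, int, int], bool]) -> Matrix:
    """Standard triply-nested GEP-style loop, updating c in place.

    ASSUMPTION: the update form and the index-triple predicate are an
    illustrative reading of "GEP form"; they are not quoted from the source.
    """
    n = len(c)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if in_update_set(i, j, k):
                    c[i][j] = f(c[i][j], c[i][k], c[k][j], c[k][k])
    return c


def floyd_warshall(dist: Matrix) -> Matrix:
    """All-pairs shortest paths as one instance of the loop above."""
    d = [row[:] for row in dist]  # copy so the caller's matrix is untouched
    return gep(d,
               lambda x, u, v, w: min(x, u + v),  # relax the path through k
               lambda i, j, k: True)              # update every (i, j, k)


if __name__ == "__main__":
    graph = [[0, 3, INF],
             [INF, 0, 1],
             [2, INF, 0]]
    print(floyd_warshall(graph))
```

Keeping the update function and the index predicate as parameters is what lets one loop nest cover several computations; a different `f` and a restricted predicate would be needed for, say, elimination without pivoting.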

“…Cache-oblivious algorithms can get good performance on a wide variety of platforms with relatively little programmer effort. Although most high-performance linear algebra libraries are hand-tuned or auto-tuned for specific architectures, there have been a few attempts to write competitive cache-oblivious libraries [32], [33].…”

Section: A Cache-oblivious Algorithms (mentioning)

confidence: 99%

Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication

Demmel

Eliahu

Fox

et al. 2013

2013 IEEE 27th International Symposium on Parallel and Distributed Processing

Abstract: Communication-optimal algorithms are known for square matrix multiplication. Here, we obtain the first communication-optimal algorithm for all dimensions of rectangular matrices. Combining the dimension-splitting technique of Frigo, Leiserson, Prokop and Ramachandran (1999) with the recursive BFS/DFS approach of Ballard, Demmel, Holtz, Lipshitz and Schwartz (2012) allows for a communication-optimal as well as cache- and network-oblivious algorithm. Moreover, the implementation is simple: approximately 50 lines of code for the shared-memory version. Since the new algorithm minimizes communication across the network, between NUMA domains, and between levels of cache, it performs well in practice on both shared- and distributed-memory machines. We show significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.
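To make the dimension-splitting idea concrete, here is a minimal sequential sketch in Python. It is not the authors' implementation: the function names `rec_matmul` and `matmul`, the `base` cutoff, and the rule "halve whichever of the three dimensions is largest" are assumptions made for illustration, and the parallel BFS/DFS scheduling mentioned in the abstract is not modeled at all. The `base` cutoff only limits recursion overhead and is not derived from any cache size.

```python
from typing import List

Matrix = List[List[float]]


def rec_matmul(A: Matrix, B: Matrix, C: Matrix,
               i0: int, i1: int, j0: int, j1: int, k0: int, k1: int,
               base: int = 8) -> None:
    """Accumulate A[i0:i1, k0:k1] * B[k0:k1, j0:j1] into C[i0:i1, j0:j1].

    ASSUMPTION: the split rule (halve the largest dimension) and the
    cutoff `base` (must be >= 1) are illustrative choices.
    """
    m, n, p = i1 - i0, j1 - j0, k1 - k0
    if max(m, n, p) <= base:
        # Iterative base case on a small block.
        for i in range(i0, i1):
            for k in range(k0, k1):
                a = A[i][k]
                for j in range(j0, j1):
                    C[i][j] += a * B[k][j]
        return
    if m >= n and m >= p:      # split the rows of A and C
        mid = i0 + m // 2
        rec_matmul(A, B, C, i0, mid, j0, j1, k0, k1, base)
        rec_matmul(A, B, C, mid, i1, j0, j1, k0, k1, base)
    elif n >= p:               # split the columns of B and C
        mid = j0 + n // 2
        rec_matmul(A, B, C, i0, i1, j0, mid, k0, k1, base)
        rec_matmul(A, B, C, i0, i1, mid, j1, k0, k1, base)
    else:                      # split the shared inner dimension
        mid = k0 + p // 2
        rec_matmul(A, B, C, i0, i1, j0, j1, k0, mid, base)
        rec_matmul(A, B, C, i0, i1, j0, j1, mid, k1, base)


def matmul(A: Matrix, B: Matrix, base: int = 8) -> Matrix:
    """Return A * B for rectangular list-of-lists matrices."""
    m, p = len(A), len(B)
    n = len(B[0]) if B else 0
    C = [[0] * n for _ in range(m)]
    rec_matmul(A, B, C, 0, m, 0, n, 0, p, base)
    return C


if __name__ == "__main__":
    print(matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]], base=1))
```

Because no cache parameter appears anywhere in the recursion, the same code runs unchanged on any machine; whether it is competitive with tuned libraries is an empirical question of the kind the citation statements on this page discuss.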

“…Yotov et al. [30] describe cache-oblivious algorithms, which allow applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm: each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level.…”

Section: Related Work (mentioning)

confidence: 99%

Parallel Two-Sided Matrix Reduction to Band Bidiagonal Form on Multicore Architectures

Ltaief

Kurzak

Dongarra

2010

IEEE Trans. Parallel Distrib. Syst.

Abstract. The objective of this paper is to extend, in the context of multicore architectures, the concepts of tile algorithms [Buttari et al., 2007] for Cholesky, LU, QR factorizations to the family of two-sided factorizations. In particular, the bidiagonal reduction of a general, dense matrix is very often used as a pre-processing step for calculating the Singular Value Decomposition. Furthermore, in the Top500 list of June 2008, 98% of the fastest parallel systems in the world were based on multicores. This confronts the scientific software community with both a daunting challenge and a unique opportunity. The challenge arises from the disturbing mismatch between the design of systems based on this new chip architecture (hundreds of thousands of nodes, a million or more cores, reduced bandwidth and memory available to cores) and the components of the traditional software stack, such as numerical libraries, on which scientific applications have relied for their accuracy and performance. The manycore trend has exacerbated the problem even further, and it becomes critical to efficiently integrate existing or new numerical linear algebra algorithms suitable for such hardware. By exploiting the concept of tile algorithms in the multicore environment (i.e., high level of parallelism with fine granularity and high performance data representation combined with a dynamic data driven execution), the band bidiagonal reduction presented here achieves 94 Gflop/s on a 12000 × 12000 matrix with 16 Intel Tigerton 2.4 GHz processors. The main drawback of the tile algorithms approach for the bidiagonal reduction is that the full reduction cannot be obtained in one stage. Other methods have to be considered to further reduce the band matrix to the required form.
