Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures 2007
DOI: 10.1145/1248377.1248394

An experimental comparison of cache-oblivious and cache-conscious programs

Abstract: Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm: each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level. In this way, d…
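To make the divide-and-conquer idea in the abstract concrete, here is a minimal sketch of a recursive matrix multiplication that halves its largest dimension until the sub-problem is small. It is an illustration only, not the code evaluated in the paper. The names `co_matmul`, `multiply`, and the `leaf` cutoff are hypothetical; `leaf` is assumed here purely to limit recursion overhead and is not derived from any cache size.

```python
def co_matmul(A, B, C, i0, i1, j0, j1, k0, k1, leaf=8):
    """Accumulate A[i0:i1, k0:k1] @ B[k0:k1, j0:j1] into C[i0:i1, j0:j1].

    Hypothetical helper: matrices are plain lists of lists.
    """
    di, dj, dk = i1 - i0, j1 - j0, k1 - k0

    # Base case: the sub-problem is small, so run a plain triple loop.
    if max(di, dj, dk) <= leaf:
        for i in range(i0, i1):
            for k in range(k0, k1):
                a = A[i][k]
                for j in range(j0, j1):
                    C[i][j] += a * B[k][j]
        return

    # Division step: split the largest dimension in half. No cache size
    # appears here; the working set simply shrinks at every level of
    # recursion until it fits in each level of the memory hierarchy.
    if di >= dj and di >= dk:
        mid = i0 + di // 2
        co_matmul(A, B, C, i0, mid, j0, j1, k0, k1, leaf)
        co_matmul(A, B, C, mid, i1, j0, j1, k0, k1, leaf)
    elif dj >= dk:
        mid = j0 + dj // 2
        co_matmul(A, B, C, i0, i1, j0, mid, k0, k1, leaf)
        co_matmul(A, B, C, i0, i1, mid, j1, k0, k1, leaf)
    else:
        mid = k0 + dk // 2
        co_matmul(A, B, C, i0, i1, j0, j1, k0, mid, leaf)
        co_matmul(A, B, C, i0, i1, j0, j1, mid, k1, leaf)


def multiply(A, B, leaf=8):
    """Return A @ B for an n x m matrix A and an m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    co_matmul(A, B, C, 0, n, 0, p, 0, m, leaf)
    return C
```

Calling `multiply(A, B)` on two lists of lists returns their product. A cache-conscious variant would instead choose explicit tile sizes per cache level, which is the kind of comparison the paper's title refers to.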

Cited by 66 publications (44 citation statements)
References: 27 publications
“…The experimental results in [34] report performance level of only about 35% of peak for Intel P4 Xeon which is significantly lower than what we obtain for the same machine (50-58%). We conjecture that our improved performance is partly due to our use of SSE2 instructions, especially since [34] obtained performance levels of 60-75% for SUN UltraSPARC IIIi, IBM Power 5 and Intel Itanium 2 using FMA instructions.…”
Section: Comparison of I-GEP and BLAS Routines
type: contrasting
confidence: 73%
“…The experimental results in [34] report performance level of only about 35% of peak for Intel P4 Xeon which is significantly lower than what we obtain for the same machine (50-58%). We conjecture that our improved performance is partly due to our use of SSE2 instructions, especially since [34] obtained performance levels of 60-75% for SUN UltraSPARC IIIi, IBM Power 5 and Intel Itanium 2 using FMA instructions. These latter results nicely complement our results for Intel P4 Xeon and AMD Opteron and further suggest that reasonable performance levels can be reached for square matrix multiplication on different architectures using relatively simple code that does not directly depend on cache parameters.…”
Section: Comparison of I-GEP and BLAS Routines
type: contrasting
confidence: 73%
“…Cache-oblivious algorithms can get good performance on a wide variety of platforms with relatively little programmer effort. Although most high-performance linear algebra libraries are hand-tuned or auto-tuned for specific architectures, there have been a few attempts to write competitive cache-oblivious libraries [32], [33].…”
Section: A Cache-oblivious Algorithms
type: mentioning
confidence: 99%
“…Yotov et al [30] describes Cache-oblivious algorithms which allow applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level.…”
Section: Related Work
type: mentioning
confidence: 99%