Proceedings of the 33rd Annual Southeast Regional Conference (ACM-SE 33), 1995
DOI: 10.1145/1122018.1122054
Automatic benchmark generation for cache optimization of matrix operations

Abstract: Computationally intensive algorithms must usually be restructured to make the best use of cache memory in current high-performance, hierarchical-memory computers. Unfortunately, cache-conscious algorithms are sensitive to object sizes and addresses as well as to the details of the cache and translation lookaside buffer geometries, and this sensitivity makes both automatic restructuring and hand-tuning difficult tasks. An optimization approach is presented in this paper that automatically generates and executes a …

Cited by 12 publications (14 citation statements) | References 5 publications
“…Two-dimensional tensor transposition (i.e., matrix transposition) is a well-studied operation, including optimizations for blocking, vectorization, unrolling, and software prefetching [3,6,11,13,14,25]. The same optimizations are investigated in the context of three-dimensional out-of-place tensor transpositions on CPUs [10,22].…”
Section: Related Work
mentioning confidence: 99%
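The statement above names the standard toolbox for transpositions on CPUs: blocking, vectorization, unrolling, and software prefetching. As a purely illustrative sketch of the blocking idea only (in C, with a hypothetical tile size BT that is not taken from any of the cited papers and would in practice be tuned to the cache and TLB geometry discussed in the abstract above):

```c
#include <stddef.h>

/* Illustrative cache-blocked, out-of-place matrix transpose (row-major).
 * BT is a hypothetical tile size: it should be chosen so that a BT x BT
 * tile of both the source and the destination fits in cache, which is
 * exactly the machine-dependent choice the cited optimizations tune. */
#define BT 32

void transpose_blocked(const double *a, double *b, size_t rows, size_t cols)
{
    for (size_t ii = 0; ii < rows; ii += BT) {
        for (size_t jj = 0; jj < cols; jj += BT) {
            size_t imax = (ii + BT < rows) ? ii + BT : rows;
            size_t jmax = (jj + BT < cols) ? jj + BT : cols;
            /* Within a tile, reads of a are unit-stride and writes to b
             * stay inside a small region that remains cache-resident. */
            for (size_t i = ii; i < imax; ++i)
                for (size_t j = jj; j < jmax; ++j)
                    b[j * rows + i] = a[i * cols + j];
        }
    }
}
```

Vectorization and unrolling would then be applied inside the tile loops; the tile size itself is the kind of parameter that a benchmark-driven search, such as the one described in this paper, can select empirically.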
“…We use PinPoints [54,70] to collect workload traces. We use 32 benchmarks from STREAM [56,63], SPEC CPU2006 [90], TPC [93], and GUPS [26], each of which is used for a single-core workload. We construct 32 each of two-, four-, and eight-core workloads, for a total of 96 multi-core workloads (randomly selected from the 32 benchmarks).…”
Section: DRAM Latency and Performance Analysis
mentioning confidence: 99%
“…To put these results into perspective, a recent study on random tensor permutations by Lyakh [10] presents results between 30 and 55 GB/s, and between 10 and 33 GB/s, for an Intel KNC system and an NVIDIA K20X system, respectively. Since it is not known which exact permutations are considered or how the measurement is performed, these results should not be understood as a one-to-one comparison; however, they give an idea of the potential of TTC. Panels 5b (HSW) and 5e (M840) suggest that TTC's heuristics work so well that the search could almost be avoided.…”
Section: Performance Evaluation
mentioning confidence: 99%
“…Due to the non-contiguous memory access patterns and the vast number of architecture-specific optimizations required by modern vector processors (e.g., vectorization, blocking for caches, non-uniform memory access (NUMA)), writing high-performance tensor transpositions is a challenging task. Until now, many research efforts have focused on 2D [5,9,11,12,18] and 3D transpositions [8,15], while higher-dimensional transpositions [10,19] remain largely unexplored.…”
Section: Introduction
mentioning confidence: 99%
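To make the "non-contiguous memory access" point concrete, here is a minimal, purely illustrative sketch (in C, assuming a row-major layout and one fixed permutation chosen for the example; it is not code from any of the cited papers):

```c
#include <stddef.h>

/* Naive out-of-place 3D tensor transposition for the permutation (2,1,0):
 * B[k][j][i] = A[i][j][k], both tensors stored row-major.
 * A has shape (d0, d1, d2); B has shape (d2, d1, d0). */
void transpose_3d_210(const float *a, float *b,
                      size_t d0, size_t d1, size_t d2)
{
    for (size_t i = 0; i < d0; ++i)
        for (size_t j = 0; j < d1; ++j)
            for (size_t k = 0; k < d2; ++k)
                /* Reads of a are unit-stride in k, but writes to b jump
                 * by d1*d0 elements per k step; no loop order makes both
                 * sides contiguous, which is why blocking, vectorization,
                 * and NUMA-aware placement are needed for performance. */
                b[(k * d1 + j) * d0 + i] = a[(i * d1 + j) * d2 + k];
}
```

For higher-dimensional tensors the same structure holds, with the index arithmetic generalized to arbitrary permutations and strides, which is what makes hand-tuning each case impractical.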