Proceedings of the 33rd Annual Southeast Regional Conference (ACM-SE 33), 1995
DOI: 10.1145/1122018.1122054
Automatic benchmark generation for cache optimization of matrix operations

Abstract: Computationally intensive algorithms must usually be restructured to make the best use of cache memory in current high-performance, hierarchical-memory computers. Unfortunately, cache-conscious algorithms are sensitive to object sizes and addresses as well as to the details of the cache and translation lookaside buffer geometries, and this sensitivity makes both automatic restructuring and hand-tuning difficult tasks. An optimization approach is presented in this paper that automatically generates and executes a …

Cited by 12 publications (14 citation statements) | References 5 publications
“…Two-dimensional tensor transposition (i.e., matrix transposition) is a well-studied operation, including optimizations for blocking, vectorization, unrolling, and software prefetching [3,6,11,13,14,25]. The same optimizations are investigated in the context of three-dimensional out-of-place tensor transpositions on CPUs [10,22].…”
Section: Related Work
mentioning confidence: 99%
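The statement above names the standard toolbox for transpositions on CPUs: blocking, vectorization, unrolling, and software prefetching. As a purely illustrative sketch of the blocking idea only (in C, with a hypothetical tile size BT that is not taken from any of the cited papers and would in practice be tuned to the cache and TLB geometry discussed in the abstract above):

```c
#include <stddef.h>

/* Illustrative cache-blocked, out-of-place matrix transpose (row-major).
 * BT is a hypothetical tile size: it should be chosen so that a BT x BT
 * tile of both the source and the destination fits in cache, which is
 * exactly the machine-dependent choice the cited optimizations tune. */
#define BT 32

void transpose_blocked(const double *a, double *b, size_t rows, size_t cols)
{
    for (size_t ii = 0; ii < rows; ii += BT) {
        for (size_t jj = 0; jj < cols; jj += BT) {
            size_t imax = (ii + BT < rows) ? ii + BT : rows;
            size_t jmax = (jj + BT < cols) ? jj + BT : cols;
            /* Within a tile, reads of a are unit-stride and writes to b
             * stay inside a small region that remains cache-resident. */
            for (size_t i = ii; i < imax; ++i)
                for (size_t j = jj; j < jmax; ++j)
                    b[j * rows + i] = a[i * cols + j];
        }
    }
}
```

Vectorization and unrolling would then be applied inside the tile loops; the tile size itself is the kind of parameter that a benchmark-driven search, such as the one described in this paper, can select empirically.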
“…We use PinPoints [54,70] to collect workload traces. We use 32 benchmarks from STREAM [56,63], SPEC CPU2006 [90], TPC [93], and GUPS [26], each of which is used for a single-core workload. We construct 32 each of two-, four-, and eight-core workloads, for a total of 96 multi-core workloads (randomly selected from the 32 benchmarks).…”
Section: DRAM Latency and Performance Analysis
mentioning confidence: 99%
“…To put these results into perspective, a recent study on random tensor permutations by Lyakh [10] presents results between 30 and 55 GB/s, and between 10 and 33 GB/s, for an Intel KNC system and an NVIDIA K20X system, respectively. Since it is not known which exact permutations are considered or how the measurement is performed, these results should not be understood as a one-to-one comparison; however, they give an idea of the potential of TTC. Panels 5b (HSW) and 5e (M840) suggest that TTC's heuristics work so well that the search could almost be avoided.…”
Section: Performance Evaluation
mentioning confidence: 99%
“…Due to the non-contiguous memory access patterns and the vast number of architecture-specific optimizations required by modern vector processors (e.g., vectorization, blocking for caches, non-uniform memory access (NUMA)), writing high-performance tensor transpositions is a challenging task. Until now, many research efforts have focused on 2D [5,9,11,12,18] and 3D transpositions [8,15], while higher-dimensional transpositions [10,19] remain largely unexplored.…”
Section: Introduction
mentioning confidence: 99%
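To make the "non-contiguous memory access" point concrete, here is a minimal, purely illustrative sketch (in C, assuming a row-major layout and one fixed permutation chosen for the example; it is not code from any of the cited papers):

```c
#include <stddef.h>

/* Naive out-of-place 3D tensor transposition for the permutation (2,1,0):
 * B[k][j][i] = A[i][j][k], both tensors stored row-major.
 * A has shape (d0, d1, d2); B has shape (d2, d1, d0). */
void transpose_3d_210(const float *a, float *b,
                      size_t d0, size_t d1, size_t d2)
{
    for (size_t i = 0; i < d0; ++i)
        for (size_t j = 0; j < d1; ++j)
            for (size_t k = 0; k < d2; ++k)
                /* Reads of a are unit-stride in k, but writes to b jump
                 * by d1*d0 elements per k step; no loop order makes both
                 * sides contiguous, which is why blocking, vectorization,
                 * and NUMA-aware placement are needed for performance. */
                b[(k * d1 + j) * d0 + i] = a[(i * d1 + j) * d2 + k];
}
```

For higher-dimensional tensors the same structure holds, with the index arithmetic generalized to arbitrary permutations and strides, which is what makes hand-tuning each case impractical.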