SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance

Juckeland, Guido; Brantley, William C.; Chandrasekaran, Sunita; Chapman, Barbara; Che, Shuai; Colgrove, Mathew E.; Feng, Huiyu; Grund, Alexander; Henschel, Robert; Hwu, Wen-mei W.; Li, Huian; Müller, Matthias S.; Nagel, Wolfgang E.; Perminov, Maxim; Shelepugin, Pavel; Skadron, Kevin; Stratton, John A.; Titov, A.; Wang, Ke; Waveren, G. Matthijs van; Whitney, Brian; Wienke, Sandra; Xu, Rengan; Kumaran, Kalyan

doi:10.1007/978-3-319-17248-4_3

Cited by 46 publications

(21 citation statements)

References 20 publications


“…First, we evaluate a set of microbenchmarks to measure the effect of each of our proposed optimizations in isolation and combination. Second, to get complete end-to-end performance numbers, we run workloads from SpecACCEL [31] and graph500 [52], and we show the performance across a range of fast memory oversubscription scenarios. Third, we sweep the design space to highlight the interesting behaviors that arise and to identify the configuration parameters that perform the best.…”

Section: Methods (mentioning)

confidence: 99%

Nimble Page Management for Tiered Memory Systems

Yan, Lustig, Nellans, et al. 2019

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Syst


Software-controlled heterogeneous memory systems have the potential to increase the performance and cost efficiency of computing systems. However, they can only deliver on this promise if supported by efficient page management policies and mechanisms within the operating system (OS). Current OS implementations do not support efficient tiering of data between heterogeneous memories. Instead, they rely on expensive offlining of memory or swapping data to disk as a means of profiling and migrating hot or cold data between memory nodes. They also leave numerous optimizations on the table; for example, multi-threaded hardware is not leveraged to maximize page migration throughput, resulting in up to 95% under-utilization of available memory bandwidth. To remedy these shortcomings, we propose and implement a general purpose OS-integrated multi-level memory management system that reuses current OS page tracking structures to tier pages directly between memories with no additional monitoring overhead. We augment this system with four additional optimizations: native support for transparent huge page migration, multi-threaded migration of a page, concurrent migration of multiple pages, and symmetric exchange of pages. Combined, these optimizations dramatically reduce kernel software overheads and improve raw page migration throughput over 15×. Implemented in Linux and evaluated on x86, Power, and ARM64 systems, our OS support for heterogeneous memories improves application performance 40% over baseline Linux for a suite of real-world memory-intensive workloads utilizing a multilevel disaggregated memory system.


“…We use all the 19 OpenCL benchmarks in SPECACCEL-v1.2 [28], with each benchmark including one or multiple kernels. For each benchmark, we use its test, train and ref inputs, and present the execution times of the whole program and OpenCL kernels.…”

Section: Discussion (mentioning)

confidence: 99%

“…• OpenCL-specific parameter-guided interprocedural analysis (IPA): We propose the IPA for analyzing the memory objects accessed in both the host and kernel codes, and find new optimization opportunities of static tiling for irregular accesses and using re-computation for saving SPM capacity. • We implement the bandwidth-aware loop tiling approach SWCL, and evaluate it using the SPECACCEL [28] benchmark suite. Experimental results demonstrate that it can bring significant performance improvement, i.e., up to 4x, with a geometric average of 26%.…”

Section: Introduction (mentioning)

confidence: 99%

Bandwidth-Aware Loop Tiling for DMA-Supported Scratchpad Memory

Liu, Cui, et al. 2020

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques


Scratchpad Memory (SPM) is widely used in emerging domain-specific architectures and accelerators for improving energy efficiency and time predictability. Typically, SPM-based architectures use DMA for fetching data from off-chip memory and global load instructions for loading fine-grained data directly into registers. For such architectures, neither capacity-only nor bandwidth-only loop tiling can efficiently use the bandwidth and SPM. This paper introduces a bandwidth-aware loop tiling approach that enables a tradeoff between SPM space utilization and bandwidth utilization to be made, by leveraging a runtime tiling framework and a cross-host-kernel IPA. Experimental results demonstrate that our approach can achieve the performance improvement of up to 4x, with a geometric average of 26%.


“…The Polybench [28] and SPEC ACCEL [13] OpenMP 4 benchmark suites are used to evaluate the efficacy of the coalescing-analysis-informed loop reshaping of OpenMP 4.x parallel loop nests. Execution times are reported for two experimental setup machines: an IBM POWER8 host with an Nvidia P100 GPU and an IBM POWER9 host with an Nvidia V100 GPU accelerator.…”

Section: Informed Loop Reshaping Performance Impact (mentioning)

confidence: 99%

Memory-access-aware Safety and Profitability Analysis for Transformation of Accelerator-bound OpenMP Loops

Chikin, Lloyd, Amaral, et al. 2019

ACM Trans. Archit. Code Optim.


Iteration Point Difference Analysis is a new static analysis framework that can be used to determine the memory coalescing characteristics of parallel loops that target GPU offloading and to ascertain safety and profitability of loop transformations with the goal of improving their memory access characteristics. This analysis can propagate definitions through control flow, works for non-affine expressions, and is capable of analyzing expressions that reference conditionally defined values. This analysis framework enables safe and profitable loop transformations. Experimental results demonstrate potential for dramatic performance improvements. GPU kernel execution time across the Polybench suite is improved by up to 25.5× on an Nvidia P100 with benchmark overall improvement of up to 3.2×. An opportunity detected in a SPEC ACCEL benchmark yields kernel speedup of 86.5× with a benchmark improvement of 3.3×. This work also demonstrates how architecture-aware compilers improve code portability and reduce programmer effort.
