Modeling Emerging Memory-Divergent GPU Applications

Wang, Lu; Jahre, Magnus; Adileh, Almutaz; Wang, Zhiying; Eeckhout, Lieven

doi:10.1109/lca.2019.2923618

Cited by 6 publications

(4 citation statements)

References 14 publications

“…Aimed at providing a deep insight into GPU performance, prior studies [25,50,51] have proposed GPU analytical models based on interval analysis, a well-known approach for accurately modeling CPU performance [14,28]. The key idea of modeling the performance with interval analysis is that a warp scheduler can sustain its maximum issue rate when no stall events occur.…”

Section: GPU Analytical Models (mentioning)

confidence: 99%

“…To quantify the importance of capturing the key core-side stall events, we examine how much impact the enhancements in modern GPU core microarchitectures have on the performance. We also analyze the impact of the enhancements on the modeling accuracy of MDM [50,51], the state-of-the-art GPU analytical model. In this experiment, we configure Accel-Sim cycle-level simulator to simulate the simplified GPU core assumed by MDM (i.e., no sub-cores, 32 lanes per functional unit and 32 L1 D$ banks, and non-sectored L1 D$s).…”

Section: Limitations (mentioning)

confidence: 99%

“…Based on GPUMech, Heo et al [23] proposed a GPU performance model that predicts the execution time of a single DNN layer. However, none of the proposed models, including the state-of-the-art MDM [50,51], considers the key intra-core stall cycles which GCoM focuses on, leading to high modeling errors with modern GPU designs.…”

Section: Related Work 6.1 GPU Analytical Modeling (mentioning)

confidence: 99%

GCoM

Lee et al. 2022

Proceedings of the 49th Annual International Symposium on Computer Architecture

Analytical models can greatly help computer architects perform orders of magnitude faster early-stage design space exploration than using cycle-level simulators. To facilitate rapid design space exploration for graphics processing units (GPUs), prior studies have proposed GPU analytical models which capture first-order stall events causing performance degradation; however, the existing analytical models cannot accurately model modern GPUs due to their outdated and highly abstract GPU core microarchitecture assumptions. Therefore, to accurately evaluate the performance of modern GPUs, we need a new GPU analytical model which accurately captures the stall events incurred by the significant changes in the core microarchitectures of modern GPUs. We propose GCoM, an accurate GPU analytical model which faithfully captures the key core-side stall events of modern GPUs. Through detailed microarchitecture-driven GPU core modeling, GCoM accurately models modern GPUs by revealing the following key core-side stalls overlooked by the existing GPU analytical models. First, GCoM identifies the compute structural stall events caused by the limited per-sub-core functional units. Second, GCoM exposes the memory structural stalls due to the limited banks and shared nature of per-core L1 data caches. Third, GCoM correctly predicts the memory data stalls induced by the sectored L1 data caches which split a cache line into a set of sectors sharing the same tag. Fourth, GCoM captures the idle stalls incurred by the inter- and intra-core load imbalances. Our experiments using an NVIDIA RTX 2060 configuration show that GCoM greatly improves the modeling accuracy by achieving a mean absolute error of 10.0% against Accel-Sim cycle-level simulator, whereas the state-of-the-art GPU analytical model achieves a mean absolute error of 44.9%.
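The interval-analysis idea quoted in the first citation statement (a warp scheduler sustains its maximum issue rate unless a stall occurs) and the four stall categories named in this abstract can be illustrated with a toy cycle-accounting sketch. This is not GCoM's or MDM's actual formulation: the names `StallCycles`, `predicted_cycles`, and `mean_absolute_error`, the additive treatment of stalls, and the per-kernel relative-error definition are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StallCycles:
    """Hypothetical per-category stall cycles, named after the abstract's four categories."""
    compute_structural: float = 0.0  # limited per-sub-core functional units
    memory_structural: float = 0.0   # limited banks / shared per-core L1 data cache
    memory_data: float = 0.0         # sectored L1 data cache misses
    idle: float = 0.0                # inter- and intra-core load imbalance

    def total(self) -> float:
        return (self.compute_structural + self.memory_structural
                + self.memory_data + self.idle)


def predicted_cycles(num_warp_instructions: int,
                     max_issue_rate: float,
                     stalls: StallCycles) -> float:
    """Toy interval model: stall-free cycles plus stall cycles.

    Assumes stalls add up without overlapping, which real models do not assume.
    """
    if max_issue_rate <= 0:
        raise ValueError("max_issue_rate must be positive")
    base = num_warp_instructions / max_issue_rate  # cycles at the maximum issue rate
    return base + stalls.total()


def mean_absolute_error(predicted: Sequence[float],
                        reference: Sequence[float]) -> float:
    """Mean of |predicted - reference| / reference over kernels (assumed definition)."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - r) / r for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors)
```

For example, `predicted_cycles(1000, 1.0, StallCycles(100.0, 50.0, 200.0, 25.0))` adds 375 stall cycles to 1000 stall-free cycles, and `mean_absolute_error` would compare such predictions against cycle counts from a reference simulator.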

“…While seeding is inherently a memory-bound algorithm, CPU implementations can only issue limited number of parallel memory requests and hence cannot saturate memory bandwidth (only uses 11.5% of peak bandwidth). Current GPUs are not well-suited because of significant memory divergence during tree traversal (Wang et al, 2019). To make better use of available memory bandwidth, we design a custom seeding accelerator and prototype it on an FPGA.…”

Section: FPGA Prototype (mentioning)

confidence: 99%

Accelerating Maximal-Exact-Match Seeding with Enumerated Radix Trees

Subramaniyan, Wadden, Goliya, et al. 2020

Preprint

Motivation: Read alignment is a time-consuming step in genome sequence analysis. In the read alignment software BWA-MEM and the recently published faster version BWA-MEM2, the seeding step is a major bottleneck, for instance, contributing 38% to the overall execution time in BWA-MEM2 when aligning single-end whole human genome reads from the Platinum Genomes dataset. This is because both BWA-MEM and BWA-MEM2 use a compressed index structure called the FMD-Index, which results in high memory bandwidth requirements for seeding, primarily due to its character-by-character processing of reads. Results: We propose a memory bandwidth-aware data structure for maximal-exact-match seeding called Enumerated Radix Tree (ERT). ERT trades off memory capacity to improve seeding performance (∼60 GB index for human genome). Together with optimizations to the seeding algorithm and mate-rescue step, ERT when integrated into BWA-MEM2 speeds up overall read alignment by 1.28× and provides up to 2.1× higher seeding performance while guaranteeing identical output to the original software. Furthermore, we prototype an FPGA implementation of ERT on Amazon EC2 F1 cloud and observe 1.6× higher seeding throughput over a 48-thread optimized CPU-ERT implementation. Availability and implementation: https://github.com/arun-sub/bwa-mem2
