2014 IEEE 28th International Parallel and Distributed Processing Symposium
DOI: 10.1109/ipdps.2014.53
TBPoint: Reducing Simulation Time for Large-Scale GPGPU Kernels

Cited by 18 publications (10 citation statements)
References 12 publications
“…Simulating an entire GPU application on a cycle-level simulator [32], [33] is often impractical, and this is even more true for long-running SQNN training applications. To aid in the simulation of long-running applications, prior works have identified representative regions within applications and ported them to simulators for CPUs [4], [18], [34]-[36] and GPUs [37], [38].…”
Section: A. Enabling Network-Level Simulation for SQNNs
confidence: 99%
“…Existing solutions in the CPU space sample randomly [35], periodically [16], [17], or based on application phase behavior [18]. TBPoint [40] very recently proposes sampling-in-time for GPGPU workloads. Although TBPoint achieves high accuracy while simulating 10 to 20 percent of the total kernel execution time, sampling workloads with high control/memory divergence behavior remains challenging.…”
Section: Revisiting CPU Simulation Acceleration Techniques for GPGPU
confidence: 99%
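The statement above contrasts random and periodic sampling of execution intervals with phase-aware approaches. As a rough illustration of the baseline idea (not the TBPoint algorithm itself), the following is a minimal sketch: the function name `estimate_total_cycles` and the interval abstraction are hypothetical, and it simply scales the mean cost of a randomly sampled subset of intervals up to the full run.

```python
import random

def estimate_total_cycles(interval_cycles, sample_rate=0.1, seed=0):
    """Estimate the total cycle count of a long-running kernel by
    "simulating" only a random sample of its execution intervals
    (random sampling; a periodic scheme would take every k-th
    interval instead).  interval_cycles[i] stands in for the cost of
    detailed simulation of interval i."""
    rng = random.Random(seed)
    n = len(interval_cycles)
    k = max(1, int(n * sample_rate))
    sampled = rng.sample(range(n), k)
    # Extrapolate: mean cost over the sample, scaled to all intervals.
    mean_cycles = sum(interval_cycles[i] for i in sampled) / k
    return mean_cycles * n
```

As the quoted statement notes, such sampling struggles when behavior diverges heavily across the run: a small sample can miss rare but expensive phases, which is exactly what representative-region selection tries to address.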
“…Recently, Huang et al. accelerated GPGPU architecture simulation by sampling thread blocks [40] using TBPoint. Sampling thread blocks is a good idea, since CUDA encourages programmers to write programs with little communication between thread blocks.…”
Section: Related Work
confidence: 99%
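The observation above, that CUDA thread blocks rarely communicate, is what makes per-block sampling sound: blocks with similar execution profiles can stand in for one another. The sketch below illustrates that principle only; the function `pick_representative_blocks`, its feature format, and the exact-bucket grouping are assumptions for illustration (TBPoint proper selects representatives from richer per-block feature vectors).

```python
def pick_representative_blocks(block_features, ndigits=1):
    """Group thread blocks by their (rounded) per-block feature
    vectors, e.g. instruction-mix counts, and pick one representative
    block per group.  Returns (reps, weights), where weights[i] is the
    number of blocks that reps[i] stands for, so simulated results can
    be scaled back up to the whole kernel."""
    groups = {}
    for block_id, feats in enumerate(block_features):
        key = tuple(round(f, ndigits) for f in feats)
        groups.setdefault(key, []).append(block_id)
    reps = [members[0] for members in groups.values()]
    weights = [len(members) for members in groups.values()]
    return reps, weights
```

Only the representative blocks are then run through detailed simulation, and each result is weighted by how many blocks it represents.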
“…We also validated our approach on a simulator and on real hardware, whereas they validated only on a simulator. Huang et al. [2014] use a sampling technique to speed up GPU architecture simulation for CUDA applications, achieving up to 10× speedup, whereas we achieve up to 7,284× speedup. Similarly, Lee and Ro [2013] parallelize the GPU architecture simulation, gaining up to 4.15× speedup.…”
Section: Related Work
confidence: 99%