Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT 2016)
DOI: 10.1145/2967938.2967944
Bridging the Semantic Gaps of GPU Acceleration for Scale-out CNN-based Big Data Processing

Abstract: Convolutional Neural Networks (CNNs) have substantially advanced the state-of-the-art accuracy of object recognition, which is the core function of a myriad of modern multimedia processing techniques such as image/video processing, speech recognition, and natural language processing. GPU-based accelerators have gained increasing attention because the large number of highly parallel neurons in a CNN naturally matches the GPU computation pattern. In this work, we perform comprehensive experiments to investigate the per…

Cited by 17 publications (5 citation statements)
References 30 publications
“…Although CPUs have been proposed to accelerate CNNs by relying on multicore parallelism and SIMD instructions [14], [15], the number and complexity of the layers in modern CNN models make it very difficult to run the entire network on CPUs. To improve inference throughput, (fast) GPU solutions have been proposed to process a large amount of data [16], [17]. Field Programmable Gate Arrays (FPGAs), on the other hand, have been extensively used as an alternative to this problem as they offer good performance and reconfigurability [18]- [22].…”
Section: The NMP Architecture (mentioning)
confidence: 99%
“…It collects the whole network information, including the connection relationships of each layer, layer type, input tensor dimensions, whether the parameters of each layer need to be updated, and the memory footprint (through automatic inference), as well as GPU information such as the number of cores, registers, shared memory capacity, and peak FLOPS. Note that we use the same methods from [10], [17] to estimate the calculation time of each layer and each sub-task, and revise it by profiling a stand-alone run of one iteration, since CNN training has been shown to exhibit repetitive computation and predictability [34].…”
Section: System Overview (mentioning)
confidence: 99%
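The estimate-then-revise approach quoted above — predict per-layer cost, then correct the prediction by profiling a single stand-alone iteration, relying on the repetitive, predictable nature of CNN training — can be sketched as follows. This is a minimal illustration, not the cited system's code: the layer functions and names are hypothetical stand-ins for real CNN layers.

```python
import time

# Hypothetical stand-in "layers": each is a callable taking and returning data.
# In the cited system these would be real CNN layers (conv, pool, fc); simple
# arithmetic stands in here so the sketch is self-contained.
def conv_layer(x):
    return [v * 0.5 + 1.0 for v in x]

def pool_layer(x):
    return x[::2]

def fc_layer(x):
    return [sum(x)] * 4

def profile_one_iteration(layers, sample):
    """Time a single stand-alone forward pass, layer by layer.

    Because CNN training repeats near-identical work every iteration,
    one profiled iteration serves as a usable per-layer cost estimate
    for all subsequent iterations (the predictability noted above).
    """
    estimates = {}
    data = sample
    for name, layer in layers:
        start = time.perf_counter()
        data = layer(data)
        estimates[name] = time.perf_counter() - start
    return estimates

layers = [("conv1", conv_layer), ("pool1", pool_layer), ("fc1", fc_layer)]
est = profile_one_iteration(layers, list(range(10000)))
for name, seconds in est.items():
    print(f"{name}: {seconds:.6f}s")
```

A scheduler would then use these per-layer estimates to partition sub-tasks, re-profiling only if the measured iteration diverges from the estimate.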
“…vDNN [52] is a runtime memory manager that handles memory allocation and movement between CPU and GPU memory for DNN workloads. Song et al. [57] characterized the performance of GPU acceleration systems for CNN applications. They proposed a tuned GPU acceleration framework to bridge the gap between the uneven computing loads of different CNN layers and the fixed provisioning of computing capacity.…”
Section: Related Work (mentioning)
confidence: 99%