2022
DOI: 10.1177/10943420221077107

Performance portability in a real world application: PHAST applied to Caffe

Abstract: This work covers the application of the PHAST Library, a hardware-agnostic programming library, to a real-world application: the Caffe framework. The original implementation of Caffe consists of two different versions of the source code: one to run on CPU platforms and another to run on GPUs. With PHAST, we aim to develop a single-source implementation capable of running efficiently on both CPU and GPU. In this paper, we start by carrying out a performance analysis of a basic Caffe implementation using PH…
Cited by 2 publications (1 citation statement)
References: 28 publications (47 reference statements)
“…Where possible, nowadays the trend is to use GPUs, FPGAs, NPUs, and other ad-hoc accelerators for seeking higher performance/efficiency than CPUs [15]. GPUs' massively parallel hardware has been successfully employed in the im2col+gemm convolution implementation, but also in direct convolution [16], [17] recently. State-of-the-art convolutional accelerators (e.g., [18], [19]) use specific dataflow structures that can be seen as portions of direct convolution algorithm mapped in hardware, since tensors are processed spatially without applying any transformations.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
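
As an aside on the two convolution schemes contrasted in this citing statement, the short NumPy sketch below illustrates how the im2col+GEMM formulation turns convolution into a single matrix multiplication, while direct convolution processes the tensors spatially without any data transformation. It is not taken from the cited works: the stride-1, no-padding setting and the helper names im2col, conv_im2col_gemm, and conv_direct are illustrative assumptions.

import numpy as np

def im2col(x, kh, kw):
    # Unfold a (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix so that
    # convolution becomes one GEMM (stride 1, no padding assumed).
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                cols[idx] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                idx += 1
    return cols

def conv_im2col_gemm(x, weights):
    # im2col + GEMM convolution; weights has shape (K, C, kh, kw).
    k, _, kh, kw = weights.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    cols = im2col(x, kh, kw)            # (C*kh*kw, out_h*out_w)
    w_mat = weights.reshape(k, -1)      # (K, C*kh*kw)
    return (w_mat @ cols).reshape(k, out_h, out_w)

def conv_direct(x, weights):
    # Direct convolution: loop over output positions, no layout transformation.
    k, _, kh, kw = weights.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((k, out_h, out_w), dtype=x.dtype)
    for ko in range(k):
        for i in range(out_h):
            for j in range(out_w):
                out[ko, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * weights[ko])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 8, 8))      # (C, H, W) input
    w = rng.standard_normal((4, 3, 3, 3))   # (K, C, kh, kw) filters
    assert np.allclose(conv_im2col_gemm(x, w), conv_direct(x, w))

Both functions compute the same cross-correlation; the im2col variant trades extra memory for a single large matrix product, which is the formulation commonly mapped onto GPU GEMM routines, whereas the direct variant is the loop structure that dataflow accelerators implement spatially.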