2019
DOI: 10.1007/978-3-030-30709-7_26

Efficient Processing of Convolutional Neural Networks on SW26010

Abstract: Artificial intelligence has developed rapidly in recent years, and deep neural networks are the basis of many artificial intelligence applications, so accelerating their computation is important. To explore the potential for accelerating deep neural network processing on various hardware platforms, we propose a convolutional neural network optimization method based on the weight-stationary dataflow for the SW26010 processor. We reorder the convolution loops and use hybrid DMA transmi…
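As a rough illustration of the weight-stationary idea in the abstract: the filter weights stay resident in fast local memory (on SW26010, the LDM scratchpad of a compute processing element) and are reused across every output position, while input and output tiles are streamed through. The sketch below is a hypothetical plain-C rendering of that loop order, not the paper's kernel; the names and the single-output-channel simplification are assumptions.

```c
/* Illustrative weight-stationary direct convolution (single output channel,
 * unit stride, no padding). The key point: the C x K x K weight block is
 * loaded into fast local memory once and reused for every output pixel,
 * so only input/output tiles move per iteration. Layout is an assumption. */
void conv2d_weight_stationary(const float *in,   /* [C][H][W]      */
                              const float *w,    /* [C][K][K]      */
                              float *out,        /* [H-K+1][W-K+1] */
                              int C, int H, int W, int K)
{
    int OH = H - K + 1, OW = W - K + 1;
    for (int oh = 0; oh < OH; ++oh) {
        for (int ow = 0; ow < OW; ++ow) {
            float acc = 0.0f;
            /* weights stay resident; inputs are re-read per output pixel */
            for (int c = 0; c < C; ++c)
                for (int kh = 0; kh < K; ++kh)
                    for (int kw = 0; kw < K; ++kw)
                        acc += in[(c * H + oh + kh) * W + (ow + kw)]
                             * w[(c * K + kh) * K + kw];
            out[oh * OW + ow] = acc;
        }
    }
}
```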

Cited by 2 publications (8 citation statements)
References 5 publications

Citation statements:
“…To obtain higher peak performance, modern processors not only integrate more and more cores, but also widely apply superscalar technology [29]. Many studies [11], [13], [30], [31] have focused on instruction-level optimization methods based on superscalar technology. The same is true of the SW26010 processor, which has two pipelines (P0 and P1) on one CPE.…”
Section: Register Blocking
confidence: 99%
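The register-blocking context of that statement: with two issue pipelines (P0 and P1) per CPE, a loop needs several independent dependence chains in flight to keep both pipes fed. Below is a minimal hypothetical sketch on a plain dot product; the 4-way unroll factor and all names are illustrative assumptions, not the cited papers' code.

```c
/* Register-blocked dot product: four independent accumulators break the
 * single dependence chain so a dual-pipeline superscalar core can issue
 * two floating-point operations per cycle. Assumes n is a multiple of 4. */
float dot_blocked(const float *a, const float *b, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```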
“…The execution time of CNNs becomes long and unacceptable as larger data sets and more complex CNNs emerge. Because convolution accounts for more than 90% of the total computation in CNNs [2], highly efficient convolution algorithms on many-core processors have become a popular research direction in academia and industry.…”
confidence: 99%
“…ShuffleFaceNet [11] is adapted from the efficient network ShuffleNetV2 [12]; similar to MobileFaceNet, it uses global depth-wise convolution to output the facial feature vector. Based on the variable group convolution proposed in VarGNet [13], VarGFaceNet [14] is a compact yet highly accurate FR model. Using recursive knowledge distillation, VarGFaceNet [14] achieved 99.85% accuracy on the LFW benchmark and obtained the best results in the ICCV 2019 LFR challenge DeepGlint-Light track [7].…”
Section: Introduction
confidence: 99%
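For context on the global depth-wise convolution mentioned above: it is a depth-wise convolution whose kernel size equals the spatial size of the feature map, so each channel is reduced to a single learned-weighted scalar (a trainable alternative to global average pooling). A minimal sketch under an assumed [C][H][W] layout; the names are illustrative, not code from the cited models.

```c
/* Global depth-wise convolution: one H x W filter per channel, with the
 * kernel covering the whole feature map, so each channel reduces to one
 * scalar. in: [C][H][W], w: [C][H][W], out: [C]. */
void global_depthwise_conv(const float *in, const float *w,
                           float *out, int C, int H, int W)
{
    for (int c = 0; c < C; ++c) {
        float acc = 0.0f;
        for (int i = 0; i < H * W; ++i)
            acc += in[c * H * W + i] * w[c * H * W + i];
        out[c] = acc;
    }
}
```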