2021
DOI: 10.1145/3460776

Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks

Abstract: The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. The biggest advantage of the systolic array architecture is its simple and efficient design principle. Without complicated control and dataflow, hardware accelerators with the systolic array can calculate traditional convolution very efficiently. However, this advantage also brings new challenges to the systolic array. When computing special types of convolution, such as the small-scale co…
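As a rough illustration of the principle described in the abstract, the sketch below models an output-stationary systolic array computing a matrix multiplication in plain Python. It is a software analogy written for exposition only, not the architecture proposed in the paper; the function name systolic_matmul and the input-skewing scheme are assumptions made for the example.

import numpy as np

# Toy software model of an output-stationary systolic array computing C = A @ B.
# Each processing element PE(i, j) owns one accumulator and, on every cycle,
# multiplies the operands that the input skewing delivers to its row and column
# and adds the product locally. Illustrative only; not the paper's design.
def systolic_matmul(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))            # one local accumulator per PE
    cycles = n + m + k - 2            # cycles needed once inputs are skewed
    for t in range(cycles):
        for i in range(n):            # PE row
            for j in range(m):        # PE column
                s = t - i - j         # reduction index reaching PE(i, j) at time t
                if 0 <= s < k:
                    acc[i, j] += A[i, s] * B[s, j]
    return acc

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)

Mapping a convolution onto the same grid is what lets the accelerator reuse this simple, regular schedule for CNN layers.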

Cited by 24 publications (3 citation statements)
References 28 publications
“…In order to improve the efficiency of CNNs, different hardware paradigms have been developed with the aim of massively reducing the volume of information movement. Numerous dataflow architectures have been proposed [8, 9, 10, 11, 12, 13, 14], whereby an array of processing elements (PEs), each containing their own limited memory called a register file, minimise the flow of data by storing intermediate results locally as well as operating on data received from their neighbouring PEs. Similarly, tensor processing units have been proposed which accelerate matrix multiplication through the cascading of multiplication and sum over a systolic array of PEs [15].…”
Section: Introduction (mentioning)
confidence: 99%
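The dataflow idea in the statement above, where each PE keeps intermediate results in a small local register file and exchanges operands only with its immediate neighbours, can be sketched with a toy PE model. The class and attribute names below are invented for the example and are not taken from the cited works.

# One processing element of a dataflow-style array: it stores its partial sum
# locally, consumes operands arriving from its west and north neighbours, and
# forwards them east and south, so data only ever moves between adjacent PEs.
class PE:
    def __init__(self):
        self.acc = 0.0           # partial sum held in the PE's local register file
        self.east_out = None     # operand passed on to the right-hand neighbour
        self.south_out = None    # operand passed on to the neighbour below

    def step(self, west_in, north_in):
        if west_in is not None and north_in is not None:
            self.acc += west_in * north_in    # local multiply-accumulate
        self.east_out, self.south_out = west_in, north_in

A grid of such PEs, fed with suitably skewed rows of one matrix and columns of the other, carries out the cascaded multiply-and-sum that the statement attributes to systolic tensor processing units.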
“…To increase applicability, flexibility, and PE utilization, configurable SAs can support various dataflow types [8], [9], [7]. (This work was supported by a research grant of Siemens EDA to Democritus University of Thrace for "HLS Research for Systems-on-Chip".)…”
Section: Introduction (mentioning)
confidence: 99%
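To make the notion of configurable dataflow concrete, the sketch below schedules the same multiply-accumulate work in two ways, depending on which operand is treated as stationary inside the array. It is a behavioural sketch only; the function name, the dataflow strings, and the loop orderings are assumptions for illustration, not the configuration mechanism of the paper or of the cited works.

import numpy as np

def gemm_with_dataflow(A, B, dataflow="weight_stationary"):
    # Behavioural model: both branches compute C = A @ B, but the loop structure
    # mirrors which operand would stay resident in the PE array.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    if dataflow == "weight_stationary":
        # The weights B stay pinned in the array; rows of A stream through it.
        for i in range(n):
            C[i, :] = A[i, :] @ B
    elif dataflow == "output_stationary":
        # Each PE owns one output element and accumulates it across the k dimension.
        for i in range(n):
            for j in range(m):
                C[i, j] = np.dot(A[i, :], B[:, j])
    else:
        raise ValueError(f"unsupported dataflow: {dataflow}")
    return C

A = np.random.rand(4, 8)
B = np.random.rand(8, 5)
assert np.allclose(gemm_with_dataflow(A, B, "weight_stationary"),
                   gemm_with_dataflow(A, B, "output_stationary"))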
“…Unlike the state-of-the-art works, where the standard algorithm for convolution is modified to improve the performance, the proposed convolver can receive raw feature maps and weights to deliver an output whose precision is limited only by the number of bits in the data representation. In related works, the convolution operation with systolic arrays is usually conducted as a matrix multiplication [42], which requires a previous processing stage and irregular data memory access [10]. Tile-based data processing is often used too, but it implies a data processing stage after the convolution computation to compensate for the quantization error caused by not being able to apply the filters at the edges of the tiles [37].…”
Section: Introduction (mentioning)
confidence: 99%
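The "previous processing stage" mentioned in the statement above is typically an im2col lowering, which unfolds feature-map patches so that the convolution becomes the matrix multiplication a plain systolic array consumes. The sketch below shows that lowering for a single-channel map with stride 1 and no padding; the function names and shapes are illustrative assumptions, not taken from the cited papers.

import numpy as np

def im2col(x, kh, kw):
    # Unfold every kh x kw patch of a 2-D feature map into one column of a matrix.
    # Gathering these patches is the irregular memory-access pattern noted above.
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_as_matmul(x, w):
    # Convolution expressed as the GEMM that a systolic array executes directly.
    kh, kw = w.shape
    out = w.ravel() @ im2col(x, kh, kw)
    return out.reshape(x.shape[0] - kh + 1, x.shape[1] - kw + 1)

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3))
expected = [[x[i:i + 3, j:j + 3].sum() for j in range(3)] for i in range(3)]
assert np.allclose(conv2d_as_matmul(x, w), expected)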