A high-throughput reconfigurable processing array for neural networks

Wu, Ephrem; Zhang, Xiaoqian; Berman, David B.; Cho, Inkeun

doi:10.23919/fpl.2017.8056794

Cited by 30 publications

(23 citation statements)

References 11 publications


“…But existing designs usually work at 100-400MHz [17,38,49,73,77]. As claimed in [69], the working frequency is limited by the routing between on-chip SRAM and DSP units. The design in [69] uses different working frequencies for DSP units and surrounding logic.…”

Section: Frequency Optimization

mentioning

confidence: 99%

“…As claimed in [69], the working frequency is limited by the routing between on-chip SRAM and DSP units. The design in [69] uses different working frequencies for DSP units and surrounding logic. Neighbor slices to each DSP unit are used as local RAMs to separate the clock domain.…”

Section: Frequency Optimization

mentioning

confidence: 99%

“…Neighbor slices to each DSP unit are used as local RAMs to separate the clock domain. The prototype design in [69] achieves the peak DSP working frequency at 741MHz and 891MHz on FPGA chips of different speed grades. Xilinx has also proposed the CHaiDNN-v2 [1] and xfDNN [2] with this technique and achieves up to 700MHz DSP working frequency.…”

Section: Frequency Optimization

mentioning

confidence: 99%


[DL] A Survey of FPGA-based Neural Network Inference Accelerators

Guo, Zeng, et al. 2019

ACM Trans. Reconfigurable Technol. Syst.



Recent researches on neural network have shown significant advantage in machine learning over traditional algorithms based on handcrafted features and models. Neural network is now widely adopted in regions like image, speech and video recognition. But the high computation and storage complexity of neural network inference poses great difficulty on its application. CPU platforms are hard to offer enough computation capacity. GPU platforms are the first choice for neural network process because of its high computation capacity and easy to use development frameworks. On the other hand, FPGA-based neural network inference accelerator is becoming a research topic. With specifically designed hardware, FPGA is the next possible solution to surpass GPU in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on neural network inference accelerators based on FPGA and summarize the main techniques used. An investigation from software to hardware, from circuit level to system level is carried out to complete analysis of FPGA-based neural network inference accelerator design and serves as a guide to future work.

But the computation and storage complexity of NN models are high. In Table 1, we list the number of operations, number of parameters (add or multiplication), and top-1 accuracy on ImageNet dataset [50] of state-of-the-art CNN models. Take CNN as an example. The largest CNN model for a 224 × 224 image classification requires up to 39 billion floating point operations (FLOP) and more than 500MB model parameters [56]. As the computation complexity is proportional to the input image size, processing images with higher resolutions may need more than 100 billion operations. Latest work like MobileNet [24] and ShuffleNet [79] are trying to reduce the network size with advanced network structures, but with obvious accuracy loss. The balance between the size of NN models and accuracy is still an open question today. In some cases, the large model size hinders the application of NN, especially in power limited or latency critical scenarios. Therefore, choosing a proper computation platform for neural-network-based applications is essential. A typical CPU can perform 10-100G FLOP per second, and the power efficiency is usually below 1GOP/J. So CPUs are hard to meet the high performance requirements in cloud applications nor the low power requirements in mobile applications. In contrast, GPUs offer up to 10TOP/s peak performance and are good choices for high performance neural network applications. Development frameworks like Caffe [26] and TensorFlow [4] also offer easy-to-use interfaces which makes GPU the first choice of neural network acceleration. Besides CPUs and GPUs, FPGAs are becoming a platform candidate to achieve energy efficient neural network processing. With a neural network oriented hardware design, FPGAs can implement high parallelism and make use of the pro...
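The platform comparison in the excerpt above can be made concrete with a rough calculation. The sketch below divides the quoted 39 billion operations by the quoted peak rates (10-100G FLOP per second for a CPU, 10TOP/s for a GPU) to get a best-case time per image. It assumes FLOP and OP counts are comparable and that each platform sustains its peak rate, which real workloads rarely do; the function and dictionary names are illustrative only.

```python
# Best-case time per image, using only figures quoted in the excerpt above.
# Assumption: peak rates are fully sustained and FLOP/OP are treated alike.

MODEL_OPS = 39e9  # operations for the largest 224 x 224 CNN cited

PEAK_OPS_PER_SECOND = {
    "CPU (low end)": 10e9,    # 10G FLOP per second
    "CPU (high end)": 100e9,  # 100G FLOP per second
    "GPU (peak)": 10e12,      # 10 TOP/s
}


def ideal_latency_seconds(ops, peak_rate):
    """Lower bound on processing time: work divided by peak throughput."""
    return ops / peak_rate


def ideal_latencies(ops=MODEL_OPS):
    """Map each platform name to its best-case seconds per image."""
    return {
        name: ideal_latency_seconds(ops, rate)
        for name, rate in PEAK_OPS_PER_SECOND.items()
    }


if __name__ == "__main__":
    for name, seconds in ideal_latencies().items():
        print(f"{name}: {seconds * 1e3:.1f} ms per image")
```

Under these assumptions the CPU range works out to roughly 0.4 to 4 seconds per image against a few milliseconds for the GPU, which is the gap the excerpt points to.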


“…Table I lists the resource utilization for AlexNet/GoogLeNet. Each type of resource exceeds 70% of the total, thus making it difficult to reach the maximum frequency of 661 MHz in [22]. Finally, at the peak performance of 4.2 TOP/s with 16-bit quantization, 500 MHz is used for the EPEs, and 250 MHz is used for the others.…”

Section: A. Experimental Setup

mentioning

confidence: 99%

A Data-Center FPGA Acceleration Platform for Convolutional Neural Networks

Gao, Wang, Miao, et al. 2019

2019 29th International Conference on Field Programmable Logic and Applications (FPL)

Self Cite


Intensive computation is entering data centers with multiple workloads of deep learning. To balance the compute efficiency, performance, and total cost of ownership (TCO), the use of a field-programmable gate array (FPGA) with reconfigurable logic provides an acceptable acceleration capacity and is compatible with diverse computation-sensitive tasks in the cloud. In this paper, we develop an FPGA acceleration platform that leverages a unified framework architecture for general-purpose convolutional neural network (CNN) inference acceleration at a data center. To overcome the computation bound, 4,096 DSPs are assembled and shaped as supertile units (SUs) for different types of convolution, which provide up to 4.2 TOP/s 16-bit fixed-point performance at 500 MHz. The interleaved-task-dispatching method is proposed to map the computation across the SUs, and the memory bound is solved by a dispatching-assembling buffering model and broadcast caches. For various non-convolution operators, a filter processing unit is designed for general-purpose filter-like/pointwise operators. In the experiment, the performances of CNN models running on server-class CPUs, a GPU, and an FPGA are compared. The results show that our design achieves the best FPGA peak performance and a throughput at the same level as that of the state-of-the-art GPU in data centers, with more than 50 times lower latency.
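As a sanity check on the figures in this abstract, peak throughput can be estimated from the DSP count and clock rate. The sketch below assumes each DSP completes one multiply-accumulate per cycle, counted as two operations; that per-DSP rate is an assumption for illustration, not something the abstract states, and the function name is hypothetical.

```python
# Back-of-envelope peak throughput from DSP count and clock frequency.
# Assumption (not from the abstract): one multiply-accumulate per DSP per
# cycle, counted as 2 operations (one multiply, one add).

def peak_ops_per_second(num_dsps, clock_hz, ops_per_dsp_per_cycle=2):
    """Upper bound on operations per second for an array of DSP units."""
    return num_dsps * clock_hz * ops_per_dsp_per_cycle


if __name__ == "__main__":
    estimate = peak_ops_per_second(num_dsps=4096, clock_hz=500e6)
    print(f"Estimated peak: {estimate / 1e12:.3f} TOP/s")
```

With 4,096 DSPs at 500 MHz this gives about 4.1 TOP/s, the same order as the reported figure of up to 4.2 TOP/s; the abstract does not say how the remaining difference arises.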


“…The I/O characteristics of the approach is not reported quantitatively. Wu et al. [33] present a highly specialized architecture for maximizing DSP usage and frequency of 16 bit integer…”

mentioning

confidence: 99%

Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis

Licht, Kwasniewski, Hoefler 2020

Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays


Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. Matrix multiplication is no exception, and lower bounds have been proven and implemented both for shared and distributed memory systems. Reconfigurable hardware platforms are a lucrative target for I/O minimizing algorithms, as they offer full control of memory accesses to the programmer. While bounds developed in the context of fixed architectures still apply to these platforms, the spatially distributed nature of their computational and memory resources requires a decentralized approach to optimize algorithms for maximum hardware utilization. We present a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and minimum off-chip data movement, within constraints set by the hardware. We map the model to a concrete architecture using a high-level synthesis tool, maintaining a high level of abstraction, allowing us to support arbitrary data types, and enables maintainability and portability across FPGA devices. Kernels generated from our architecture are shown to offer competitive performance in practice, scaling with both compute and memory resources. We offer our design as an open source project to encourage the open development of linear algebra and I/O minimizing algorithms on reconfigurable hardware platforms.

Figure 1: (a) MMM CDAG, and (b) subcomputation V_i.

... yields fully deterministic behavior in the circuit: accessing memory, both on-chip and off-chip, is always done explicitly, rather than by a cache replacement scheme fixed by the hardware. The models established so far, however, pose a challenge for their applicability on FPGAs. They often rely on abstracting away many hardware details, assuming several idealized processing units with local memory and all-to-all communication [2,5,8,9]. Those assumptions do not hold for FPGAs, where the physical area size of custom-designed processing elements (PEs) and their layout are among most important concerns in designing efficient FPGA implementations [16]. Therefore, performance modeling for reconfigurable architectures requires taking constraints like logic resources, fan-out, routing, and on-chip memory characteristics into account. With an ever-increasing diversity in available hardware platforms, and as low-precision arithmetic and exotic data types are becoming key in modern DNN [17] and linear solver [18] applications, extensibility and flexibility of hardware architectures will be crucial to stay competitive. Existing high-performance FPGA implementations [19,20] are implemented in hardware description languages (HDLs), which drastically constrains their maintenance, reuse, generalizability, and portability. Furthermore, the source code is not disclosed, such that third-party users cannot benefit from the kernel or build on the archi...
