2018
DOI: 10.1016/j.micpro.2018.04.004
Throughput optimizations for FPGA-based deep neural network inference

Abstract: Deep neural networks are an extremely successful and widely used technique for various pattern recognition and machine learning tasks. Due to power and resource constraints, these computationally intensive networks are difficult to implement in embedded systems. Yet, the number of applications that can benefit from the mentioned possibilities is rapidly rising. In this paper, we propose novel architectures for the inference of previously learned and arbitrary deep neural networks on FPGA-based SoCs that are abl…

Cited by 19 publications (11 citation statements). References 15 publications.
“…2 is designed for GPU clusters, it also applies to other accelerators like TPU or FPGA. Existing measurements have shown that TPU [50] and FPGA [69] also exhibit a non-linear, monotonically-increasing relationship between t p and x, making it feasible to adopt LB-BSP in heterogeneous clusters with those hardware.…”
Section: A Drop-in Algorithm for LB-BSP in GPU Clusters
confidence: 99%
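The statement above notes that per-batch processing time t_p grows non-linearly but monotonically with batch size x on GPUs, TPUs, and FPGAs, which is what makes load-balanced batch assignment (LB-BSP) feasible. A minimal sketch of that idea, not taken from the cited papers: given a monotone time model t_p(x) per worker, binary-search a common deadline and give each worker the largest batch it can finish by then. The function name `assign_batches` and the timing models are illustrative assumptions.

```python
# Illustrative sketch (not from the cited papers): load-balanced batch
# assignment, assuming each worker's per-batch time t_p(x) is
# monotonically increasing in the batch size x.

def assign_batches(time_fns, total_batch):
    """Split total_batch across workers so finish times are roughly equal.

    time_fns: one monotone function t_p(x) -> seconds per worker.
    Strategy: binary-search a shared deadline T, then give each worker
    the largest integer batch it can finish within T (valid because
    t_p is monotone)."""
    def capacity(T):
        sizes = []
        for t in time_fns:
            lo, hi = 0, total_batch
            while lo < hi:                 # largest x with t(x) <= T
                mid = (lo + hi + 1) // 2
                if t(mid) <= T:
                    lo = mid
                else:
                    hi = mid - 1
            sizes.append(lo)
        return sizes

    lo_T, hi_T = 0.0, max(t(total_batch) for t in time_fns)
    for _ in range(50):                    # bisect the deadline
        mid_T = (lo_T + hi_T) / 2
        if sum(capacity(mid_T)) >= total_batch:
            hi_T = mid_T
        else:
            lo_T = mid_T

    sizes = capacity(hi_T)                 # sum(sizes) >= total_batch here
    surplus = sum(sizes) - total_batch     # shave off the integer overshoot
    for i in range(len(sizes)):
        take = min(surplus, sizes[i])
        sizes[i] -= take
        surplus -= take
        if surplus == 0:
            break
    return sizes
```

With a fast worker modelled as t_p(x) = 0.1x and a slower one as t_p(x) = 0.3x + 0.5, the fast worker receives the larger share of a 100-sample batch, mirroring the heterogeneous-cluster setting the statement describes.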
“…With the rapid development of semiconductor technology, the continuous increase in the scale of integrated circuits, and the continuous improvement of various development technology levels, English speech recognition technology has gradually become smaller after being combined with embedded systems based on DSP, FPGA, ASIC, and other devices. With the development of industrialization and practicality, the application field is also getting bigger and bigger [3]. As a modern information technology with extensive social and economic benefits, English speech recognition has made certain achievements, but there are still a series of problems when facing practical use.…”
Section: Introduction
confidence: 99%
“…Similarly, in [21], Yoon et al propose an expandable architecture capable of selective retraining when presented with new tasks and data. The continuous learning research community is mainly interested in scenarios [24] where datasets are unavailable in the form of neatly organized fully tagged datasets, but in an evolving application where new samples are made available at different times after the models are trained on the main dataset. An example of this would be a self-driving car trained on an object detection dataset receiving new data samples during deployment.…”
Section: Introduction
confidence: 99%
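The continual-learning scenario described above — a model trained on a main dataset that must absorb new samples arriving at different times after deployment — can be made concrete with a toy sketch. The class below is an illustrative assumption, not the method of Yoon et al. [21]: a nearest-centroid classifier whose class centroids are updated incrementally as each new labelled sample arrives, with no full retraining.

```python
# Illustrative only: a toy classifier for the streaming setting above --
# new labelled samples arrive after initial training and are folded into
# running class centroids without retraining on the full dataset.

class StreamingCentroidClassifier:
    def __init__(self):
        self.sums = {}    # label -> running per-feature sums
        self.counts = {}  # label -> number of samples seen

    def update(self, x, label):
        """Fold one new sample into the running centroid for its class."""
        if label not in self.sums:
            self.sums[label] = [0.0] * len(x)
            self.counts[label] = 0
        self.counts[label] += 1
        for i, v in enumerate(x):
            self.sums[label][i] += v

    def predict(self, x):
        """Return the label whose centroid is nearest to x (squared L2)."""
        best, best_d = None, float("inf")
        for label, s in self.sums.items():
            c = self.counts[label]
            d = sum((v - si / c) ** 2 for v, si in zip(x, s))
            if d < best_d:
                best, best_d = label, d
        return best
```

In the self-driving-car example, `update` would be called on each newly collected sample during deployment, shifting the centroids without revisiting the original training set.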