Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques 2018
DOI: 10.1145/3243176.3243184
E-PUR: An Energy-Efficient Processing Unit for Recurrent Neural Networks

Cited by 27 publications (12 citation statements)
References 17 publications
“…This section presents the evaluation of the proposed fuzzy memoization technique for RNNs, implemented on top of E-PUR [30]. We refer to it as E-PUR+BM.…”
Section: Results (mentioning)
confidence: 99%
“…Typically, the number of elements in the weight matrices ranges from a few thousands to millions of elements and, thus, fetching them from on-chip buffers or main memory is one of the major sources of energy consumption. Not surprisingly, it accounts for up to 80% of the total energy consumption in state-of-the-art accelerators [30]. For this reason, a very effective way of saving energy in RNNs is to avoid fetching the synaptic weights.…”
Section: Motivation (mentioning)
confidence: 99%
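The fuzzy memoization idea referenced in the statements above can be illustrated with a rough sketch: cache each neuron's most recent input and output, and when a new input is sufficiently similar, reuse the cached output instead of recomputing the dot product, so the corresponding weight row never has to be fetched. This is only an illustrative sketch of that general idea under assumed details; the class name, the similarity test, and the threshold are placeholders, not the actual E-PUR+BM mechanism.

# Illustrative sketch of neuron-level fuzzy memoization (not the actual
# E-PUR+BM design): reuse a neuron's cached output when the current input
# is close to the previously seen input, skipping the dot product and the
# associated weight fetch. Names, threshold, and similarity test are assumptions.
import numpy as np

class FuzzyMemoNeuron:
    def __init__(self, weights, threshold=0.05):
        self.w = np.asarray(weights, dtype=np.float32)  # synaptic weights (the costly fetch)
        self.threshold = threshold                      # similarity tolerance
        self.last_x = None                              # cached input
        self.last_y = None                              # cached output

    def forward(self, x):
        x = np.asarray(x, dtype=np.float32)
        if self.last_x is not None:
            # Cheap similarity test on the inputs; if close enough,
            # return the memoized output and avoid touching the weights.
            if np.max(np.abs(x - self.last_x)) < self.threshold:
                return self.last_y
        y = float(np.dot(self.w, x))  # full computation: requires fetching self.w
        self.last_x, self.last_y = x, y
        return y

# Usage: consecutive, slowly varying inputs (typical of speech frames) hit the memo.
neuron = FuzzyMemoNeuron(weights=np.random.randn(256))
x0 = np.random.randn(256)
y0 = neuron.forward(x0)
y1 = neuron.forward(x0 + 0.01)  # likely served from the memo, no weight fetch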
“…While for most applications training is a one-time task, and can therefore be performed in the cloud, there is a growing demand for executing NN inference on embedded systems (so-called "edge" nodes), in order to enhance the features of many Internet of Things (IoT) applications [3]. In fact, edge inference could yield benefits in terms of data privacy, response latency and energy efficiency, as it would eliminate the need of transmitting high volumes of raw data to the cloud [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19].…”
Section: Introduction (mentioning)
confidence: 99%
“…One of the most popular approaches is to design custom hardware accelerators to implement the most critical operations involved in the inference phase, which are typically multiplications of large matrices and vectors, in a fast and efficient way. Most accelerators have been designed for convolutional neural networks (CNNs), due to their outstanding results in computer vision applications [3,6,7,12,16], but more recently, hardware acceleration of sequence-to-sequence models, such as RNNs and transformers, has also been investigated extensively [10,15,[17][18][19].…”
Section: Introduction (mentioning)
confidence: 99%
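To make concrete what "multiplications of large matrices and vectors" means for recurrent models, the sketch below spells out a single LSTM cell step as a handful of matrix-vector products. This is the standard textbook formulation, not any particular accelerator's datapath; the dimensions and names are placeholders.

# Minimal NumPy sketch of one LSTM cell step, showing that the bulk of the
# work (and of the weight traffic) is matrix-vector multiplication.
# Standard textbook formulation; sizes and names are placeholders.
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """x: input vector; h_prev/c_prev: previous hidden/cell state.
    W: (4H, X) input weights, U: (4H, H) recurrent weights, b: (4H,) biases."""
    z = W @ x + U @ h_prev + b          # the dominant matrix-vector products
    i, f, o, g = np.split(z, 4)         # input, forget, output gates and candidate
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g              # new cell state
    h = o * np.tanh(c)                  # new hidden state
    return h, c

# Usage with placeholder sizes: a 128-dimensional input and 256 hidden units.
X, H = 128, 256
h, c = lstm_step(np.random.randn(X), np.zeros(H), np.zeros(H),
                 np.random.randn(4 * H, X), np.random.randn(4 * H, H),
                 np.zeros(4 * H))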