2018
DOI: 10.48550/arxiv.1804.04849
Preprint

The unreasonable effectiveness of the forget gate

Jos van der Westhuizen, Joan Lasenby

Abstract: Given the success of the gated recurrent unit, a natural question is whether all the gates of the long short-term memory (LSTM) network are necessary. Previous research has shown that the forget gate is one of the most important gates in the LSTM. Here we show that a forget-gate-only version of the LSTM with chrono-initialized biases not only provides computational savings but also outperforms the standard LSTM on multiple benchmark datasets and competes with some of the best contemporary models. Our proposed network…
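As a rough illustration of the cell described above, here is a minimal sketch of a forget-gate-only recurrent cell with chrono-initialized forget biases, written in plain NumPy; the parameter names and initialization scale are illustrative assumptions, not the authors' released implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ForgetGateOnlyCell:
    """Sketch of a forget-gate-only recurrent cell (JANET-like)."""

    def __init__(self, input_size, hidden_size, t_max=100, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_size)
        # Two weight blocks only: one for the forget gate, one for the candidate.
        self.W_f = rng.uniform(-s, s, (hidden_size, input_size))
        self.U_f = rng.uniform(-s, s, (hidden_size, hidden_size))
        self.W_c = rng.uniform(-s, s, (hidden_size, input_size))
        self.U_c = rng.uniform(-s, s, (hidden_size, hidden_size))
        # Chrono initialization: forget biases drawn as log U(1, t_max - 1),
        # so the gate starts mostly open over time scales up to t_max
        # (t_max is an assumed task horizon, not a value from the paper).
        self.b_f = np.log(rng.uniform(1.0, t_max - 1.0, hidden_size))
        self.b_c = np.zeros(hidden_size)

    def step(self, x_t, h_prev):
        f_t = sigmoid(self.W_f @ x_t + self.U_f @ h_prev + self.b_f)
        cand = np.tanh(self.W_c @ x_t + self.U_c @ h_prev + self.b_c)
        # The single gate both keeps part of the old state and, via (1 - f_t),
        # admits the new candidate; no input or output gate is needed.
        return f_t * h_prev + (1.0 - f_t) * cand

Unrolling step over a sequence gives the full recurrence; everything else (loss, training loop) is as for a standard RNN.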

Cited by 11 publications (14 citation statements); references 17 publications (27 reference statements).

Citation statements (ordered by relevance):
“…Greff et al. [2016], through numerous experiments, find that the forget gate and the output activation function are the most critical components of the LSTM block and that removing either of them impairs performance significantly. A similar conclusion was reached in Van der Westhuizen and Lasenby [2018], where a new cell called the JANET was proposed, which is based on the LSTM but uses just the forget gate. Minimalistic designs of recurrent cells with only a forget gate were also proposed in Zhou et al. [2016] and Heck and Salem [2017].…”
Section: Related Work (supporting)
Confidence: 70%
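To put a rough number on the computational savings mentioned in the abstract and implied by these minimalistic designs, one can compare parameter counts for a standard LSTM layer (four weight blocks) against a single-gate cell (two blocks). The PyTorch-based check below uses illustrative sizes, not values from the paper, and computes the single-gate count by hand.

import torch.nn as nn

input_size, hidden_size = 64, 128

# Standard LSTM layer: input, forget, and output gates plus the candidate block.
lstm = nn.LSTM(input_size, hidden_size)
lstm_params = sum(p.numel() for p in lstm.parameters())

# Forget-gate-only cell: just the gate block and the candidate block.
janet_params = 2 * (hidden_size * (input_size + hidden_size) + hidden_size)

print(lstm_params, janet_params)  # the single-gate cell has roughly half the parameters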
“…We empirically found that such simplification sometimes creates a gap between the theoretical properties of a gated RNN and its actual behavior. For example, while existing studies indicate that the gradient of the loss with respect to the inputs decreases exponentially as time goes back in gated RNNs [11], [17], such behavior does not necessarily occur in a trained model (Figure 1). It is important to clarify when and how we can close this gap, both for a more advanced understanding of RNNs and for the construction of more sophisticated models.…”
Section: Introduction (mentioning)
Confidence: 98%
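The exponential decay mentioned in the statement above can be probed directly; the sketch below measures the norm of the gradient of a final-step loss with respect to each input of a randomly initialized PyTorch LSTM (an illustrative experiment, not the citing paper's Figure 1).

import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, input_size, hidden_size = 50, 1, 8, 32

lstm = nn.LSTM(input_size, hidden_size)
x = torch.randn(seq_len, batch, input_size, requires_grad=True)
output, _ = lstm(x)

# The loss depends only on the final time step, so dL/dx_t measures how much
# influence an input t steps in the past still has on that loss.
loss = output[-1].pow(2).sum()
loss.backward()

grad_norms = x.grad.norm(dim=(1, 2))
for t, g in enumerate(grad_norms.tolist()):
    print(f"t={t:3d}  ||dL/dx_t|| = {g:.3e}")

With random weights the norms typically shrink as t moves away from the final step; whether a trained model shows the same decay is exactly the gap the authors point out.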
“…To enable models to learn from long-term sequential data, RNNs with a gating mechanism (called gated RNNs), such as the Long Short-Term Memory (LSTM) [8] or the Gated Recurrent Unit (GRU) [9], have been proposed. Gated RNNs control how much information from the past state is retained in the next state by means of a forget gate function [10], which helps mitigate the vanishing gradient problem [11]. Furthermore, the forget gate has recently been considered to play a role in representing temporal characteristics in RNN models [12].…”
Section: Introduction (mentioning)
Confidence: 99%
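The retention mechanism this statement describes is the cell-state recursion of the LSTM, written here in standard notation (a reconstruction, not a quotation from the citing paper):

c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)

Along the direct cell-state path (ignoring contributions through h_{t-1}), \partial c_T / \partial c_k = \prod_{t=k+1}^{T} \mathrm{diag}(f_t); gate values close to 1 keep this product from collapsing to zero, which is the sense in which the forget gate both mitigates the vanishing gradient and sets the time scale over which past information is retained.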
“…Several variants of the recurrent network have been proposed to alleviate the aforementioned problems, such as the Long Short-Term Memory network, or LSTM (Hochreiter & Schmidhuber, 1997; Gers et al., 1999), which has gained wide popularity and has been thoroughly studied (Jozefowicz et al., 2015; Greff et al., 2017). Various architectures have been proposed as extensions of the LSTM, such as the forget-gate-only architecture called the JANET (Van der Westhuizen & Lasenby, 2018). These efforts rely heavily on a memory cell, which retains information from the past and attenuates the effects of vanishing/exploding gradients.…”
Section: Introduction (mentioning)
Confidence: 99%