2018
DOI: 10.48550/arxiv.1804.04849
Preprint

The unreasonable effectiveness of the forget gate

Jos van der Westhuizen, Joan Lasenby

Abstract: Given the success of the gated recurrent unit, a natural question is whether all the gates of the long short-term memory (LSTM) network are necessary. Previous research has shown that the forget gate is one of the most important gates in the LSTM. Here we show that a forget-gate-only version of the LSTM with chrono-initialized biases not only provides computational savings but also outperforms the standard LSTM on multiple benchmark datasets and competes with some of the best contemporary models. Our proposed network…
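As a rough illustration of the cell described above, here is a minimal sketch of a forget-gate-only recurrent cell with chrono-initialized forget biases, written in plain NumPy; the parameter names and initialization scale are illustrative assumptions, not the authors' released implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ForgetGateOnlyCell:
    """Sketch of a forget-gate-only recurrent cell (JANET-like)."""

    def __init__(self, input_size, hidden_size, t_max=100, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_size)
        # Two weight blocks only: one for the forget gate, one for the candidate.
        self.W_f = rng.uniform(-s, s, (hidden_size, input_size))
        self.U_f = rng.uniform(-s, s, (hidden_size, hidden_size))
        self.W_c = rng.uniform(-s, s, (hidden_size, input_size))
        self.U_c = rng.uniform(-s, s, (hidden_size, hidden_size))
        # Chrono initialization: forget biases drawn as log U(1, t_max - 1),
        # so the gate starts mostly open over time scales up to t_max
        # (t_max is an assumed task horizon, not a value from the paper).
        self.b_f = np.log(rng.uniform(1.0, t_max - 1.0, hidden_size))
        self.b_c = np.zeros(hidden_size)

    def step(self, x_t, h_prev):
        f_t = sigmoid(self.W_f @ x_t + self.U_f @ h_prev + self.b_f)
        cand = np.tanh(self.W_c @ x_t + self.U_c @ h_prev + self.b_c)
        # The single gate both keeps part of the old state and, via (1 - f_t),
        # admits the new candidate; no input or output gate is needed.
        return f_t * h_prev + (1.0 - f_t) * cand

Unrolling step over a sequence gives the full recurrence; everything else (loss, training loop) is as for a standard RNN.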

Cited by 11 publications (14 citation statements); references 17 publications (27 reference statements).

Citation statements (ordered by relevance):
“…Greff et al. [2016], through numerous experiments, find that the forget gate and the output activation function are the most critical components of the LSTM block and that removing either of them impairs performance significantly. A similar conclusion was reached in Van der Westhuizen and Lasenby [2018], where a new cell called the JANET was proposed, which is based on the LSTM but uses just the forget gate. Minimalistic designs of recurrent cells with only a forget gate were also proposed in Zhou et al. [2016] and Heck and Salem [2017].…”
Section: Related Work (supporting)
Confidence: 70%
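To put a rough number on the computational savings mentioned in the abstract and implied by these minimalistic designs, one can compare parameter counts for a standard LSTM layer (four weight blocks) against a single-gate cell (two blocks). The PyTorch-based check below uses illustrative sizes, not values from the paper, and computes the single-gate count by hand.

import torch.nn as nn

input_size, hidden_size = 64, 128

# Standard LSTM layer: input, forget, and output gates plus the candidate block.
lstm = nn.LSTM(input_size, hidden_size)
lstm_params = sum(p.numel() for p in lstm.parameters())

# Forget-gate-only cell: just the gate block and the candidate block.
janet_params = 2 * (hidden_size * (input_size + hidden_size) + hidden_size)

print(lstm_params, janet_params)  # the single-gate cell has roughly half the parameters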
“…We empirically found that such simplification sometimes creates a gap between the theoretical properties of a gated RNN and its actual behavior. For example, while existing studies indicate that the gradient of the loss with respect to the inputs decreases exponentially as time goes back in gated RNNs [11], [17], such behavior does not necessarily occur in a trained model (Figure 1). It is important to clarify when and how we can close this gap, both for a more advanced understanding of RNNs and for the construction of more sophisticated models.…”
Section: Introduction (mentioning)
Confidence: 98%
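The exponential decay mentioned in the statement above can be probed directly; the sketch below measures the norm of the gradient of a final-step loss with respect to each input of a randomly initialized PyTorch LSTM (an illustrative experiment, not the citing paper's Figure 1).

import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, input_size, hidden_size = 50, 1, 8, 32

lstm = nn.LSTM(input_size, hidden_size)
x = torch.randn(seq_len, batch, input_size, requires_grad=True)
output, _ = lstm(x)

# The loss depends only on the final time step, so dL/dx_t measures how much
# influence an input t steps in the past still has on that loss.
loss = output[-1].pow(2).sum()
loss.backward()

grad_norms = x.grad.norm(dim=(1, 2))
for t, g in enumerate(grad_norms.tolist()):
    print(f"t={t:3d}  ||dL/dx_t|| = {g:.3e}")

With random weights the norms typically shrink as t moves away from the final step; whether a trained model shows the same decay is exactly the gap the authors point out.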
“…To enable models to learn from long-term sequential data, RNNs with a gating mechanism (called gated RNNs), such as the Long Short-Term Memory (LSTM) [8] or the Gated Recurrent Unit (GRU) [9], have been proposed. Gated RNNs control how much information from the past state is retained in the next state by means of a forget gate function [10], which helps mitigate the vanishing gradient problem [11]. Furthermore, the forget gate has recently been considered to play a role in representing temporal characteristics in RNN models [12].…”
Section: Introduction (mentioning)
Confidence: 99%
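The retention mechanism this statement describes is the cell-state recursion of the LSTM, written here in standard notation (a reconstruction, not a quotation from the citing paper):

c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)

Along the direct cell-state path (ignoring contributions through h_{t-1}), \partial c_T / \partial c_k = \prod_{t=k+1}^{T} \mathrm{diag}(f_t); gate values close to 1 keep this product from collapsing to zero, which is the sense in which the forget gate both mitigates the vanishing gradient and sets the time scale over which past information is retained.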
“…Several variants of the recurrent network have been proposed to alleviate the aforementioned problems, such as the Long Short-Term Memory network, or LSTM (Hochreiter & Schmidhuber, 1997; Gers et al., 1999), which has gained wide popularity and has been thoroughly studied (Jozefowicz et al., 2015; Greff et al., 2017). Various architectures have been proposed as extensions of the LSTM, such as the forget-gate-only architecture called the JANET (Van der Westhuizen & Lasenby, 2018). These efforts rely heavily on a memory cell, which retains information from the past and attenuates the effects of vanishing/exploding gradients.…”
Section: Introduction (mentioning)
Confidence: 99%