2017
DOI: 10.48550/arxiv.1708.02182
Preprint

Regularizing and Optimizing LSTM Language Models

Abstract: Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization. Furth…
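As context for the abstract's central idea, here is a minimal PyTorch sketch of DropConnect applied to the hidden-to-hidden (recurrent) weights of a single LSTM layer. The class and parameter names are illustrative assumptions, not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDroppedLSTMCell(nn.Module):
    """Sketch of DropConnect on the recurrent (hidden-to-hidden) weights of one
    LSTM layer. Illustrative only; not the paper's released code."""

    def __init__(self, input_size, hidden_size, weight_dropout=0.5):
        super().__init__()
        self.hidden_size = hidden_size
        self.weight_dropout = weight_dropout
        # Input-to-hidden and hidden-to-hidden weights for the four LSTM gates.
        self.w_ih = nn.Parameter(torch.randn(4 * hidden_size, input_size) * 0.1)
        self.w_hh = nn.Parameter(torch.randn(4 * hidden_size, hidden_size) * 0.1)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))

    def forward(self, x, state):
        # x: (batch, seq_len, input_size); state: (h, c), each (batch, hidden_size)
        h, c = state
        # DropConnect: drop individual recurrent *weights* once per call,
        # so the same mask is reused at every timestep of the sequence.
        w_hh = F.dropout(self.w_hh, p=self.weight_dropout, training=self.training)
        outputs = []
        for t in range(x.size(1)):
            gates = x[:, t] @ self.w_ih.t() + h @ w_hh.t() + self.bias
            i, f, g, o = gates.chunk(4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            outputs.append(h)
        return torch.stack(outputs, dim=1), (h, c)
```

Because the weight matrix itself is dropped rather than the activations, the mask is shared across all timesteps, which is what distinguishes this form of recurrent regularization from applying ordinary dropout to the hidden states.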

Cited by 166 publications (296 citation statements)
References 26 publications
“…Instead of taking the true gradients, one can train using gradients clipped in some way. This has proven to be of use in a variety of domains [Kaiser and Sutskever, 2015, Merity et al., 2017, Gehring et al., 2017] but will not fix all problems. As before, this calculation of the gradient is biased.…”
Section: Gradient Clipping
confidence: 99%
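A minimal illustration of the kind of clipped-gradient training step the quoted passage refers to, using PyTorch's norm-based clipping; the model, loss, learning rate, and clip threshold are placeholder assumptions, and the cited works differ in the exact clipping scheme they use:

```python
import torch
import torch.nn as nn

# Placeholder model and optimiser; only the clipping call is the point here.
model = nn.LSTM(input_size=64, hidden_size=128)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
loss_fn = nn.MSELoss()

def training_step(inputs, targets, clip_norm=0.25):
    optimizer.zero_grad()
    outputs, _ = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    # Rescale the whole gradient vector if its L2 norm exceeds clip_norm;
    # as the quoted passage notes, the resulting gradient estimate is biased.
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
    return loss.item()

# Example shapes: (seq_len, batch, input_size) and a matching target.
x = torch.randn(35, 20, 64)
y = torch.randn(35, 20, 128)
print(training_step(x, y))
```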
“…It employs three novel techniques for fine-tuning the language models for various NLP tasks, which are discriminative fine-tuning, slanted triangular learning rates (STLR) and gradual unfreezing. AWD-LSTM language model [38,39],…”
Section: ULMFiT
confidence: 99%
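For reference, a small sketch of the slanted triangular learning rate (STLR) schedule mentioned in the quote: a short linear warm-up followed by a longer linear decay. The hyperparameter values below are illustrative defaults, not prescribed ones:

```python
def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular schedule: linear warm-up for the first cut_frac of
    training, then linear decay down to lr_max / ratio."""
    cut = int(total_steps * cut_frac)
    if t < cut:
        p = t / cut                                        # warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))     # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Example: with 1000 steps and cut_frac = 0.1, the rate peaks at step 100
# and decays over the remaining 900 steps.
schedule = [slanted_triangular_lr(t, 1000) for t in range(1000)]
```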
“…Our work is similar to [18] in that the bias information is encoded into a trie structure. However, they require additional training of an RNN-T [19] and an LSTM-LM [20] to utilise bias information, whereas our method requires only a list of keywords we are interested in.…”
Section: Introduction
confidence: 99%
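The quote contrasts trie-based keyword biasing with approaches that need extra model training. Below is a generic sketch of encoding a keyword list into a trie; the helper names are hypothetical and this is not the cited paper's exact data structure:

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # maps a token/character to the next node
        self.is_keyword_end = False

def build_keyword_trie(keywords):
    """Insert each keyword (as a sequence of characters or tokens) into a trie."""
    root = TrieNode()
    for word in keywords:
        node = root
        for token in word:
            node = node.children.setdefault(token, TrieNode())
        node.is_keyword_end = True
    return root

# Example: bias toward a few terms without retraining any model.
trie = build_keyword_trie(["berlin", "bert", "lstm"])
```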