2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2015.7179001

Scaling recurrent neural network language models

Abstract: This paper investigates the scaling properties of Recurrent Neural Network Language Models (RNNLMs). We discuss how to train very large RNNs on GPUs and address the questions of how RNNLMs scale with respect to model size, training-set size, computational costs and memory. Our analysis shows that despite being more costly to train, RNNLMs obtain much lower perplexities on standard benchmarks than n-gram models. We train the largest known RNNs and present relative word error rate gains of 18% on an ASR task. W…
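The 18% figure quoted in the abstract is a relative word error rate reduction. As a quick reminder of how such a relative gain is computed, here is a minimal Python sketch; the baseline and improved WER values below are made up for illustration and are not results from the paper.

```python
# Relative WER reduction; the numbers are hypothetical, not from the paper.
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction of new_wer with respect to baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer

# A hypothetical baseline at 20.0% WER improved to 16.4% WER:
print(f"{relative_wer_reduction(20.0, 16.4):.1%}")  # -> 18.0%
```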

Cited by 50 publications (41 citation statements) | References 11 publications

“…The neural language model is an LSTM RNN with 2 layers and 1000 hidden states. It has a target vocabulary of 100K and is trained using noise-contrastive estimation (Mnih and Teh, 2012; Vaswani et al., 2013; Baltescu and Blunsom, 2015; Williams et al., 2015). Additionally, it is trained using dropout with a dropout probability of 0.2, as suggested by Zaremba et al. (2014).…”
Section: Re-scoring Results (mentioning)
confidence: 99%
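For concreteness, the configuration described in the excerpt above (a 2-layer LSTM with 1000 hidden units, a 100K-word output vocabulary, and dropout of 0.2) can be sketched in PyTorch as follows. This is only a sketch of the stated architecture, not the cited implementation: it uses a full softmax output layer rather than noise-contrastive estimation, and the embedding size is an assumption.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """2-layer LSTM LM with 1000 hidden units, 100K vocabulary, dropout 0.2."""
    def __init__(self, vocab_size=100_000, embed_dim=1000, hidden_dim=1000,
                 num_layers=2, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.drop = nn.Dropout(dropout)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            dropout=dropout, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) word indices
        x = self.drop(self.embed(tokens))
        h, state = self.lstm(x, state)
        return self.out(self.drop(h)), state  # logits over the vocabulary

model = LSTMLanguageModel()
logits, _ = model(torch.randint(0, 100_000, (4, 20)))  # (4, 20, 100000)
```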
“…Morin and Bengio (2005) propose hierarchical softmax, where at each step log₂ V binary classifications are performed instead of a single classification over a large number of classes. Gutmann and Hyvärinen (2010) propose noise-contrastive estimation, which discriminates between positive labels and k (k ≪ V) negative labels sampled from a noise distribution, and has been applied successfully to natural language processing tasks (Mnih and Teh, 2012; Vaswani et al., 2013; Williams et al., 2015; Zoph et al., 2016). Although these two approaches provide good speedups for training, they still suffer at test time.…”
Section: Introduction (mentioning)
confidence: 99%
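As a concrete reference for the noise-contrastive estimation objective mentioned in the excerpt above (discriminating each true next word from k ≪ V words sampled from a noise distribution), here is a minimal PyTorch sketch of the per-word NCE loss in the spirit of Mnih and Teh (2012). The function and tensor names are illustrative and not taken from any cited code.

```python
import torch
import torch.nn.functional as F

def nce_loss(true_score, noise_scores, true_logq, noise_logq, k):
    """Noise-contrastive estimation loss for one batch of target words.

    true_score:   (batch,)    unnormalised model score s(w) for the true word
    noise_scores: (batch, k)  scores for k words sampled from the noise dist. q
    true_logq, noise_logq:    log q(.) of those words under the noise dist.
    """
    log_k = torch.log(torch.tensor(float(k)))
    # Classify the observed word as "data" ...
    pos = F.logsigmoid(true_score - (log_k + true_logq))
    # ... and each of the k sampled words as "noise".
    neg = F.logsigmoid(-(noise_scores - (log_k + noise_logq))).sum(dim=1)
    return -(pos + neg).mean()

# Example with random scores, k = 25 noise samples, and a uniform noise
# distribution over a hypothetical 10,000-word vocabulary (log q ≈ -9.21).
batch, k = 8, 25
loss = nce_loss(torch.randn(batch), torch.randn(batch, k),
                torch.full((batch,), -9.21), torch.full((batch, k), -9.21), k)
```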
“…We thus combined our recently released "Eesen" ASR toolkit [21] with an open-source "transcriber" [22] and segmentation [23] framework, and added additional functionality, releasing the system with English acoustic models trained on TEDLIUM data [24] and a large general-purpose language model [25].…”
Section: The "Eesen Transcriber" VM (mentioning)
confidence: 99%
“…The "Eesen Transcriber" is currently distributed with a TEDLIUM version 2 [24] system, which delivers a WER of less than 15% on the development set, using the pruned Cantab LM [25]. It decodes about four times faster than real-time on a single processor, and can be retrained from scratch within 6 days on a single GPU.…”
Section: Performance and Characteristics (mentioning)
confidence: 99%