Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1477

Simple Recurrent Units for Highly Parallelizable Recurrence

Abstract: Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable a highly parallelized implementation, and come with careful initialization to facilitate training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves a 5-9x speed-up over cuDNN LSTM…

Cited by 194 publications (141 citation statements) · References 34 publications
“…We apply the same encoder architecture throughout all experiments. We use a 4-layer recurrent neural network with SRU cells (Lei et al., 2018) and a hidden size of 128. We use pretrained GloVe embeddings (Pennington et al., 2014), which are fixed during training.…”
Section: Hyperparameters and Implementation Details (mentioning)
confidence: 99%
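The encoder described in the statement above can be sketched as follows. This is a minimal illustration only, assuming the open-source sru package exposes an SRU module with an nn.LSTM-like interface and that a GloVe embedding matrix has been loaded elsewhere; the names SRUEncoder and glove_vectors are illustrative, not taken from the cited work.

    import torch
    import torch.nn as nn
    from sru import SRU  # assumed: pip install sru (asappresearch/sru)

    class SRUEncoder(nn.Module):
        """4-layer SRU encoder over frozen pretrained GloVe embeddings,
        mirroring the configuration quoted above (hidden size 128)."""
        def __init__(self, glove_vectors: torch.Tensor,
                     hidden_size: int = 128, num_layers: int = 4):
            super().__init__()
            # Embeddings are fixed during training (freeze=True).
            self.embedding = nn.Embedding.from_pretrained(glove_vectors, freeze=True)
            self.rnn = SRU(input_size=glove_vectors.size(1),
                           hidden_size=hidden_size,
                           num_layers=num_layers)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            # token_ids: (seq_len, batch); SRU, like nn.LSTM, is time-major by default.
            emb = self.embedding(token_ids)
            output, _state = self.rnn(emb)  # output: (seq_len, batch, hidden_size)
            return output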
“…While many of the models cited above implement their RNNs with an LSTM (Hochreiter and Schmidhuber, 1997), we instead use an SRU (Lei et al., 2018). SRU uses light recurrence, which makes it highly parallelizable, and Lei et al. (2018) showed that it trains 5-9x faster than cuDNN LSTM. SRU also exhibits a significant speedup in inference time compared to LSTM (by a factor of 4.1x in our experiments), which is particularly relevant in a production setting.…”
Section: Related Work (mentioning)
confidence: 99%
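The "light recurrence" this statement refers to is why SRU parallelizes well: every matrix multiplication depends only on the input x_t, so it can be computed for the whole sequence in one batched call, and the remaining sequential work is element-wise. The sketch below is a rough, unoptimized rendering of the update equations as reported in Lei et al. (2018); the released library fuses the element-wise loop into a single CUDA kernel, and the sketch assumes the input and hidden sizes match so the highway term can add x_t directly.

    import torch

    def sru_layer(x, W, W_f, W_r, v_f, v_r, b_f, b_r):
        # x: (seq_len, batch, d); W, W_f, W_r: (d, d); v_*, b_*: (d,)
        # Parallelizable part: all matrix multiplications depend only on x,
        # so they are computed for every time step at once.
        U = x @ W      # candidate values
        F = x @ W_f    # forget-gate pre-activations
        R = x @ W_r    # reset-gate pre-activations

        seq_len, batch, d = x.shape
        c = x.new_zeros(batch, d)
        outputs = []
        # Sequential part: only cheap element-wise operations remain.
        for t in range(seq_len):
            c_prev = c
            f = torch.sigmoid(F[t] + v_f * c_prev + b_f)   # forget gate
            c = f * c_prev + (1 - f) * U[t]                # internal state
            r = torch.sigmoid(R[t] + v_r * c_prev + b_r)   # reset gate
            h = r * c + (1 - r) * x[t]                     # highway connection
            outputs.append(h)
        return torch.stack(outputs), c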
“…SRU also exhibits a significant speedup in inference time compared to LSTM (by a factor of 4.1x in our experiments), which is particularly relevant in a production setting. Furthermore, Lei et al. (2018) showed that SRU matches or exceeds the performance of models using LSTMs or the Transformer architecture (Vaswani et al., 2017) on a number of NLP tasks, meaning significant speed gains can be achieved without a drop in performance.…”
Section: Related Work (mentioning)
confidence: 99%
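As a way to sanity-check such latency claims in one's own setting, the snippet below times a single forward pass of a 4-layer nn.LSTM against a 4-layer SRU. This is only a hedged sketch: it assumes the sru package is installed, the tensor sizes are arbitrary, and CPU timings will not reproduce the GPU speedups quoted above.

    import time
    import torch
    from sru import SRU  # assumed available (pip install sru)

    batch, seq_len, d = 32, 128, 256          # illustrative sizes only
    x = torch.randn(seq_len, batch, d)

    lstm = torch.nn.LSTM(d, d, num_layers=4)
    sru = SRU(d, d, num_layers=4)

    def mean_forward_time(module, x, n=20):
        with torch.no_grad():
            module(x)                          # warm-up pass
            start = time.perf_counter()
            for _ in range(n):
                module(x)
        return (time.perf_counter() - start) / n

    print("LSTM forward (s):", mean_forward_time(lstm, x))
    print("SRU  forward (s):", mean_forward_time(sru, x))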
“…In this experiment, we fine-tune the BERT model on two standard text classification benchmarks: TREC (Li and Roth, 2002) and the Sentiment Treebank (Socher et al., 2013). We then apply knowledge distillation to reduce the BERT model to a simple 4-layer, 256-unit SRU network (Lei et al., 2018). This is a typical multistage experiment with preprocessing, fine-tuning, and distillation stages.…”
Section: Case Study: BERT Distillation (mentioning)
confidence: 99%
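The distillation stage mentioned above typically trains the small SRU student to match the fine-tuned BERT teacher's output distribution. The loss below is the standard temperature-softened knowledge-distillation objective, offered as a generic sketch; the exact objective and weighting used in the cited work may differ, and the parameter names are illustrative.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        # Soften both distributions with the same temperature.
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        log_p_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
        # KL term between student and teacher, rescaled by T^2 as is conventional.
        kd = F.kl_div(log_p_student, log_p_teacher,
                      reduction="batchmean", log_target=True) * temperature ** 2
        # Ordinary cross-entropy on the gold labels.
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce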