In the last several years, neural network models have significantly improved accuracy on a number of NLP tasks. However, one serious drawback that has impeded their adoption in production systems is their slow runtime speed compared to alternative models, such as maximum entropy classifiers. In Devlin et al. (2014), the authors presented a simple technique for speeding up feed-forward embedding-based neural network models, where the dot products between each word embedding and the corresponding slice of the first hidden layer's weight matrix are pre-computed offline. This technique, however, cannot be applied to hidden layers beyond the first. In this paper, we explore a neural network architecture where the embedding layer feeds into multiple hidden layers that are placed "next to" one another, so that each can be pre-computed independently. On a large-scale language modeling task, this architecture achieves a 10x speedup at runtime and a significant reduction in perplexity compared to a standard multilayer network.
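To make the pre-computation idea concrete, the sketch below illustrates the first-layer trick from Devlin et al. (2014) that our architecture builds on: because each context position's embedding is multiplied by a fixed slice of the first hidden layer's weight matrix, those products can be tabulated offline, reducing the runtime computation to table lookups and additions. This is an illustrative sketch, not the authors' implementation; the vocabulary and layer sizes, variable names, and the tanh nonlinearity are assumptions chosen for clarity. The architecture proposed in this paper applies the same trick to several hidden layers placed side by side, each fed directly by the embedding layer.

```python
import numpy as np

# Illustrative sizes (assumptions): vocab, embedding dim, hidden dim, context length.
V, D, H, N = 10000, 64, 256, 4

rng = np.random.default_rng(0)
E = rng.standard_normal((V, D))        # word embedding matrix
W1 = rng.standard_normal((N * D, H))   # first hidden layer weights over the concatenated context
b1 = np.zeros(H)

# Offline: for every context position i and every word w, pre-compute
# E[w] @ W1[i*D:(i+1)*D, :].  Table shape: (N, V, H).
precomputed = np.stack(
    [E @ W1[i * D:(i + 1) * D, :] for i in range(N)], axis=0
)

def first_hidden_layer(context_word_ids):
    """Runtime: the first hidden layer reduces to N table lookups and a sum."""
    acc = b1.copy()
    for i, w in enumerate(context_word_ids):
        acc += precomputed[i, w]
    return np.tanh(acc)

# Sanity check: identical to the naive computation, but with no
# matrix-vector products at runtime.
ctx = [17, 923, 4031, 88]
naive = np.tanh(np.concatenate([E[w] for w in ctx]) @ W1 + b1)
assert np.allclose(first_hidden_layer(ctx), naive)
```

The memory cost of the table grows with N * V * H, which is why the trick is ordinarily restricted to layers fed directly by the embeddings; layers deeper in a standard stack depend on the full context and cannot be tabulated this way.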