On Multiplicative Integration with Recurrent Neural Networks

Wu, Yuhuai; Zhang, Saizheng; Zhang, Ying; Bengio, Yoshua; Salakhutdinov, Russ R.

doi:10.48550/arxiv.1606.06630

Cited by 11 publications

(4 citation statements)

References 16 publications

“…Surprisingly, there is no empirical difference between two options, before-Hadamard product and after-Hadamard product. This result may build a bridge to relate with studies on multiplicative integration with recurrent neural networks (Wu et al, 2016c).…”

Section: Non-linearity (mentioning)

confidence: 74%

“…(6) Note that using the activation function in low-rank bilinear pooling can be found in an implementation of simple baseline for the VQA dataset (Antol et al, 2015) without an interpretation of low-rank bilinear pooling. However, notably, Wu et al (2016c) studied learning behavior of multiplicative integration in RNNs with discussions and empirical evidences.…”

Section: Nonlinear Activation (mentioning)

confidence: 99%

“…Higher-order Boltzmann Machines (Memisevic & Hinton, 2007;2010) use Hadamard product to capture the interactions of input, output, and hidden representations for energy function. Wu et al (2016c) propose the recurrent neural networks using Hadamard product to integrate multiplicative interactions among hidden representations in the model. For details of these related works, please refer to Appendix D.…”

Section: Related Work (mentioning)

confidence: 99%

“…Note that, usually, x is an input state vector and h is an hidden state vector in recurrent neural networks. Wu et al (2016c) propose a new design to replace the additive expression with a multiplicative expression using Hadamard product as…”

Section: Appendix (mentioning)

confidence: 99%
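The citation statements above contrast an additive recurrent update with a multiplicative one built on the Hadamard (elementwise) product. The following is a minimal sketch of that contrast, not the authors' implementation. It assumes the simplest form of the idea, with tanh as the activation; the function names, matrix values, and the choice of tanh are illustrative assumptions, and the quoted equation itself is elided in the source.

```python
import math

def matvec(M, v):
    # Plain matrix-vector product on nested lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def additive_step(W, U, b, x, h):
    # Conventional update: tanh(Wx + Uh + b).
    wx, uh = matvec(W, x), matvec(U, h)
    return [math.tanh(a + c + bi) for a, c, bi in zip(wx, uh, b)]

def multiplicative_step(W, U, b, x, h):
    # Assumed simplest multiplicative form: tanh(Wx * Uh + b),
    # where * is the elementwise (Hadamard) product.
    wx, uh = matvec(W, x), matvec(U, h)
    return [math.tanh(a * c + bi) for a, c, bi in zip(wx, uh, b)]

# Hypothetical toy values: x is the input vector, h the previous hidden state.
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.5, 0.0], [0.0, 2.0]]
b = [0.0, 0.0]
x = [1.0, -1.0]
h = [2.0, 0.5]

h_add = additive_step(W, U, b, x, h)
h_mul = multiplicative_step(W, U, b, x, h)
```

In the multiplicative variant each component of Uh rescales the matching component of Wx, so the hidden state acts like a per-unit gate on the input rather than an offset.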

Hadamard Product for Low-rank Bilinear Pooling

Kim, On, Lim et al. 2016

Preprint

135

Bilinear models provide rich representations compared with linear models. They have been applied in various visual tasks, such as object recognition, segmentation, and visual question-answering, to get state-of-the-art performances taking advantage of the expanded representations. However, bilinear representations tend to be high-dimensional, limiting the applicability to computationally complex tasks. We propose low-rank bilinear pooling using Hadamard product for an efficient attention mechanism of multimodal learning. We show that our model outperforms compact bilinear pooling in visual question-answering tasks with the state-of-the-art results on the VQA dataset, having a better parsimonious property.
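The link between a low-rank bilinear form and the Hadamard product can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's model: it takes a bilinear weight matrix that factorizes as W = U Vᵀ and confirms that xᵀWy equals the sum of the elementwise product of Uᵀx and Vᵀy. All names and numbers are hypothetical, and the activation functions and output projection mentioned in the citation statements are omitted.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical factors: U is 3x2, V is 2x2, so W = U V^T has rank at most 2.
U = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
V = [[2.0, 0.0], [1.0, 1.0]]
x = [1.0, 2.0, -1.0]
y = [0.5, 4.0]

# Full bilinear form: x^T (U V^T) y, which materializes the 3x2 matrix W.
W = matmul(U, transpose(V))
full = sum(xi * wy for xi, wy in zip(x, matvec(W, y)))

# Low-rank form: project each input, then take the Hadamard product and sum.
ux = matvec(transpose(U), x)
vy = matvec(transpose(V), y)
hadamard = [a * b for a, b in zip(ux, vy)]
low_rank = sum(hadamard)
```

Both routes give the same scalar, but the second never builds W, which is where the parameter and compute savings come from when the input dimensions are large.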

Structured Pruning of Large Language Models

Wang, Wohlwend, Lei 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

101

Large language models have recently achieved state of the art performance across a wide variety of natural language tasks. Meanwhile, the size of these models and their latency have significantly increased, which makes their usage costly, and raises an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a generic, structured pruning approach by parameterizing each weight matrix using its low-rank factorization, and adaptively removing rank-1 components during training. On language modeling tasks, our structured approach outperforms other unstructured and block-structured pruning baselines at various compression levels, while achieving significant speedups during both training and inference. We also demonstrate that our method can be applied to pruning adaptive word embeddings in large language models, and to pruning the BERT model on several downstream fine-tuning classification benchmarks.
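To make "removing rank-1 components" concrete, here is a small sketch that assumes a weight matrix written as a sum of rank-1 terms, W = Σₖ zₖ · P[:, k] · Q[k, :], where z is a 0/1 mask. The mask here is fixed by hand purely for illustration; the abstract says components are removed adaptively during training, and that procedure is not reproduced. The names `reconstruct`, `P`, `Q`, and `mask` are hypothetical.

```python
def reconstruct(P, Q, mask):
    # W[i][j] = sum over k of mask[k] * P[i][k] * Q[k][j].
    # Setting mask[k] = 0 drops the k-th rank-1 component entirely.
    rows, cols = len(P), len(Q[0])
    return [
        [sum(mask[k] * P[i][k] * Q[k][j] for k in range(len(mask)))
         for j in range(cols)]
        for i in range(rows)
    ]

# Hypothetical 2x2 factors with two rank-1 components.
P = [[1.0, 2.0], [0.0, 1.0]]
Q = [[1.0, 1.0], [3.0, 0.0]]

W_full = reconstruct(P, Q, [1.0, 1.0])    # both components kept
W_pruned = reconstruct(P, Q, [1.0, 0.0])  # second component removed
```

Once a mask entry is zero, the matching column of P and row of Q can be deleted outright, so the pruned model stores and multiplies smaller dense matrices.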

Simple Recurrent Units for Highly Parallelizable Recurrence

Lei, Wang, Ding et al. 2018

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

194

133

Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable highly parallelized implementation, and comes with careful initialization to facilitate training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves 5-9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. We also obtain an average of 0.7 BLEU improvement over the Transformer model (Vaswani et al., 2017) on translation by incorporating SRU into the architecture.
