Character-Level Language Modeling with Deeper Self-Attention

Al‐Rfou, Rami; Choe, Dokook; Constant, Noah; Guo, Mandy; Jones, Llion

doi:10.1609/aaai.v33i01.33013159

Cited by 245 publications

(187 citation statements)

References 16 publications

Supporting

Mentioning

183

Contrasting


“…In the middle part of Table 2, we vary both the number of layers, feed-forward dimension, and the residual dimension. First of all, the 12-layer (12,4096,512,8) model outperforms the 6-layer (6,8192,512,8) model, while having similar number of parameters, which seems to indicate that the depth effectively benefits Transformer language models. We also train an extreme model which has only 2 layers with wide dimensions (2,8192,2048,8).…”

Section: Hyper-parameter Tuning (mentioning)

confidence: 85%

Language Modeling with Deep Transformers

et al. 2019


We explore deep autoregressive Transformer models in language modeling for speech recognition. We focus on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well configured Transformer models outperform our baseline models based on the shallow stack of LSTM recurrent neural network layers. We carry out experiments on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level and 10K byte-pair encoding subword-level language modeling. We apply our word-level models to conventional hybrid speech recognition by lattice rescoring, and the subword-level models to attention based encoder-decoder models by shallow fusion. Second, we show that deep Transformer language models do not require positional encoding. The positional encoding is an essential augmentation for the self-attention mechanism which is invariant to sequence ordering. However, in autoregressive setup, as is the case for language modeling, the amount of information increases along the position dimension, which is a positional signal by its own. The analysis of attention weights shows that deep autoregressive self-attention models can automatically make use of such positional information. We find that removing the positional encoding even slightly improves the performance of these models.
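The quoted statement says the 12-layer (12,4096,512,8) and 6-layer (6,8192,512,8) models have a similar number of parameters. A rough check is sketched below. It assumes each tuple reads (layers, feed-forward dimension, residual dimension, heads), that every block is a standard Transformer block with four attention projections and two feed-forward projections, and that biases, layer normalization, and embeddings are ignored. The function name `block_params` is hypothetical and this is not the cited authors' code.

```python
def block_params(num_layers, d_ff, d_model):
    """Count weights in the attention and feed-forward sublayers only.

    Assumption: standard Transformer block; biases, layer norm, and
    embedding/softmax matrices are left out of the count.
    """
    attention = 4 * d_model * d_model   # query, key, value, and output projections
    feed_forward = 2 * d_model * d_ff   # two linear maps: d_model -> d_ff -> d_model
    return num_layers * (attention + feed_forward)


# (layers, feed-forward dim, residual dim); the head count is omitted because
# splitting d_model across heads does not change the number of weights.
configs = {
    "12-layer": (12, 4096, 512),
    "6-layer": (6, 8192, 512),
    "2-layer wide": (2, 8192, 2048),
}

counts = {name: block_params(*cfg) for name, cfg in configs.items()}
for name, n in counts.items():
    print(f"{name}: {n / 1e6:.1f}M block weights")
```

Under these assumptions the two 512-dimensional configurations land within about 10% of each other, which is consistent with the "similar number of parameters" remark, while the 2-layer wide model is noticeably larger. With a 200K-word vocabulary the embedding matrices would add a large, configuration-dependent term that this sketch does not cover.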


“…Therefore, although the self-attention mechanism is less affected by the vanishing gradient problem compared to RNNs, the vanilla model is not able to fully exploit this optimization advantage. Second, though it is possible to use padding to respect the sentence or other semantic boundaries, in practice it has been standard practice to simply chunk long text into fixed-length segments due to improved efficiency (Peters et al., 2018; Devlin et al., 2018; Al-Rfou et al., 2018). However, simply chunking a sequence into fixed-length segments will lead to the context fragmentation problem as discussed in Section 1.…”

Section: Vanilla Transformer Language Models (mentioning)

confidence: 99%

“…For the first two blocks, we use a slightly smaller model (128M parameters). † indicates that the corresponding row is reduced to the same setting as the Transformer network in (Al-Rfou et al., 2018), except that two auxiliary losses are not implemented in our experiments. "PPL init" refers to using the same length as training.…”

Section: Relative Effective Context Length (mentioning)

confidence: 99%

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

Dai, Yang, et al. 2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

2,430

1,489


Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.
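The context fragmentation problem mentioned in the citation statement and abstract above can be illustrated with a toy calculation. The sketch below counts how many earlier tokens a position can attend to directly when text is chunked into fixed-length segments, with and without a cache of the previous segment's states. The function `visible_context` and its parameters are hypothetical names, the model is a single-layer simplification, and it is not the Transformer-XL implementation.

```python
def visible_context(position, segment_len, memory_len=0):
    """Earlier tokens directly attendable from `position` (0-indexed).

    Assumption: causal attention inside a fixed-length segment, plus up to
    `memory_len` cached positions from immediately before the segment.
    memory_len=0 models plain fixed-length chunking.
    """
    offset = position % segment_len        # tokens before it in its own segment
    segment_start = position - offset      # tokens that precede the whole segment
    return offset + min(memory_len, segment_start)


segment_len = 4
positions = range(12)

# Plain chunking: context resets to zero at every segment boundary.
chunked = [visible_context(p, segment_len) for p in positions]

# With a one-segment cache: boundary tokens keep the previous segment in view.
with_memory = [visible_context(p, segment_len, memory_len=4) for p in positions]

print("chunked:    ", chunked)
print("with memory:", with_memory)
```

In the chunked case, positions 4 and 8 start with no context at all, which is the fragmentation effect. With the cache, every position after the first segment sees at least one full segment of history; in a multi-layer model the reachable history would grow further with depth, which this sketch does not model.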


“…Furthermore, there is a study that predicts information from the middle layer of each layer of the language model and learns the errors occurring owing to the model [1]. The use of the information of the middle layer of transformer block is common to our research, but the information of each layer is not taken into account at the time of evaluation and is used only for learning.…”

Section: Using the Layer Representations (mentioning)

confidence: 99%

Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection

Kaneko, Komachi 2019

CyS


It is known that a deep neural network model pre-trained with large-scale data greatly improves the accuracy of various tasks, especially when there are resource constraints. However, the information needed to solve a given task can vary, and simply using the output of the final layer is not necessarily sufficient. Moreover, to our knowledge, exploiting large language representation models to detect grammatical errors has not yet been studied. In this work, we investigate the effect of utilizing information not only from the final layer but also from intermediate layers of a pre-trained language representation model to detect grammatical errors. We propose a multi-head multi-layer attention model that determines the appropriate layers in Bidirectional Encoder Representation from Transformers (BERT). The proposed method achieved the best scores on three datasets for grammatical error detection tasks, outperforming the current state-of-the-art method by 6.0 points on FCE, 8.2 points on CoNLL14, and 12.2 points on JFLEG in terms of F0.5. We also demonstrate that by using multi-head multi-layer attention, our model can exploit a broader range of information for each token in a sentence than a model that uses only the final layer's information.
