2019
DOI: 10.1609/aaai.v33i01.33013159

Character-Level Language Modeling with Deeper Self-Attention

Abstract: LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts. In this paper, we show that a deep (64-layer) transformer model (Vaswani et al. 2017) with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text…
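The abstract describes a character-level transformer language model that predicts the next character from a fixed-length context. Below is a minimal sketch of that setup, assuming PyTorch; the layer count, dimensions, and learned positional embedding are illustrative choices, not the paper's exact 64-layer configuration, and the paper's auxiliary losses are omitted here.

```python
# Minimal sketch (not the authors' code) of a fixed-context, character-level
# transformer language model in PyTorch. Dimensions are illustrative.
import torch
import torch.nn as nn

class CharTransformerLM(nn.Module):
    def __init__(self, vocab_size=256, context_len=512,
                 d_model=512, n_heads=8, d_ff=2048, n_layers=12):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(context_len, d_model)   # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_logits = nn.Linear(d_model, vocab_size)

    def forward(self, chars):                    # chars: (batch, seq) int64
        seq_len = chars.size(1)
        pos = torch.arange(seq_len, device=chars.device)
        x = self.char_emb(chars) + self.pos_emb(pos)
        # Causal mask: each position may attend only to itself and the past.
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                       device=chars.device), diagonal=1)
        h = self.encoder(x, mask=causal)
        return self.to_logits(h)                 # next-character logits

# Usage: predict the next character at every position of a fixed-length segment.
model = CharTransformerLM()
segment = torch.randint(0, 256, (2, 512))        # two segments of 512 bytes
logits = model(segment)                          # shape (2, 512, 256)
```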

Cited by 245 publications (187 citation statements)
References 16 publications
“…In the middle part of Table 2, we vary the number of layers, the feed-forward dimension, and the residual dimension. First of all, the 12-layer (12,4096,512,8) model outperforms the 6-layer (6,8192,512,8) model while having a similar number of parameters, which seems to indicate that depth effectively benefits Transformer language models. We also train an extreme model which has only 2 layers with wide dimensions (2,8192,2048,8).…”
Section: Hyper-parameter Tuning
confidence: 85%
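The excerpt above compares (layers, feed-forward dim, residual dim, heads) configurations and describes the first two as similar in size. A rough sanity check of that claim is sketched below, assuming standard transformer blocks and counting only attention and feed-forward weights (embeddings, biases, and layer norms ignored).

```python
# Back-of-the-envelope parameter counts for the (layers, d_ff, d_model)
# configurations quoted above: roughly 4*d_model^2 weights for attention and
# 2*d_model*d_ff for the feed-forward sublayer, per layer.
def approx_params(n_layers, d_ff, d_model):
    per_layer = 4 * d_model**2 + 2 * d_model * d_ff
    return n_layers * per_layer

for cfg in [(12, 4096, 512), (6, 8192, 512), (2, 8192, 2048)]:
    print(cfg, f"~{approx_params(*cfg) / 1e6:.0f}M parameters")
# (12, 4096, 512)  ~63M  -- 12-layer model
# (6, 8192, 512)   ~57M  -- 6-layer model with a wider feed-forward dimension
# (2, 8192, 2048)  ~101M -- extreme 2-layer, wide model
```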
“…Therefore, although the self-attention mechanism is less affected by the vanishing gradient problem than RNNs, the vanilla model is not able to fully exploit this optimization advantage. Second, though it is possible to use padding to respect sentence or other semantic boundaries, in practice it has been standard to simply chunk long text into fixed-length segments for improved efficiency (Peters et al., 2018; Devlin et al., 2018; Al-Rfou et al., 2018). However, simply chunking a sequence into fixed-length segments leads to the context fragmentation problem discussed in Section 1.…”
Section: Vanilla Transformer Language Models
confidence: 99%
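The excerpt describes the standard practice of cutting long text into fixed-length segments with no regard for sentence boundaries, which causes context fragmentation. A toy illustration of that chunking is sketched below; the segment length of 8 is chosen only for readability.

```python
# Fixed-length chunking as described in the excerpt: text is split into equal
# segments regardless of sentence boundaries, so characters near the start of
# each segment are modeled with little or no context ("context fragmentation").
def chunk_fixed(text, segment_len):
    return [text[i:i + segment_len] for i in range(0, len(text), segment_len)]

text = "The cat sat on the mat. The dog barked."
for seg in chunk_fixed(text, 8):
    print(repr(seg))
# 'The cat ' / 'sat on t' / 'he mat. ' / ... -- sentence boundaries are ignored,
# and a fixed-context model cannot attend across segment edges.
```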
“…For the first two blocks, we use a slightly smaller model (128M parameters). † indicates that the corresponding row is reduced to the same setting as the Transformer network in (Al-Rfou et al., 2018), except that the two auxiliary losses are not implemented in our experiments. "PPL init" refers to using the same context length as in training.…”
Section: Relative Effective Context Length
confidence: 99%
“…Furthermore, there is a study that makes predictions from the intermediate layers of the language model and trains on the errors those predictions incur [1]. Using the information from the intermediate layers of the transformer blocks is common to our work, but that study uses the per-layer information only for training and does not take it into account at evaluation time.…”
Section: Using the Layer Representations
confidence: 99%
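The excerpt refers to making predictions from intermediate layers and using them only during training, which matches the auxiliary-loss idea in the cited paper. The sketch below illustrates that idea under stated assumptions: the shared prediction head, the loss weighting, the dimensions, and the omission of positional encoding are illustrative choices, not either paper's exact scheme.

```python
# Sketch of intermediate-layer auxiliary losses: each layer's hidden state is
# projected to character logits, the intermediate losses are added only during
# training, and only the final layer's prediction is used at evaluation time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepLMWithAuxLosses(nn.Module):
    def __init__(self, vocab_size=256, d_model=256, n_heads=4,
                 d_ff=1024, n_layers=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)   # positions omitted for brevity
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
            for _ in range(n_layers))
        self.head = nn.Linear(d_model, vocab_size)     # shared prediction head

    def forward(self, chars, targets=None):
        seq_len = chars.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                       device=chars.device), diagonal=1)
        h = self.emb(chars)
        aux_losses = []
        for layer in self.layers:
            h = layer(h, src_mask=causal)
            if self.training and targets is not None:
                # Auxiliary loss: also predict the targets from this layer.
                aux_logits = self.head(h)
                aux_losses.append(
                    F.cross_entropy(aux_logits.transpose(1, 2), targets))
        final_logits = self.head(h)                    # used at evaluation time
        if self.training and targets is not None:
            loss = F.cross_entropy(final_logits.transpose(1, 2), targets)
            loss = loss + 0.5 * sum(aux_losses)        # illustrative weighting
            return final_logits, loss
        return final_logits
```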