Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1133
Star-Transformer

Abstract: Although Transformer has achieved great successes on many NLP tasks, its heavy structure with fully-connected attention connections leads to dependencies on large training data. In this paper, we present Star-Transformer, a lightweight alternative by careful sparsification. To reduce model complexity, we replace the fully-connected structure with a star-shaped topology, in which every two non-adjacent nodes are connected through a shared relay node. Thus, complexity is reduced from quadratic to linear, while p…
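The star topology described in the abstract can be pictured with a short sketch. The following is a simplified, single-head numpy illustration, not the authors' implementation (which uses multi-head attention with learned projections and a local window): each satellite token attends only to its ring neighbours, itself, and one shared relay node, and the relay attends to all satellites, so per-layer cost grows linearly with sequence length.

```python
# Minimal sketch of star-shaped (ring + relay) attention, assuming a single
# head and plain dot-product attention; names here are illustrative.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def star_attention_layer(h, relay):
    """h: (n, d) satellite states, relay: (d,) shared relay state."""
    n, d = h.shape
    scale = np.sqrt(d)
    new_h = np.empty_like(h)
    for i in range(n):
        # Context for token i: left neighbour, itself, right neighbour, relay.
        ctx = np.stack([h[(i - 1) % n], h[i], h[(i + 1) % n], relay])
        attn = softmax(h[i] @ ctx.T / scale)
        new_h[i] = attn @ ctx
    # The relay attends to every satellite, gathering global information.
    relay_attn = softmax(relay @ new_h.T / scale)
    new_relay = relay_attn @ new_h
    return new_h, new_relay

h = np.random.randn(6, 16)      # six tokens, hidden size 16
relay = h.mean(axis=0)          # relay initialised as the average token
h, relay = star_attention_layer(h, relay)
print(h.shape, relay.shape)     # (6, 16) (16,)
```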

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1
1
1

Citation Types

2
96
0

Year Published

2019
2019
2022
2022

Publication Types

Select...
4
3
2

Relationship

0
9

Authors

Journals

Cited by 168 publications (98 citation statements)
References 34 publications
“…Finally, we find that sparse models' attention distributions remain largely similar to their values in the dense model. This ability to reduce weights in attention modules while maintaining nearly identical representations affirms other lines of work (Guo et al, 2019;Wang et al, 2020). Of the three attention types, encoder-decoder is pruned least (3.4), varies most across sparsities, and exhibits most within-model, inter-layer heterogeneity (5.4.3).…”
Section: Discussion (supporting)
confidence: 83%
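The excerpt above concerns pruning attention weights while the attention distributions stay close to the dense model's. As a toy illustration only (random weights, not the cited experiments), one way to quantify that closeness is to magnitude-prune the query/key projections and measure the KL divergence between the dense and pruned attention rows:

```python
# Illustrative sketch: magnitude-prune an attention projection and check how
# far the attention distribution moves from the dense model's.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_probs(x, Wq, Wk):
    q, k = x @ Wq, x @ Wk
    return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)

def magnitude_prune(W, sparsity):
    # Zero out the smallest-magnitude fraction of weights.
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= thresh, W, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 32))                 # 10 tokens, hidden size 32
Wq = rng.normal(size=(32, 32)) / np.sqrt(32)  # scaled init keeps logits tame
Wk = rng.normal(size=(32, 32)) / np.sqrt(32)

dense = attention_probs(x, Wq, Wk)
sparse = attention_probs(x, magnitude_prune(Wq, 0.5), magnitude_prune(Wk, 0.5))

eps = 1e-12                                   # guard against log(0)
kl = np.sum(dense * np.log((dense + eps) / (sparse + eps)), axis=-1).mean()
print(f"mean KL(dense || pruned) = {kl:.4f}")
```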
“…There have been methods that aim to speed up neural CRF (Tu and Gimpel, 2018) and to solve the Markov constraint of neural CRF. In particular, one line of work predicts a sequence of labels as a sequence-to-sequence problem; Guo et al. (2019) further integrate global input information in encoding. These methods capture non-local dependencies between labels but are slower than CRF.…”
Section: Related Work (mentioning)
confidence: 99%
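The Markov constraint and speed trade-off mentioned in this excerpt come from the linear-chain structure of the CRF: Viterbi decoding scores every (previous label, current label) pair at each step, whereas an independent per-token softmax does not. A minimal numpy sketch of first-order Viterbi decoding, illustrative rather than taken from any cited system:

```python
# Sketch of linear-chain CRF Viterbi decoding; the transition matrix is what
# enforces the first-order Markov constraint the excerpt refers to.
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, L) per-token label scores; transitions: (L, L) scores."""
    T, L = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # Best previous label for every current label: O(T * L^2) overall,
        # versus O(T * L) for an independent per-token softmax tagger.
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(1)
print(viterbi(rng.normal(size=(5, 4)), rng.normal(size=(4, 4))))
```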
“…Model / Accuracy:
Plank et al. (2016): 97.22
Huang et al. (2015): 97.55
Ma and Hovy (2016): 97.55
(model name not recovered): 97.53
(model name not recovered): 97.51
Zhang et al. (2018c): 97.55
Yasunaga et al. (2018): 97.58
Xin et al. (2018): 97.58
Transformer-softmax (Guo et al., 2019): 97.04
BiLSTM-softmax: 97.51
BiLSTM-CRF: 97.51
BiLSTM-LAN: 97.65
As can be seen, a multi-layer model with larger hidden sizes does not give significantly better results compared to a 1-layer model with a hidden size of 400. We thus chose the latter for the final model.…”
Section: Development Experiments (mentioning)
confidence: 99%
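For context on the model-selection remark in this excerpt, here is a minimal PyTorch sketch of a BiLSTM-softmax tagger with the quoted configuration (1 layer, hidden size 400); everything except those two hyperparameters is an illustrative assumption, not the cited setup.

```python
# Minimal BiLSTM-softmax tagger sketch; only the depth (1 layer) and hidden
# size (400) come from the excerpt, the rest is illustrative.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=100, hidden=400):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Single bidirectional layer, hidden size 400 per direction (assumed).
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=1,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_labels)  # local softmax scores

    def forward(self, token_ids):
        states, _ = self.lstm(self.emb(token_ids))
        return self.out(states)          # (batch, seq_len, num_labels)

model = BiLSTMTagger(vocab_size=10000, num_labels=45)
logits = model(torch.randint(0, 10000, (2, 7)))   # toy batch of 2 sentences
print(logits.shape)                                # torch.Size([2, 7, 45])
```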
“…Since Transformer has become a promising model for diverse NLP tasks, there have been attempts to improve its architectural efficiency, with two main approaches. The first is to restrict dependencies between input tokens to reduce superfluous pair-wise calculations (Guo et al., 2019b; Sukhbaatar et al., 2019a). This approach provides time efficiency during inference, but it does not address the heavy parameterization of Transformer.…”
Section: Towards a Lightweight Transformer (mentioning)
confidence: 99%
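The first approach described in this excerpt, restricting dependencies between input tokens, can be pictured with a banded (local-window) attention mask. The sketch below is a generic illustration with an arbitrary window size, not any specific cited method:

```python
# Toy sketch: a banded attention mask keeps only nearby token pairs, cutting
# the O(n^2) pairwise computation the excerpt describes to roughly O(n * w).
import numpy as np

def banded_mask(n, window):
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n, window = 512, 3
mask = banded_mask(n, window)
full_pairs = n * n
kept_pairs = int(mask.sum())          # about n * (2 * window + 1)
print(f"full attention: {full_pairs} pairs, banded: {kept_pairs} pairs "
      f"({kept_pairs / full_pairs:.1%} of the dense cost)")
```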