Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1133
Star-Transformer

Abstract: Although Transformer has achieved great successes on many NLP tasks, its heavy structure with fully-connected attention connections leads to dependencies on large training data. In this paper, we present Star-Transformer, a lightweight alternative by careful sparsification. To reduce model complexity, we replace the fully-connected structure with a star-shaped topology, in which every two non-adjacent nodes are connected through a shared relay node. Thus, complexity is reduced from quadratic to linear, while p…
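The star topology described in the abstract can be pictured with a short sketch. The following is a simplified, single-head numpy illustration, not the authors' implementation (which uses multi-head attention with learned projections and a local window): each satellite token attends only to its ring neighbours, itself, and one shared relay node, and the relay attends to all satellites, so per-layer cost grows linearly with sequence length.

```python
# Minimal sketch of star-shaped (ring + relay) attention, assuming a single
# head and plain dot-product attention; names here are illustrative.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def star_attention_layer(h, relay):
    """h: (n, d) satellite states, relay: (d,) shared relay state."""
    n, d = h.shape
    scale = np.sqrt(d)
    new_h = np.empty_like(h)
    for i in range(n):
        # Context for token i: left neighbour, itself, right neighbour, relay.
        ctx = np.stack([h[(i - 1) % n], h[i], h[(i + 1) % n], relay])
        attn = softmax(h[i] @ ctx.T / scale)
        new_h[i] = attn @ ctx
    # The relay attends to every satellite, gathering global information.
    relay_attn = softmax(relay @ new_h.T / scale)
    new_relay = relay_attn @ new_h
    return new_h, new_relay

h = np.random.randn(6, 16)      # six tokens, hidden size 16
relay = h.mean(axis=0)          # relay initialised as the average token
h, relay = star_attention_layer(h, relay)
print(h.shape, relay.shape)     # (6, 16) (16,)
```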

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1
1
1

Citation Types

2
96
0

Year Published

2019
2019
2022
2022

Publication Types

Select...
4
3
2

Relationship

0
9

Authors

Journals

Cited by 168 publications (98 citation statements)
References 34 publications
“…Finally, we find that sparse models' attention distributions remain largely similar to their values in the dense model. This ability to reduce weights in attention modules while maintaining nearly identical representations affirms other lines of work (Guo et al, 2019;Wang et al, 2020). Of the three attention types, encoder-decoder is pruned least (3.4), varies most across sparsities, and exhibits most within-model, inter-layer heterogeneity (5.4.3).…”
Section: Discussion (supporting)
confidence: 83%
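The excerpt above concerns pruning attention weights while the attention distributions stay close to the dense model's. As a toy illustration only (random weights, not the cited experiments), one way to quantify that closeness is to magnitude-prune the query/key projections and measure the KL divergence between the dense and pruned attention rows:

```python
# Illustrative sketch: magnitude-prune an attention projection and check how
# far the attention distribution moves from the dense model's.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_probs(x, Wq, Wk):
    q, k = x @ Wq, x @ Wk
    return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)

def magnitude_prune(W, sparsity):
    # Zero out the smallest-magnitude fraction of weights.
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= thresh, W, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 32))                 # 10 tokens, hidden size 32
Wq = rng.normal(size=(32, 32)) / np.sqrt(32)  # scaled init keeps logits tame
Wk = rng.normal(size=(32, 32)) / np.sqrt(32)

dense = attention_probs(x, Wq, Wk)
sparse = attention_probs(x, magnitude_prune(Wq, 0.5), magnitude_prune(Wk, 0.5))

eps = 1e-12                                   # guard against log(0)
kl = np.sum(dense * np.log((dense + eps) / (sparse + eps)), axis=-1).mean()
print(f"mean KL(dense || pruned) = {kl:.4f}")
```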
“…There have been methods that aim to speed up neural CRF (Tu and Gimpel, 2018) and to solve the Markov constraint of neural CRF. In particular, one line of work predicts a sequence of labels as a sequence-to-sequence problem; Guo et al. (2019) further integrate global input information in encoding. These methods capture non-local dependencies between labels but are slower than CRF.…”
Section: Related Work (mentioning)
confidence: 99%
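The Markov constraint and speed trade-off mentioned in this excerpt come from the linear-chain structure of the CRF: Viterbi decoding scores every (previous label, current label) pair at each step, whereas an independent per-token softmax does not. A minimal numpy sketch of first-order Viterbi decoding, illustrative rather than taken from any cited system:

```python
# Sketch of linear-chain CRF Viterbi decoding; the transition matrix is what
# enforces the first-order Markov constraint the excerpt refers to.
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, L) per-token label scores; transitions: (L, L) scores."""
    T, L = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # Best previous label for every current label: O(T * L^2) overall,
        # versus O(T * L) for an independent per-token softmax tagger.
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(1)
print(viterbi(rng.normal(size=(5, 4)), rng.normal(size=(4, 4))))
```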
“…Model / Accuracy:
Plank et al. (2016): 97.22
Huang et al. (2015): 97.55
Ma and Hovy (2016): 97.55
(model name not recovered): 97.53
(model name not recovered): 97.51
Zhang et al. (2018c): 97.55
Yasunaga et al. (2018): 97.58
Xin et al. (2018): 97.58
Transformer-softmax (Guo et al., 2019): 97.04
BiLSTM-softmax: 97.51
BiLSTM-CRF: 97.51
BiLSTM-LAN: 97.65
As can be seen, a multi-layer model with larger hidden sizes does not give significantly better results compared to a 1-layer model with a hidden size of 400. We thus chose the latter for the final model.…”
Section: Development Experiments (mentioning)
confidence: 99%
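For context on the model-selection remark in this excerpt, here is a minimal PyTorch sketch of a BiLSTM-softmax tagger with the quoted configuration (1 layer, hidden size 400); everything except those two hyperparameters is an illustrative assumption, not the cited setup.

```python
# Minimal BiLSTM-softmax tagger sketch; only the depth (1 layer) and hidden
# size (400) come from the excerpt, the rest is illustrative.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=100, hidden=400):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Single bidirectional layer, hidden size 400 per direction (assumed).
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=1,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_labels)  # local softmax scores

    def forward(self, token_ids):
        states, _ = self.lstm(self.emb(token_ids))
        return self.out(states)          # (batch, seq_len, num_labels)

model = BiLSTMTagger(vocab_size=10000, num_labels=45)
logits = model(torch.randint(0, 10000, (2, 7)))   # toy batch of 2 sentences
print(logits.shape)                                # torch.Size([2, 7, 45])
```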
“…Since Transformer has become a promising model for diverse NLP tasks, there have been attempts to improve its architectural efficiency, with two main approaches. The first is to restrict dependencies between input tokens to reduce superfluous pair-wise calculations (Guo et al., 2019b; Sukhbaatar et al., 2019a). This approach provides time efficiency during inference, but it does not address the heavy parameterization of Transformer.…”
Section: Towards a Lightweight Transformer (mentioning)
confidence: 99%
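The first approach described in this excerpt, restricting dependencies between input tokens, can be pictured with a banded (local-window) attention mask. The sketch below is a generic illustration with an arbitrary window size, not any specific cited method:

```python
# Toy sketch: a banded attention mask keeps only nearby token pairs, cutting
# the O(n^2) pairwise computation the excerpt describes to roughly O(n * w).
import numpy as np

def banded_mask(n, window):
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n, window = 512, 3
mask = banded_mask(n, window)
full_pairs = n * n
kept_pairs = int(mask.sum())          # about n * (2 * window + 1)
print(f"full attention: {full_pairs} pairs, banded: {kept_pairs} pairs "
      f"({kept_pairs / full_pairs:.1%} of the dense cost)")
```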