2019
DOI: 10.48550/arxiv.1911.03898
Preprint

Understanding Multi-Head Attention in Abstractive Summarization

Cited by 4 publications (4 citation statements)
References 0 publications
Citation types: 4 mentioning, 0 supporting, 0 contrasting
Citing publications by year: 2021 (2), 2024 (2)
“…The module dynamically adjusted the degree of attention paid to different positions in the input sequence by weighting the attention scores. When dealing with application categories with imbalanced sample sizes, this characteristic of MHSA allowed the model to focus on the key features of small-sample categories, helping it capture the differences among traffic data from different application categories and improving its ability to recognize the small-sample categories [28].…”
Section: Multi-head Self-attention Module
Citation type: mentioning
confidence: 99%
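The mechanism this statement describes, attention weights that redistribute focus across input positions, is the standard scaled dot-product self-attention of Vaswani et al. A minimal NumPy sketch of that computation follows; the names and dimensions are illustrative and not taken from the cited work:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model). Row i of `weights` is the degree of
    # attention position i pays to every position in the sequence.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (seq_len, seq_len)
    return weights @ V, weights

# Toy usage with random projections (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))
Wq, Wk, Wv = (rng.standard_normal((16, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Each row of `weights` is a probability distribution over positions, which is the "degree of attention" the statement refers to.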
“…Recently, the attention mechanism has helped solve many research problems [40]. This mechanism is a core part of the Transformer architecture introduced by Vaswani et al., 2017 [41]. Multi-Head Attention is essential because, in simple self-attention, the model learns only one set of attention weights, which can be limiting.…”
Section: Preliminaries
Citation type: mentioning
confidence: 99%
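The limitation this statement mentions, that single-head self-attention learns only one set of attention weights, is easiest to see by contrast: with h heads and independent projections, the model produces h distinct weight matrices over the same sequence. A hedged sketch, again with illustrative names and dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
X = rng.standard_normal((seq_len, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

head_outputs, head_weights = [], []
for h in range(n_heads):
    # Independent projections per head -> an independent set of
    # attention weights per head over the same input sequence.
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d_head))  # (seq_len, seq_len)
    head_weights.append(A)
    head_outputs.append(A @ V)

# Concatenate head outputs and mix them with an output projection.
Wo = rng.standard_normal((d_model, d_model))
out = np.concatenate(head_outputs, axis=-1) @ Wo  # (seq_len, d_model)
print(len(head_weights), "distinct attention-weight matrices")
```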
“…Most existing big models adopt the deep Transformer as their basic architecture and inevitably encounter the over-parameterization problem. Early analyses of machine translation [476], abstractive summarization [477], and language understanding [478] have shown that a well-trained Transformer usually contains redundant parameters, part of which can be removed without loss of performance. Recently, a series of works has also discussed the over-parameterization problem in big models, covering the redundant heads in multi-head attention layers [445,478], the sparse-activation phenomenon in feed-forward network layers [479], and the parameter-redundancy problem of the whole Transformer [480,481].…”
Section: Model Analysis
Citation type: mentioning
confidence: 99%
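One common way to probe the head redundancy this statement describes, in the spirit of the head-pruning literature it points to rather than any single paper's exact method, is to ablate heads one at a time and measure how much the layer output changes; heads whose removal barely moves the output are candidates for pruning. A toy sketch with random (untrained) weights, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = rng.standard_normal((seq_len, d_model))
proj = [tuple(rng.standard_normal((d_model, d_head)) for _ in range(3))
        for _ in range(n_heads)]
Wo = rng.standard_normal((d_model, d_model))

def multi_head(X, head_mask):
    # head_mask[h] = 0 ablates head h's contribution entirely.
    outs = []
    for h, (Wq, Wk, Wv) in enumerate(proj):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_head))
        outs.append(head_mask[h] * (A @ V))
    return np.concatenate(outs, axis=-1) @ Wo

full = multi_head(X, np.ones(n_heads))
for h in range(n_heads):
    mask = np.ones(n_heads)
    mask[h] = 0.0
    delta = np.linalg.norm(multi_head(X, mask) - full)
    print(f"head {h}: output change when ablated = {delta:.3f}")
```

With random weights every head matters roughly equally; on a trained model one would measure the change in task performance rather than a raw output norm, and prune the heads whose ablation costs little.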