Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.269

How Does Selective Mechanism Improve Self-Attention Networks?

Abstract: Self-attention networks (SANs) with a selective mechanism have produced substantial improvements in various NLP tasks by concentrating on a subset of input words. However, the underlying reasons for their strong performance have not been well explained. In this paper, we bridge the gap by assessing the strengths of selective SANs (SSANs), which are implemented with a flexible and universal Gumbel-Softmax. Experimental results on several representative NLP tasks, including natural language inference, semantic role…
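
To make the selective mechanism in the abstract concrete, below is a minimal PyTorch sketch of one way such a gate could look: standard scaled dot-product self-attention is combined with a Gumbel-based binary "select vs. discard" decision per word (the binary special case of Gumbel-Softmax). The projection names (w_q, w_k, w_v, w_sel), the multiplicative gating, and the temperature value are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def gumbel_sigmoid(logits, tau=0.5):
    # Binary special case of Gumbel-Softmax: perturb with two independent
    # Gumbel(0, 1) noises and squash with a temperature-scaled sigmoid.
    u1 = torch.rand_like(logits).clamp_min(1e-9)
    u2 = torch.rand_like(logits).clamp_min(1e-9)
    g1 = -torch.log(-torch.log(u1))
    g2 = -torch.log(-torch.log(u2))
    return torch.sigmoid((logits + g1 - g2) / tau)

def selective_self_attention(x, w_q, w_k, w_v, w_sel, tau=0.5):
    # x: [batch, n, d]; w_q/w_k/w_v/w_sel: nn.Linear(d, d) projections.
    q, k, v = w_q(x), w_k(x), w_v(x)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # [batch, n, n]
    # Per query, decide (approximately in {0, 1}) which input words to keep.
    gate = gumbel_sigmoid(w_sel(x) @ k.transpose(-2, -1) / d ** 0.5, tau)
    attn = F.softmax(scores, dim=-1) * gate                # zero out discarded words
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return attn @ v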

Cited by 27 publications (12 citation statements)
References 35 publications
“…To alleviate it, we propose the context-aware approach to make the cross-attention pay more attention to source-side local words, which in turn improves the translation performance over several benchmarks. In future work, we will investigate selectively choosing the context (Geng et al., 2020) rather than the fixed window size. Besides, it is interesting to enhance the NAT model with extra signals, such as cross-lingual position embedding (Ding et al., 2020), larger context (Wang et al., 2017) and pre-trained initialization.…”
Section: Discussion (mentioning)
confidence: 99%
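
As a rough illustration of the mechanism this excerpt refers to, here is a minimal sketch of cross-attention restricted to source-side local words within a fixed window. The window size, the length-ratio alignment used to pick each window's center, and the hard masking are my assumptions; the excerpt describes the approach only at a high level.

import torch
import torch.nn.functional as F

def local_window_cross_attention(dec_h, enc_h, window_size=3):
    # dec_h: [batch, tgt_len, d] decoder states; enc_h: [batch, src_len, d] encoder states.
    b, tgt_len, d = dec_h.shape
    src_len = enc_h.size(1)
    scores = dec_h @ enc_h.transpose(-2, -1) / d ** 0.5     # [batch, tgt, src]
    # Crude monotonic alignment: target position t is centered on source
    # position round(t * src_len / tgt_len); only a fixed window around it is kept.
    tgt_pos = torch.arange(tgt_len, device=dec_h.device, dtype=torch.float)
    center = (tgt_pos * src_len / max(tgt_len, 1)).round().long().clamp(0, src_len - 1)
    src_pos = torch.arange(src_len, device=dec_h.device)
    outside = (src_pos[None, :] - center[:, None]).abs() > window_size  # [tgt, src]
    scores = scores.masked_fill(outside[None, :, :], float("-inf"))
    return F.softmax(scores, dim=-1) @ enc_h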
“…For the second approach, a new idea is to let each node automatically select the important information from the other nodes through an attention mechanism. Gumbel-Sigmoid [13] is therefore adopted to transform the attention matrix over the nodes into the adjacency matrix. Gumbel-Sigmoid is as follows:…”
Section: English Fake News Detection Task (mentioning)
confidence: 99%
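
The excerpt above truncates the Gumbel-Sigmoid definition it introduces. A plausible reconstruction, following the standard Gumbel-Sigmoid relaxation and the symbols (G', G'', τ) that the next excerpt goes on to describe, is given below; the symbol E for the pre-gating attention scores is mine, as the excerpts do not name it.

\[
\operatorname{Gumbel\text{-}Sigmoid}(E)
  = \operatorname{sigmoid}\!\left(\frac{E + G' - G''}{\tau}\right)
  = \frac{\exp\!\big((E + G')/\tau\big)}
         {\exp\!\big((E + G')/\tau\big) + \exp\!\big(G''/\tau\big)},
\qquad G', G'' \sim \operatorname{Gumbel}(0, 1).
\]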
“…Where G' and G'' are two independent Gumbel noises [14], and τ ∈ (0, ∞) is a temperature parameter. As τ approaches zero, the sample from the Gumbel-Sigmoid distribution becomes cold and resembles a one-hot sample. This approach is the same as applying the selective self-attention [13] over the whole dataset: Gumbel-Sigmoid is used to select which news items the attention for a given news item should be computed over.…”
Section: English Fake News Detection Task (mentioning)
confidence: 99%
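
A minimal sketch of how such a gate could turn pairwise attention scores between news items into an adjacency matrix, as these two excerpts describe; the variable names, shapes, and the 0.5 threshold for the hard adjacency are assumptions of mine, not details taken from the cited paper.

import torch

def gumbel_sigmoid(scores, tau=0.5):
    # Perturb with two independent Gumbel(0, 1) noises G' and G'',
    # then squash with a temperature-scaled sigmoid (near {0, 1} for small tau).
    u1 = torch.rand_like(scores).clamp_min(1e-9)
    u2 = torch.rand_like(scores).clamp_min(1e-9)
    g1, g2 = -torch.log(-torch.log(u1)), -torch.log(-torch.log(u2))
    return torch.sigmoid((scores + g1 - g2) / tau)

# scores: hypothetical [num_news, num_news] pairwise attention scores.
scores = torch.randn(8, 8)
soft_adj = gumbel_sigmoid(scores, tau=0.5)   # differentiable, pushed toward {0, 1}
adjacency = (soft_adj > 0.5).float()         # hard adjacency matrix for the news graph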
“…For self-attentional sentence encoding, Shen et al. (2018) train hard attention mechanisms which select a subset of tokens via policy gradient. Geng et al. (2020) investigate selective self-attention networks implemented with Gumbel-Sigmoid. Sparse attention has been found beneficial for performance (Malaviya et al., 2018; Peters et al., 2019; Correia et al., 2019; Indurthi et al., 2019; Maruf et al., 2019).…”
Section: Testing on WMT17 Tasks (mentioning)
confidence: 99%