Attention Meets Perturbations: Robust and Interpretable Attention With Adversarial Training

Kitada, Shunsuke; Iyatomi, Hitoshi

doi:10.1109/access.2021.3093456

Cited by 19 publications

(7 citation statements)

References 25 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…And we select the Binary Cross Entropy loss as D 1 and D 2 in (6). We compare our method with Vanilla attention (Wiegreffe and Pinter 2019), Word AT (Miyato, Dai, and Goodfellow 2016), Word iAT (Sato et al 2018), Attention RP (attention weight is trained with random perturbation), Attention AT and Attention iAT (Kitada and Iyatomi 2021).…”

Section: Methodsmentioning

confidence: 99%

See 1 more Smart Citation

SEAT: Stable and Explainable Attention

Liu

Liu³

et al. 2023

AAAI

View full text Add to dashboard Cite

Attention mechanism has become a standard fixture in many state-of-the-art natural language processing (NLP) models, not only due to its outstanding performance, but also because it provides plausible innate explanations for neural architectures. However, recent studies show that attention is unstable against randomness and perturbations during training or testing, such as random seeds and slight perturbation of embeddings, which impedes it from being a faithful explanation tool. Thus, a natural question is whether we can find an alternative to vanilla attention, which is more stable and could keep the key characteristics of the explanation. In this paper, we provide a rigorous definition of such an attention method named SEAT (Stable and Explainable ATtention). Specifically, SEAT has the following three properties: (1) Its prediction distribution is close to the prediction of the vanilla attention; (2) Its top-k indices largely overlap with those of the vanilla attention; (3) It is robust w.r.t perturbations, i.e., any slight perturbation on SEAT will not change the attention and prediction distribution too much, which implicitly indicates that it is stable to randomness and perturbations. Furthermore, we propose an optimization method for obtaining SEAT, which could be considered as revising the vanilla attention. Finally, through intensive experiments on various datasets, we compare our SEAT with other baseline methods using RNN, BiLSTM and BERT architectures, with different evaluation metrics on model interpretation, stability and accuracy. Results show that, besides preserving the original explainability and model performance, SEAT is more stable against input perturbations and training randomness, which indicates it is a more faithful explanation.

show abstract

Section: Methodsmentioning

confidence: 99%

“…There exists some work studying or improving either the stability or the robustness of attention from the explanation perspective. Recently, Kitada and Iyatomi (2021) propose a method to improve the robustness to perturbation of embedding vector for attention. Specifically, they adopt adversarial training during the training process.…”

Section: Related Workmentioning

confidence: 99%

SEAT: Stable and Explainable Attention

Liu

Liu³

et al. 2023

AAAI

View full text Add to dashboard Cite

show abstract

“…Therefore, Miyato et al [14] applied FGSM to NLP tasks by perturbing word embeddings rather than real text input, which is applicable in both supervised and semi-supervised scenarios, as it uses Virtual Adversarial Training (VAT) [25] in the latter. [26]- [28] proposed different works to add the perturbation to the attention mechanism of transformer-based methods instead of the word embeddings. To generate adversarial examples, Madry et al [24] adopted the multi-step approach in contrast to the single-step FGSM.…”

Section: A Adversarial Trainingmentioning

confidence: 99%

Adversarial Training for Fake News Classification

et al. 2022

View full text Add to dashboard Cite

News is a source of information to know about progress in the various areas of life all across the globe. However, the volume of this information is high, and getting benefits from the available information becomes difficult. Moreover, the frequency of fake news is increasing significantly and used to fulfill a particular agenda. This led to research on the classification of news to prevent the spread of disinformation. In this work, we use Adversarial Training as a means of regularization for fake news classification. We train two transformed-based encoder models using adversarial examples that help the model learn noise invariant representations. We generate these examples by perturbing the model's word embedding matrix, and then we fine-tune the model on clean and adversarial examples simultaneously. We train and evaluate the models on the Buzzfeed Political News and Random Political News datasets. Results show consistent improvements over the baseline models when we train models using adversarial examples. Experiments show that Adversarial Training improves the performance by 1.25% over the BERT baseline, 2.05% over the Longformer baseline for the Random Political News dataset, 1.25% over the BERT baseline and 0.9% over the Longformer baseline for Buzzfeed Political News dataset in terms of F1-score.

show abstract

“…[28] perturbed word embeddings instead of original input text using FGSM for NLP task. Some of the recent works [20,21,44] added perturbations to the attention mechanism of transformer methods using FGSM. [26] used multi-step FGSM to generate adversarial examples that proved more effective at the cost of computational overhead.…”

Section: Adversarial Trainingmentioning

confidence: 99%

A Novel Approach to Train Diverse Types of Language Models for Health Mention Classification of Tweets

Khan¹,

Razzak²,

Dengel³

et al. 2022

Preprint

View full text Add to dashboard Cite

Health mention classification deals with the disease detection in a given text containing disease words. However, non-health and figurative use of disease words adds challenges to the task. Recently, adversarial training acting as a means of regularization has gained popularity in many NLP tasks. In this paper, we propose a novel approach to train language models for health mention classification of tweets that involves adversarial training. We generate adversarial examples by adding perturbation to the representations of transformer models for tweet examples at various levels using Gaussian noise. Further, we employ contrastive loss as an additional objective function. We evaluate the proposed method on the PHM2017 dataset extended version. Results show that our proposed approach improves the performance of classifier significantly over the baseline methods. Moreover, our analysis shows that adding noise at earlier layers improves models' performance whereas adding noise at intermediate layers deteriorates models' performance. Finally, adding noise towards the final layers performs better than the middle layers noise addition.

show abstract

Attention Meets Perturbations: Robust and Interpretable Attention With Adversarial Training

Cited by 19 publications

References 25 publications

SEAT: Stable and Explainable Attention

SEAT: Stable and Explainable Attention

Adversarial Training for Fake News Classification

A Novel Approach to Train Diverse Types of Language Models for Health Mention Classification of Tweets

Contact Info

Product

Resources

About