An Empirical Study on the Relation Between Network Interpretability and Adversarial Robustness

Noack, Adam; Ahern, Isaac; Dou, Dejing; Li, Boyang

doi:10.1007/s42979-020-00390-x

“…Zhou et al (2020) propose to evaluate attribution methods through dataset modification. Noack et al (2021) show that image recognition models can achieve better adversarial robustness when they are trained to have interpretable gradients. To the best of our knowledge, we are the first to quantify the performance of rationale models under textual adversarial attacks and understand whether rationalization can inherently provide robustness.…”

Section: Related Work (mentioning)

confidence: 99%

Can Rationalization Improve Robustness?

Chen, Jacqueline, Narasimhan et al. 2022

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua

A growing line of work has investigated the development of neural NLP models that can produce rationales, i.e., subsets of input that can explain their model predictions. In this paper, we ask whether such rationale models can provide robustness to adversarial attacks in addition to their interpretable nature. Since these models need to first generate rationales ("rationalizer") before making predictions ("predictor"), they have the potential to ignore noise or adversarially added text by simply masking it out of the generated rationale. To this end, we systematically generate various types of 'AddText' attacks for both token and sentence-level rationalization tasks and perform an extensive empirical evaluation of state-of-the-art rationale models across five different tasks. Our experiments reveal that rationale models show promise in improving robustness but struggle in certain scenarios, e.g., when the rationalizer is sensitive to position bias or lexical choices of the attack text. Further, leveraging human rationales as supervision does not always translate to better performance. Our study is a first step towards exploring the interplay between interpretability and robustness in the rationalize-then-predict framework.

“…Explainability of robust models Robust models were reported to have more interpretable gradient images [5,35,37,44,60] than those of vanilla CNNs. However, it is not yet known whether this superiority in interpretability remains when state-of-the-art AM methods are used.…”

Section: Related Work (mentioning)

confidence: 99%

How explainable are adversarially-robust CNNs?

Nourelahi, Kotthoff, Chen et al. 2022

Preprint

Three important criteria of existing convolutional neural networks (CNNs) are (1) test-set accuracy; (2) out-of-distribution accuracy; and (3) explainability. While these criteria have been studied independently, their relationship is unknown. For example, do CNNs that have stronger out-of-distribution performance also have stronger explainability? Furthermore, most prior feature-importance studies only evaluate methods on 2-3 common vanilla ImageNet-trained CNNs, leaving it unknown how these methods generalize to CNNs of other architectures and training algorithms. Here, we perform the first large-scale evaluation of the relations of the three criteria using 9 feature-importance methods and 12 ImageNet-trained CNNs that span 3 training algorithms and 5 CNN architectures. We find several important insights and recommendations for ML practitioners. First, adversarially robust CNNs have a higher explainability score on gradient-based attribution methods (but not CAM-based or perturbation-based methods). Second, AdvProp models, despite being more accurate than both vanilla and robust models alone, are not superior in explainability. Third, among 9 feature attribution methods tested, GradCAM and RISE are consistently the best methods. Fourth, Insertion and Deletion are biased towards vanilla and robust models respectively, due to their strong correlation with the confidence score distributions of a CNN. Fifth, we did not find a single CNN to be the best in all three criteria, which interestingly suggests that CNNs are harder to interpret as they become more accurate.

An Empirical Study on the Relation Between Network Interpretability and Adversarial Robustness

Cited by 31 publications

References 25 publications
