2021
DOI: 10.48550/arxiv.2110.02797
Preprint

Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs

Abstract: Convolutional Neural Networks (CNNs) have become the de facto gold standard in computer vision applications in recent years. Recently, however, new model architectures have been proposed that challenge the status quo. The Vision Transformer (ViT) relies solely on attention modules, while the MLP-Mixer architecture substitutes the self-attention modules with Multi-Layer Perceptrons (MLPs). Despite their great success, CNNs have been widely known to be vulnerable to adversarial attacks, causing serious concerns fo…

Cited by 16 publications (21 citation statements)
References 52 publications
“…Recently, a number of works compare the robustness of ViTs to ResNets. While there are mixed findings on adversarial robustness [4,46], there is agreement that ViTs have stronger out-of-distribution generalization, likely due to self-attention [5,39]. In contrast, our work focuses on relative robustness to noise in foreground and background regions.…”
Section: Models
mentioning confidence: 89%
“…Although a rich literature exists on the robustness of CNNs in the medical imaging domain, to the best of our knowledge, no such study exists for ViTs, making it an exciting as well as challenging direction to explore. Recently, a few attempts have been made to evaluate the robustness of ViTs to adversarial attacks on natural images [407]- [416]. The main conclusion of these attempts, ignoring their nuanced differences, can be summarized as: ViTs are more robust to adversarial attacks than CNNs.…”
Section: Adversarial Robustness
mentioning confidence: 99%
“…In addition to competitive classification performance, two other recent works [113,110] explore the robustness of CNNs, Vision Transformers, and MLPs. CNNs have been widely known to be vulnerable to adversarial attacks [114,115]; that is, small additive perturbations of the input cause the CNN to misclassify a sample, raising serious concerns for security-sensitive applications.…”
Section: Model Performance
mentioning confidence: 99%
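The "small additive perturbation" described above is the core of gradient-based attacks such as FGSM (Goodfellow et al.), one of the standard attacks used in robustness comparisons like these. The sketch below is purely illustrative and is not the paper's exact setup: it applies an FGSM-style step to a toy logistic-regression classifier standing in for a full network, with all weights and the step size `eps` chosen only for demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, w, b, y, eps):
    """FGSM-style attack: step the input in the sign of the loss gradient.

    For logistic regression with binary cross-entropy loss, the gradient of
    the loss with respect to the input x is (p - y) * w, where p is the
    predicted probability of class 1.
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# Toy example: a clean input correctly classified as class 0.
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([-1.0, 1.0])   # w @ x + b = -3.0, so p is well below 0.5
y = 0.0

x_adv = fgsm_perturb(x, w, b, y, eps=1.2)
p_clean = sigmoid(w @ x + b)      # ~0.047: confidently class 0
p_adv = sigmoid(w @ x_adv + b)    # ~0.646: flipped to class 1
```

A small signed step per input dimension is enough to push the score across the decision boundary, which is exactly why the works cited here measure how much larger that step must be for ViTs and MLP-Mixers than for CNNs.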
“…But how about the MLP-Mixer? Benz et al. [110] compare ResNet, ViT, and MLP-Mixer under white-box attacks (Table 5), as well as under both query-based and transfer-based black-box attacks. In all cases, ViT is the most robust architecture, MLP-Mixer is next, and CNN is the least robust.…”
Section: Model Performance
mentioning confidence: 99%