2022
DOI: 10.48550/arxiv.2206.09325
Preprint

EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

Abstract: Motivated by biological evolution, this paper explains the rationality of the Vision Transformer by analogy with the proven, practical Evolutionary Algorithm (EA) and derives that both share a consistent mathematical formulation. Then, inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that contains only the proposed EA-based Transformer (EAT) block, which consists of three residual parts, i.e., Multi-Scale Region Aggregation (MSRA), Global and Local Interaction (GLI), and Feed-Forward Network (FFN)…
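To make the block structure described in the abstract concrete, below is a minimal PyTorch-style sketch of a block built from three residual sub-layers: a multi-scale aggregation step, a global/local interaction step, and a feed-forward network. The class name EATBlockSketch and all internals (dilated depth-wise convolutions for the aggregation step, a channel split between self-attention and a local convolution for the interaction step, and the normalization choices) are illustrative assumptions for readability, not the paper's actual MSRA or GLI designs.

```python
import torch
import torch.nn as nn


class EATBlockSketch(nn.Module):
    """Illustrative sketch of a block with three residual sub-layers
    (multi-scale aggregation, global/local interaction, feed-forward net).
    The internals are simplified placeholders, not the paper's design.
    `dim` should be divisible by 2 * num_heads."""

    def __init__(self, dim: int, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        # Multi-scale aggregation placeholder: depth-wise convs at several dilations, summed.
        self.msra_norm = nn.BatchNorm2d(dim)
        self.msra = nn.ModuleList([
            nn.Conv2d(dim, dim, 3, padding=d, dilation=d, groups=dim)
            for d in (1, 2, 3)
        ])
        # Global/local interaction placeholder: self-attention on one channel split,
        # depth-wise conv on the other, then concatenation.
        self.gli_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim // 2, num_heads, batch_first=True)
        self.local = nn.Conv2d(dim - dim // 2, dim - dim // 2, 3, padding=1,
                               groups=dim - dim // 2)
        # Feed-forward network: standard two-layer MLP.
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                               # x: (B, C, H, W)
        B, C, H, W = x.shape
        y = self.msra_norm(x)
        x = x + sum(conv(y) for conv in self.msra)      # multi-scale residual

        tokens = x.flatten(2).transpose(1, 2)           # (B, HW, C)
        t = self.gli_norm(tokens)
        g, l = t[..., : C // 2], t[..., C // 2:]
        g, _ = self.attn(g, g, g)                       # global attention path
        l = self.local(l.transpose(1, 2).reshape(B, -1, H, W))   # local conv path
        l = l.flatten(2).transpose(1, 2)
        tokens = tokens + torch.cat([g, l], dim=-1)     # interaction residual

        tokens = tokens + self.ffn(self.ffn_norm(tokens))         # FFN residual
        return tokens.transpose(1, 2).reshape(B, C, H, W)


if __name__ == "__main__":
    block = EATBlockSketch(dim=64)
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)   # torch.Size([2, 64, 14, 14])
```

The sketch keeps the spatial resolution unchanged so that stacking such blocks inside a pyramid stage is straightforward; how the actual EATFormer stages downsample and mix these parts is described in the paper itself.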

Cited by 5 publications (6 citation statements). References 68 publications.
“…There are mainly three different directions for Transformers in vision: representation learning as a feature extractor, vision-language modeling, and using object queries for downstream detection-related tasks. For the first aspect, ViTs [27], [58], [59], [60] have an advantage in modeling global-range relations among image patch features. Most recent works [61], [62] combine local CNN designs with ViTs.…”
Section: Related Work (mentioning), confidence: 99%
“…For the input face image I, with two different masks M_b and M_o, we can get two masked face images I_{M_b} and I_{M_o} as a positive pair; we expect that the model can recognize that they come from the same face image. We use the class token of the Vision Transformer [9], [27]-[29] as the identifying label and a teacher-student framework to get the predictive categorical distributions.…”
Section: Random Mask for Contrastive Learning (mentioning), confidence: 99%
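The quoted setup resembles masked-view contrastive distillation, and the snippet below is a rough sketch of that idea under assumptions not stated in the quote: the helpers random_block_mask and masked_pair_loss, the mask ratio, the temperatures, and the expectation that student and teacher are ViT-style encoders returning a class-token projection of shape (batch, classes) are all hypothetical choices, not the cited paper's implementation.

```python
import torch
import torch.nn.functional as F


def random_block_mask(x, mask_ratio=0.4, patch=16):
    """Zero out a random subset of non-overlapping patches (illustrative masking)."""
    B, C, H, W = x.shape
    gh, gw = H // patch, W // patch
    keep = torch.rand(B, 1, gh, gw, device=x.device) > mask_ratio
    keep = keep.float().repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    return x * keep


def masked_pair_loss(student, teacher, image, temp_s=0.1, temp_t=0.04):
    """Two differently masked views of one face image form a positive pair; the
    student's class-token distribution is trained to match the teacher's
    (cross-entropy against a detached teacher), in the spirit of the quoted setup."""
    view_b = random_block_mask(image)                      # masked view I_{M_b}
    view_o = random_block_mask(image)                      # masked view I_{M_o}
    with torch.no_grad():
        t = F.softmax(teacher(view_b) / temp_t, dim=-1)    # teacher targets
    s = F.log_softmax(student(view_o) / temp_s, dim=-1)    # student predictions
    return -(t * s).sum(dim=-1).mean()
```

In practice the teacher weights would typically be an exponential moving average of the student's, but that detail is outside the quoted statement and is left out of the sketch.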
“…Plain Vision Transformer. Since the Vision Transformer (ViT) [18] first successfully introduced the Transformer [91] structure into visual classification, massive improvements have subsequently been developed [92], [93], [94], [95], [96], [97], [98]. Benefiting from global dynamic modeling capabilities, the columnar plain ViT offers greater usability and practical value compared to more complex pyramidal structures.…”
Section: Related Work (mentioning), confidence: 99%
“…Thanks to the global modeling capability of Multi-Head Self-Attention (MHSA), ViT can simultaneously attend to distant low-frequency information and nearby high-frequency information [97], [110]. This is something that CNNs, with their local modeling manner, do not have.…”
Section: Advantage Explanation of ViT (mentioning), confidence: 99%