2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00008

GLiT: Neural Architecture Search for Global and Local Image Transformer

Cited by 66 publications (30 citation statements)
References 18 publications
“…The feature maps are firstly processed by a window-based local attention to aggregate information locally, and then passed through the deformable attention block to model the global relations between the local augmented tokens. This alternate design of attention blocks with local and global receptive fields helps the model learn strong representations, sharing a similar pattern in GLiT [5], TNT [15] and Point-…”
Section: Model Architectures (mentioning)
confidence: 97%
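The alternating local-then-global attention pattern described in the statement above can be made concrete with a short sketch. The code below is a hedged illustration in PyTorch, not code released with GLiT, TNT, or the citing paper; the module names (WindowAttention, GlobalAttention, LocalGlobalBlock) and parameters such as window_size are assumptions made for the example.

```python
# Hedged sketch: local window attention first, then global attention.
# Assumes h and w are divisible by window_size; names are illustrative.
import torch.nn as nn


class WindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows (local receptive field)."""
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, H*W, C) -> group tokens into windows, attend inside each window.
        b, n, c = x.shape
        ws = self.window_size
        x = x.view(b, h // ws, ws, w // ws, ws, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)
        x, _ = self.attn(x, x, x)
        x = x.view(b, h // ws, w // ws, ws, ws, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, n, c)
        return x


class GlobalAttention(nn.Module):
    """Full self-attention over all tokens (global receptive field)."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        return self.attn(x, x, x)[0]


class LocalGlobalBlock(nn.Module):
    """Aggregate information locally, then model global relations, with residuals."""
    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.local_attn = WindowAttention(dim, window_size, num_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.global_attn = GlobalAttention(dim, num_heads)

    def forward(self, x, h, w):
        x = x + self.local_attn(self.norm1(x), h, w)
        x = x + self.global_attn(self.norm2(x))
        return x
```

Stacking such blocks gives each stage a local receptive field followed by a global one, which is the shared pattern the citing paper attributes to GLiT, TNT, and its own design.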
“…Our method outperforms all of them at different target parameters and FLOPs. GLiT-S [6] improves accuracy (80.5%) over baseline DeiT-S but with an additional parameter increase and minimal FLOPs reduction. Our ViT- .…”
Section: Comparison To State-of-the-art Approaches (mentioning)
confidence: 96%
“…Due to the adaptation of evolutionary algorithm, a large search space cannot be explored. GLiT [6] introduces locality module to model local features along with the global features. But their method uses CNNs along with attention and performs an evolutionary search over global and local modules.…”
Section: Related Work (mentioning)
confidence: 99%
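The statement above summarizes GLiT's idea of letting each layer use either a convolution-based locality module or a global attention module, with an evolutionary search choosing the per-layer assignment. The sketch below is an assumption-laden illustration of that idea, not GLiT's actual implementation; LocalConvModule, GlobalAttnModule, build_net, and mutate are hypothetical names.

```python
# Hedged sketch: per-layer choice between a local (convolutional) module and a
# global (attention) module, with a toy evolutionary mutation step.
import random
import torch.nn as nn


class LocalConvModule(nn.Module):
    """Depth-wise 1D convolution over the token sequence as a locality module."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x):            # x: (B, N, C)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)


class GlobalAttnModule(nn.Module):
    """Standard multi-head self-attention as the global module."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        return self.attn(x, x, x)[0]


def build_net(genotype, dim=192):
    """genotype: list of 'local' / 'global' choices, one per layer."""
    layers = [LocalConvModule(dim) if g == "local" else GlobalAttnModule(dim)
              for g in genotype]
    return nn.Sequential(*layers)


def mutate(genotype, p=0.2):
    """One toy evolutionary-search step: randomly flip some layer choices."""
    return [random.choice(["local", "global"]) if random.random() < p else g
            for g in genotype]
```

An evolutionary search would repeatedly mutate and evaluate such genotypes, which is the search cost the citing paper points to when it notes that a large search space cannot be fully explored.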
“…As a result, we can eliminate some costly MSA layers to significantly improve the efficiency for ViT models while adding proper locality without heuristics. Recently, several one-shot NAS methods propose to include candidate MSA and convolutional operations separately to the search space [9,33,60], where each operation is formulated as a separate trainable path (see Figure 1 (a)). However, these multi-path methods can give rise to the expensive search cost and challenging optimization since all the candidate operations for each layer need to be maintained and updated independently during the search.…”
(mentioning)
confidence: 99%
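The multi-path formulation criticized in the statement above can be illustrated with a small supernet sketch: every layer keeps a candidate MSA path and a candidate convolutional path as independent trainable operations, and the search samples one path per layer. This is a hedged example under assumed names (MixedLayer, Supernet), not the code of any of the cited one-shot NAS methods.

```python
# Hedged sketch of a multi-path one-shot supernet: each layer holds all
# candidate operations as separate paths that must be maintained and updated
# over the course of the search.
import random
import torch.nn as nn


class MixedLayer(nn.Module):
    """Holds the candidate operations for one layer as independent paths."""
    def __init__(self, dim, num_heads=4, kernel_size=3):
        super().__init__()
        self.paths = nn.ModuleDict({
            "msa": nn.MultiheadAttention(dim, num_heads, batch_first=True),
            "conv": nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
        })

    def forward(self, x, choice):    # x: (B, N, C)
        if choice == "msa":
            return self.paths["msa"](x, x, x)[0]
        return self.paths["conv"](x.transpose(1, 2)).transpose(1, 2)


class Supernet(nn.Module):
    def __init__(self, depth=12, dim=192):
        super().__init__()
        self.layers = nn.ModuleList(MixedLayer(dim) for _ in range(depth))

    def forward(self, x, choices=None):
        # Sample one path per layer; all candidate paths still have to be kept
        # in memory and trained across iterations, which is the cost the
        # citing paper highlights for multi-path search spaces.
        choices = choices or [random.choice(["msa", "conv"]) for _ in self.layers]
        for layer, c in zip(self.layers, choices):
            x = layer(x, c)
        return x
```

Keeping every candidate path alive per layer is what makes this formulation memory-hungry and hard to optimize, which is the motivation the citing paper gives for its alternative.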