2021
DOI: 10.48550/arxiv.2108.03428
Preprint

PSViT: Better Vision Transformer via Token Pooling and Attention Sharing

Boyu Chen, Peixia Li, Baopu Li, et al.

Abstract: In this paper, we observe two levels of redundancies when applying vision transformers (ViT) for image recognition. First, fixing the number of tokens through the whole network produces redundant features at the spatial level. Second, the attention maps among different transformer layers are redundant. Based on the observations above, we propose a PSViT: a ViT with token Pooling and attention Sharing to reduce the redundancy, effectively enhancing the feature representation ability, and achieving a better speed-accuracy trade-off.
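
As a rough reading of the abstract, the two mechanisms (pooling tokens between stages and sharing attention maps across adjacent layers) could look like the PyTorch sketch below. This is not the authors' implementation: the module names, the strided-convolution pooling operator, and the way a cached attention map is passed to a later block are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TokenPool(nn.Module):
    """Reduce the number of tokens between stages (hypothetical sketch):
    a strided 1-D convolution halves the sequence length."""
    def __init__(self, dim, stride=2):
        super().__init__()
        self.pool = nn.Conv1d(dim, dim, kernel_size=stride, stride=stride)

    def forward(self, x):                      # x: (B, N, C)
        return self.pool(x.transpose(1, 2)).transpose(1, 2)   # (B, N // stride, C)


class SharedAttnBlock(nn.Module):
    """Self-attention block that can reuse an attention map computed by an
    earlier layer instead of recomputing it (hypothetical sketch)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.norm = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, shared_attn=None):
        B, N, C = x.shape
        q, k, v = self.qkv(self.norm(x)).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.heads, C // self.heads).transpose(1, 2)
                   for t in (q, k, v))
        if shared_attn is None:                # compute the attention map once ...
            attn = (q @ k.transpose(-2, -1)) * (C // self.heads) ** -0.5
            attn = attn.softmax(dim=-1)
        else:                                  # ... and reuse it in a later layer
            attn = shared_attn
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return x + self.proj(out), attn


tokens = torch.randn(1, 196, 384)              # 14 x 14 patch tokens
block_a, block_b = SharedAttnBlock(384), SharedAttnBlock(384)
x, attn = block_a(tokens)                      # first layer computes its attention map
x, _ = block_b(x, shared_attn=attn)            # adjacent layer reuses the same map
x = TokenPool(384)(x)                          # 196 -> 98 tokens
```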

Cited by 5 publications (6 citation statements)
References 25 publications

Citation statements (ordered by relevance):
“…However, such unstructured sparsity results in incompatibility with dense prediction tasks. Some structure-preserving token selection strategies were implemented via token pooling (Chen et al. 2021a) and a slow-fast updating (Xu et al. 2021).…”
Section: Efficient Vision Transformers (mentioning)
confidence: 99%
“…Many subsequent works address this issue by establishing a progressive shrinking pyramid that allows models to explicitly process low-level patterns. There is a group of approaches that merge tokens within each fixed window into one to reduce the number of tokens [8,24,37,46,63,66,76]. In contrast, the second group of methods drops this constraint and introduces more flexible selection scheme [9,43,50,77].…”
Section: Related Work (mentioning)
confidence: 99%
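
As a rough illustration of the window-merging strategy mentioned in the statement above (each fixed window of tokens collapsed into one), the sketch below assumes a square grid of patch tokens and a 2×2 averaging window; the function name and the choice of average pooling are illustrative, not taken from any particular cited method.

```python
import torch
import torch.nn.functional as F


def merge_tokens_2x2(x):
    """Collapse each fixed 2x2 window of patch tokens into one token by
    averaging (illustrative; assumes the N tokens form a square grid, no [cls])."""
    B, N, C = x.shape
    side = int(N ** 0.5)
    grid = x.transpose(1, 2).reshape(B, C, side, side)   # (B, C, H, W)
    merged = F.avg_pool2d(grid, kernel_size=2)           # (B, C, H/2, W/2)
    return merged.flatten(2).transpose(1, 2)             # (B, N/4, C)


tokens = torch.randn(1, 196, 384)                        # 14 x 14 grid of tokens
print(merge_tokens_2x2(tokens).shape)                    # torch.Size([1, 49, 384])
```
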
“…However, other tokens maintain the ability to express distinctive patterns and may delicately assist final prediction. Some works proposed to remove the [cls] token and construct a global token by integrating patch tokens via certain average pooling operation [8,37,46,51]. LV-ViT [29] explored the possibility to jointly utilize [cls] token and patch tokens.…”
Section: Related Work (mentioning)
confidence: 99%
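
The [cls]-free design mentioned in the statement above amounts to pooling the patch tokens into one global feature before the classification head; a minimal sketch, with assumed embedding and class counts:

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 384, 1000             # assumed sizes for illustration
head = nn.Linear(embed_dim, num_classes)

patch_tokens = torch.randn(2, 196, embed_dim)  # (B, N, C) patch tokens, no [cls]
global_token = patch_tokens.mean(dim=1)        # average-pool tokens into a global feature
logits = head(global_token)                    # (B, num_classes)
```
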
“…Recently, the pioneering work ViT [22] successfully applies the pure transformer-based architecture to computer vision, revealing the potential of transformer in handling visual tasks. Lots of follow-up studies are proposed [4,5,9,12,18,21,23,24,27-29,31,38,41,43,45,50,52,56,76,77,80,81,84]. Many of them analyze the ViT [15,17,26,32,44,55,69,73,75,82] and improve it via introducing locality to earlier layers [11,17,48,64,79,83,87].…”
Section: Related Work (mentioning)
confidence: 99%