2022
DOI: 10.48550/arxiv.2203.11987
Preprint

Learning Patch-to-Cluster Attention in Vision Transformer

Abstract: The Vision Transformer (ViT) model is built on the assumption of treating image patches as "visual tokens" and learning patch-to-patch attention. The patch-embedding-based tokenizer is a practical workaround and has a semantic gap with respect to its counterpart, the textual tokenizer. Patch-to-patch attention suffers from quadratic complexity and also makes learned ViT models non-trivial to explain. To address these issues in ViT models, this paper proposes to learn patch-to-cluster attention…

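The abstract is truncated, but the core idea it describes can be illustrated with a minimal sketch: instead of letting N patch tokens attend to each other (an N x N attention map, quadratic in N), the patches attend to a small set of M cluster tokens pooled from the patches themselves, giving an N x M map that is linear in N for fixed M. The PyTorch module below is an illustrative sketch of that general pattern under assumed design choices (a linear + softmax soft-assignment as the clustering step, the layer names, and the default of 49 clusters); it is not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class PatchToClusterAttention(nn.Module):
    """Sketch of patch-to-cluster attention.

    N patch tokens attend to M cluster tokens pooled from the patches,
    so the attention map is (N x M) instead of (N x N).
    """

    def __init__(self, dim, num_clusters=49, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        # Assumed lightweight clustering: soft-assign each patch to M clusters.
        self.to_assign = nn.Linear(dim, num_clusters)
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, N, C) patch tokens
        B, N, C = x.shape
        H, D = self.num_heads, self.head_dim

        # Soft assignment (B, N, M), normalized over patches, then
        # weighted pooling of patches into M cluster tokens (B, M, C).
        assign = self.to_assign(x).softmax(dim=1)
        clusters = torch.einsum('bnm,bnc->bmc', assign, x)
        M = clusters.shape[1]

        # Queries come from patches; keys/values come from cluster tokens.
        q = self.q(x).reshape(B, N, H, D).transpose(1, 2)                       # (B, H, N, D)
        k, v = self.kv(clusters).reshape(B, M, 2, H, D).permute(2, 0, 3, 1, 4)  # (B, H, M, D) each

        attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, H, N, M), linear in N
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)    # back to (B, N, C)
        return self.proj(out)

# Example: 14x14 = 196 patch tokens of width 384 attend to 49 cluster tokens.
x = torch.randn(2, 196, 384)
y = PatchToClusterAttention(dim=384, num_clusters=49)(x)
print(y.shape)  # torch.Size([2, 196, 384])
```

Because the attention map has only M columns, it can also be read as a per-cluster heat map over the image, which is the explainability angle the abstract alludes to.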
Cited by 1 publication (1 citation statement)
References: 33 publications

“…The attention mechanism is recognized as a potential means to enhance deep CNNs since it allows the network to selectively focus on the most important regions of an image while ignoring the ones with irrelevant parts. Currently, attention mechanisms are prevalent in various tasks, such as machine translation [ 42 ], object detection [ 43 ], and semantic segmentation [ 44 ]. More recently, multiple attention mechanisms have provided benefits in visual studies to improve convolutional network expression ability.…”
Section: Related Work
confidence: 99%
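As a concrete illustration of the kind of attention module the citing work refers to for CNNs, the sketch below is a squeeze-and-excitation style channel-attention block that reweights feature-map channels so the network emphasizes informative responses. It is an illustrative example of one common mechanism, not a module taken from the cited papers; the class name and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention for a CNN feature map."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Bottleneck MLP that produces one gate per channel.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))                  # squeeze: global average pool -> (B, C)
        w = self.fc(w).view(b, c, 1, 1)         # excitation: per-channel gates in [0, 1]
        return x * w                            # rescale the feature map

# Example: gate a 64-channel feature map.
feat = torch.randn(2, 64, 56, 56)
print(ChannelAttention(64)(feat).shape)         # torch.Size([2, 64, 56, 56])
```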