2023
DOI: 10.3389/fcomp.2023.1178450

Self-attention in vision transformers performs perceptual grouping, not attention

Abstract: Recently, a considerable number of studies in computer vision involve deep neural architectures called vision transformers. Visual processing in these models incorporates computational models that are claimed to implement attention mechanisms. Despite an increasing body of work that attempts to understand the role of attention mechanisms in vision transformers, their effect is largely unknown. Here, we asked if the attention mechanisms in vision transformers exhibit similar effects as those known in human visu…

Cited by 13 publications (4 citation statements)
References 93 publications
“…In contrast, current object-centric representation learning methods do not make any attempt to capture this configural structure. Instead, their operation can be better described as perceptual segregation through clustering (Locatello et al., 2020; Mehrani and Tsotsos, 2023). Critically, there is substantial evidence that even the simple single-part abstract contour shapes used in the current study are represented by the human visual system as configurations of relations between segments of approximately constant curvature (Baker and Kellman, 2018; Baker, Garrigan and Kellman, 2021; …).”
Section: Discussion
confidence: 91%
See 2 more Smart Citations
“…In contrast, current object-centric representation learning methods do not make any attempt to capture this configural structure. Instead, their operation can be better described as perceptual segregation through clustering (Locatello et al, 2020;Mehrani and Tsotsos, 2023). Critically, there is substantial evidence that even the simple single-part abstract contour shapes used in the current study are represented by the human visual system as configurations of relations between segments of approximate constant curvature (Baker and Kellman, 2018;Baker, Garrigan and Kellman, 2021;.…”
Section: Discussionmentioning
confidence: 91%
“…Although a wide range of DNN architectures can be classified as object-centric, a common attribute across these models is the use of attention mechanisms that isolate individual objects. For most models, this is done through what has been termed affinity-based attention (Adeli, Ahn, Kriegeskorte and Zelinsky, 2023a) or perceptual grouping (Mehrani and Tsotsos, 2023). This process can be described as a bottom-up clustering of the visual input based on perceptual features such as color, texture and position (Mehrani and Tsotsos, 2023; Locatello et al., 2020).…”
Section: Introduction
confidence: 99%
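The bottom-up, affinity-based grouping described in this excerpt can be illustrated with a toy sketch: cluster pixels by similarity of color and position. This is a minimal, hypothetical example (not the cited papers' actual method), using plain k-means over per-pixel feature vectors:

```python
# Hypothetical sketch of "perceptual segregation through clustering":
# pixels are grouped bottom-up by affinity of perceptual features
# (here, color and position), not by any learned attention weights.

def kmeans(points, centroids, iters=10):
    """Plain k-means on feature vectors; returns a cluster label per point."""
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean).
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Toy 4x2 "image": left half red, right half blue.
# Each pixel feature is (R, G, B, x, y) -- color plus position.
pixels = []
for y in range(2):
    for x in range(4):
        color = (1.0, 0.0, 0.0) if x < 2 else (0.0, 0.0, 1.0)
        pixels.append([*color, x / 3.0, float(y)])

labels = kmeans(pixels, centroids=[list(pixels[0]), list(pixels[-1])])
print(labels)  # left-half pixels fall in one group, right-half in the other
```

The grouping emerges purely from feature affinity: no labels, no task, just clustering, which is the sense in which the excerpt distinguishes this from attention.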
“…By splitting the convolution calculation process and integrating the self-attention (SA) method, it skillfully combines CNN and Transformer to expand feature representation capabilities. Subsequently, MobileViTv2 [39] proposed an SA method with linear time complexity, which greatly reduced the model's resource consumption. MobileViTv3 [40] improved the generalisation of the model by integrating contextual multi-scale features.…”
Section: Related Work
confidence: 99%
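The linear-time self-attention this excerpt attributes to MobileViTv2 replaces the O(n²) token-to-token score matrix with one scalar score per token against a single latent query. The sketch below is a hedged, simplified illustration of that "separable" idea (no ReLU or output projection, toy placeholder weights), not the paper's implementation:

```python
import math

# Hypothetical sketch of linear-complexity ("separable") self-attention:
# each token gets ONE scalar context score (vs. a full n x n matrix),
# so the whole operation costs O(n * d) rather than O(n^2 * d).

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def separable_attention(tokens, w_i, w_k, w_v):
    """tokens: n x d; w_i: d (latent query); w_k, w_v: d x d toy linear maps."""
    n, d = len(tokens), len(tokens[0])
    # 1) One scalar context score per token: O(n * d), not O(n^2).
    scores = softmax([sum(t * w for t, w in zip(tok, w_i)) for tok in tokens])
    # 2) Keys, pooled by the scores into a single global context vector.
    keys = [[sum(t * w_k[j][c] for j, t in enumerate(tok)) for c in range(d)]
            for tok in tokens]
    context = [sum(s * k[c] for s, k in zip(scores, keys)) for c in range(d)]
    # 3) Each value is modulated elementwise by the shared context.
    values = [[sum(t * w_v[j][c] for j, t in enumerate(tok)) for c in range(d)]
              for tok in tokens]
    return [[v * c for v, c in zip(val, context)] for val in values]

# Toy usage: 3 tokens of dimension 2, identity weights for clarity.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
out = separable_attention(tokens, w_i=[1.0, 1.0], w_k=eye, w_v=eye)
print(len(out), len(out[0]))  # -> 3 2
```

Because all tokens share the one pooled context vector, the per-token cost stays linear in sequence length, which is the resource saving the excerpt credits to this design.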