2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00427

Multi-class Token Transformer for Weakly Supervised Semantic Segmentation

Abstract: This paper proposes a new transformer-based framework to learn class-specific object localization maps as pseudo labels for weakly supervised semantic segmentation (WSSS). Inspired by the fact that the attended regions of the one-class token in the standard vision transformer can be leveraged to form a class-agnostic localization map, we investigate if the transformer model can also effectively capture class-specific attention for more discriminative object localization by learning multiple class tokens within…
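The core idea in the abstract, reading class-to-patch attention out of multiple learned class tokens to obtain one localization map per class, can be sketched roughly as follows. This is a minimal illustration, not the authors' released code; the attention-tensor layout, the function name, and the simple layer-averaging fusion are assumptions.

```python
import torch

def class_specific_maps(attn_weights, num_classes, grid_size):
    """Turn class-to-patch attention into per-class localization maps.

    Assumes `attn_weights` is a list of per-layer attention tensors of
    shape (B, heads, T, T), where the first `num_classes` tokens are
    class tokens and the remaining T - num_classes are patch tokens.
    """
    maps = 0.0
    for layer_attn in attn_weights:                      # fuse attention across layers
        a = layer_attn.mean(dim=1)                       # average heads -> (B, T, T)
        maps = maps + a[:, :num_classes, num_classes:]   # class-to-patch rows
    maps = maps / len(attn_weights)
    b, c, _ = maps.shape
    return maps.reshape(b, c, grid_size, grid_size)      # (B, C, H, W)

# toy usage: 12 layers, 3 heads, 2 class tokens + 196 patch tokens (14x14 grid)
attns = [torch.rand(1, 3, 2 + 196, 2 + 196) for _ in range(12)]
cams = class_specific_maps(attns, num_classes=2, grid_size=14)
print(cams.shape)  # torch.Size([1, 2, 14, 14])
```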

Cited by 161 publications (72 citation statements)
References 46 publications (76 reference statements)
“…For data augmentation, we apply mixup [17], color jitter, fog [6], snow [6], resizing, horizontal flipping, rotation, blur, noise, random erasing [19], and mask-level copy-paste. To obtain more vivid occlusion effects, we use MCTformer [15] to generate pseudo masks of the salient objects in the subset of ImageNet-1K. These masks are then used in the mask-level copy-paste strategy [5]: the masked objects are pasted onto the training images to simulate object occlusion.…”
Section: Methods
confidence: 99%
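The mask-level copy-paste step quoted above can be sketched as below. This is a minimal illustration under stated assumptions (equal-size uint8 images, a binary pseudo mask such as one produced by MCTformer), not the cited implementation [5].

```python
import numpy as np

def mask_copy_paste(src_img, src_mask, dst_img, rng=None):
    """Paste the pseudo-masked object from src_img onto dst_img.

    Assumes HxWx3 uint8 images of equal size and an HxW binary mask
    (1 = salient object), e.g. an MCTformer pseudo mask.
    """
    rng = rng or np.random.default_rng()
    out = dst_img.copy()
    ys, xs = np.nonzero(src_mask)
    if len(ys) == 0:
        return out                                  # nothing to paste
    # random offset that keeps the whole object inside the image bounds
    dy = int(rng.integers(-ys.min(), src_img.shape[0] - ys.max()))
    dx = int(rng.integers(-xs.min(), src_img.shape[1] - xs.max()))
    out[ys + dy, xs + dx] = src_img[ys, xs]         # simulate object occlusion
    return out
```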
“…The latter is connected to the class prediction via a BCE prediction loss. More recently, Vision Transformers [15] have emerged as an alternative for generating CAMs [58, 39]. Our method is the first to use only a ViT, without CAMs, to generate baseline pseudo-masks.…”
Section: Related Work
confidence: 99%
“…The DINO [5] downstream task segments foreground from background in single-class images, which differs from WSSS. Only recently has ViT contributed to WSSS, through MCTformer [58] and AFA [39], though both resort to CAMs. MCTformer exploits the ViT attention mechanism to obtain localization maps.…”
Section: Related Work
confidence: 99%
“…Recently, neural network design in natural language processing (NLP) has taken a completely different path, as the Transformer [18] has replaced recurrent neural networks as the dominant architecture. With the introduction of the Vision Transformer (ViT) [19], more and more researchers are applying Transformers to computer vision [20, 21, 22, 23, 24, 25]. The Transformer’s architecture and self-attention mechanism can better model spatial relationships and aggregate features at arbitrary locations.…”
Section: Introduction
confidence: 99%
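The quoted claim that self-attention aggregates features at arbitrary locations follows directly from its form: each output token is a softmax-weighted sum over all tokens, regardless of spatial distance. A generic single-head sketch, not code from the citing paper:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over token features.

    Each output token is a weighted combination of features from all
    spatial locations, which is what lets Transformers model long-range
    spatial relationships.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens (B, T, D)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)              # (B, T, T) token-to-token affinity
    return weights @ v                               # aggregate over all locations

# toy usage: 196 patch tokens with 64-dim features
x = torch.randn(2, 196, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # torch.Size([2, 196, 64])
```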