2022
DOI: 10.48550/arxiv.2208.08984
Preprint

Open-Vocabulary Panoptic Segmentation with MaskCLIP

Abstract: In this paper, we tackle a new computer vision task, open-vocabulary panoptic segmentation, which aims to perform panoptic segmentation (background semantic labeling + foreground instance segmentation) for arbitrary categories given as text-based descriptions. We first build a baseline method, without finetuning or distillation, to utilize the knowledge in the existing CLIP model. We then develop a new method, MaskCLIP, a Transformer-based approach that uses mask queries with the ViT-based CLIP backbone to perform…
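The baseline described in the abstract (a frozen CLIP used with neither finetuning nor distillation) can be sketched as classifying each proposed mask region with CLIP directly. Below is a minimal sketch, assuming masks come from an external class-agnostic mask generator; the prompt template, model size, and mask-then-crop recipe are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def classify_masks(image, masks, category_names):
    """Label each binary mask by scoring its masked crop against the
    open-vocabulary text embeddings. One CLIP forward pass per mask."""
    prompts = clip.tokenize([f"a photo of a {c}" for c in category_names]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(prompts).float()
        text_feat /= text_feat.norm(dim=-1, keepdim=True)

    img = np.array(image)
    labels = []
    for m in masks:  # m: (H, W) boolean array from the mask generator
        ys, xs = np.nonzero(m)
        crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
        crop[~m[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0  # blank background
        x = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(x).float()
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
        labels.append(category_names[(img_feat @ text_feat.T).argmax().item()])
    return labels
```

Note that this recipe costs one CLIP forward pass per mask, which is precisely the overhead that the masked-attention design discussed in the citation statements below is meant to avoid.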

Cited by 6 publications (23 citation statements) | References 26 publications
“…Besides, the mask generator is CLIP-unaware, further limiting its performance. MaskCLIP [10] improves the two-stage framework by progressively refining the predicted masks with the CLIP encoder and by applying masks in the attention layers to avoid multiple forward passes, a technique first introduced in [7]. However, MaskCLIP still needs a heavy mask generator, its initial mask prediction is also CLIP-unaware, and mask prediction and recognition remain coupled.…”
Section: Related Work (mentioning)
confidence: 99%
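The masked-attention idea referenced in this statement (applying masks inside the attention layers so that all regions share a single backbone forward pass) can be illustrated in plain PyTorch. This is a generic sketch under the assumption of one query token per mask; the Q/K/V projections are omitted, the function name is hypothetical, and it should not be read as the paper's exact attention layer:

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(mask_queries, patch_tokens, patch_masks, num_heads=8):
    """Each mask query attends only to ViT patches inside its own mask,
    so every region is recognized in a single backbone forward pass.

    mask_queries: (N, D) one query per predicted mask
    patch_tokens: (P, D) patch embeddings from the (frozen) CLIP ViT
    patch_masks:  (N, P) boolean; True where patch p falls inside mask n
                  (each mask is assumed to cover at least one patch)
    """
    N, D = mask_queries.shape
    d = D // num_heads
    q = mask_queries.view(N, num_heads, d).transpose(0, 1)   # (H, N, d)
    k = patch_tokens.view(-1, num_heads, d).transpose(0, 1)  # (H, P, d)
    v = k
    attn = (q @ k.transpose(-2, -1)) / d ** 0.5              # (H, N, P)
    # -inf outside the mask removes those patches from the softmax.
    attn = attn.masked_fill(~patch_masks, float("-inf"))
    return (F.softmax(attn, dim=-1) @ v).transpose(0, 1).reshape(N, D)
```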
“…Our approach is an end-to-end framework: mask prediction is lightweight and CLIP-aware, and mask recognition is decoupled from mask prediction. These differences allow our approach to leverage the capability of CLIP better than two-stage approaches [10,33].…”
Section: Related Work (mentioning)
confidence: 99%
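One common way to realize the decoupling described here is to keep the mask head class-agnostic and score regions afterwards by mask-pooling dense CLIP features against the text embeddings. The following is a minimal sketch under that assumption; the pooling recipe and temperature are illustrative, not necessarily this paper's recognition head:

```python
import torch

def mask_pooled_logits(dense_feat, masks, text_feat, tau=0.07):
    """Decoupled recognition: average per-patch CLIP features inside each
    class-agnostic mask, then score the pooled feature against the CLIP
    text embeddings of the open-vocabulary categories.

    dense_feat: (P, D) per-patch features from the CLIP image encoder
    masks:      (N, P) soft or binary masks over the P patches
    text_feat:  (C, D) CLIP text embeddings of the category prompts
    """
    m = masks.float()
    w = m / m.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # (N, P) pooling weights
    region = w @ dense_feat                              # (N, D) region features
    region = region / region.norm(dim=-1, keepdim=True)
    text = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (region @ text.T) / tau                       # (N, C) class logits
```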
“…• CLIP-based Segmentation. Many segmentation models directly adapt the pre-trained CLIP to pixel-level visual recognition tasks, including PhraseCut (Wu et al., 2020), OpenSeg (Ghiasi et al., 2022), CLIPSeg (Lüddecke and Ecker, 2022), ZS-Seg (Xu et al., 2021d), MaskCLIP (Zhou et al., 2022a), DenseCLIP (Rao et al., 2021), and MaskCLIP (Ding et al., 2022b). OpenSeg (Ghiasi et al., 2022) also performs model learning with class-agnostic mask annotations for generating mask proposals.…”
Section: VLP for Segmentation (mentioning)
confidence: 99%