2021
DOI: 10.48550/arxiv.2107.12518
Preprint

Segmentation in Style: Unsupervised Semantic Image Segmentation with StyleGAN and CLIP

Abstract: We introduce a method that automatically segments images into semantically meaningful regions without human supervision. The derived regions are consistent across different images and coincide with human-defined semantic classes on some datasets. In cases where semantic regions might be hard for humans to define and label consistently, our method is still able to find meaningful and consistent semantic classes. In our work, we use a pretrained StyleGAN2 [1] generative model: clustering in the feature space …
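The abstract's key step, clustering per-pixel features of a pretrained StyleGAN2 generator into pseudo-semantic classes, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the feature maps are mocked with random arrays, whereas in practice they would be intermediate generator activations (for example, captured with forward hooks and upsampled to a common resolution), and the function name and shapes are invented for the example.

```python
# Minimal sketch: k-means over per-pixel generator features.
import numpy as np
from sklearn.cluster import KMeans

def cluster_pixel_features(features: np.ndarray, n_classes: int) -> np.ndarray:
    """Cluster per-pixel feature vectors into pseudo-semantic classes.

    features: (N, H, W, C) array gathered from N generated images.
    Returns an (N, H, W) integer array of cluster assignments.
    """
    n, h, w, c = features.shape
    flat = features.reshape(-1, c)
    labels = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(flat)
    return labels.reshape(n, h, w)

# Random stand-in for real StyleGAN2 feature maps (assumed shapes:
# 8 images at 64x64 resolution with 512-dim features per pixel).
feats = np.random.randn(8, 64, 64, 512).astype(np.float32)
masks = cluster_pixel_features(feats, n_classes=6)
print(masks.shape)  # (8, 64, 64)
```

Because the same generator produces every image, clusters found this way tend to land on the same semantic parts across generations, which is what makes the derived regions consistent across images.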

Cited by 9 publications (14 citation statements)
References 17 publications
“…Although the best single model performance reaches an average precision of 0.6 for half of the classes, even 0.8 for some classes (Supplementary Table 3), rarer classes differed considerably from human expert performance (accuracy ~0.9 based on agreement between multi-annotators). Developments in self-supervised 39 , unsupervised 40 and few-shot 41 learning could potentially tackle high class imbalance and rare class or novelty detection to a much greater extent. However, despite these limitations, the winning models still demonstrated an ability to focus attention on the subcellular regions in which the proteins lie, and we see great advances in cell-level model attention compared with image-level model attention (Fig.…”
Section: Discussion (mentioning)
confidence: 99%
“…CLIP for generation/manipulation. The idea of multimodal feature space also inspires some recent works on generative models [9,10,31,33]. All of these works are related to ours in that the tools of pre-trained CLIP model and StyleGAN2 are employed.…”
Section: Related Work (mentioning)
confidence: 91%
“…Our LAFITE is different in two aspects: (i) The motivations and scenarios are different. Existing works focus on latent optimization [10], image manipulation [33], domain adaptation [9], image segmentation [31]. We present the first study on training text-to-image generation models without the requirement of paired captions.…”
Section: Related Work (mentioning)
confidence: 99%
“…Several approaches have also been proposed to enable the synthesis of segmentation masks alongside images generated by StyleGAN, either by using separate generator branches [24] or by exploiting the feature space of the generator [30,44]. The latter approach showcased the ability to generate high quality datasets of paired images and segmentation masks, with only a few annotated examples, and was aptly named DatasetGAN [44].…”
Section: Related Work (mentioning)
confidence: 99%
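The DatasetGAN-style approach quoted above (annotate a few generated images, then synthesize unlimited image/mask pairs from the generator's feature space) can be illustrated with a hedged sketch. Everything below is a stand-in: the features and annotations are random arrays, and a plain logistic regression replaces the per-pixel MLP ensemble reported for DatasetGAN.

```python
# Hypothetical sketch of few-shot mask synthesis from generator features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# A handful of "annotated" synthetic images: per-pixel generator
# features (K, H, W, C) and hand-drawn masks (K, H, W), mocked here.
K, H, W, C, N_CLASSES = 4, 32, 32, 128, 5
train_feats = rng.normal(size=(K, H, W, C)).astype(np.float32)
train_masks = rng.integers(0, N_CLASSES, size=(K, H, W))

# Train a per-pixel classifier on the few annotated examples.
clf = LogisticRegression(max_iter=200)
clf.fit(train_feats.reshape(-1, C), train_masks.reshape(-1))

# Auto-label a freshly generated image from its feature map,
# yielding an image/mask training pair with no extra human effort.
new_feats = rng.normal(size=(H, W, C)).astype(np.float32)
pred_mask = clf.predict(new_feats.reshape(-1, C)).reshape(H, W)
print(pred_mask.shape)  # (32, 32)
```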
“…Researchers are, therefore, increasingly looking into automatic techniques that allow for the generation of synthetic datasets that require no (or minimal) human intervention during the annotation process [8,24,30,44]. However, several challenges are associated with such an approach: (i) the synthetic (training) samples need to be as close as possible to the expected real-world data to allow for the trained model to perform well during deployment, (ii) the synthesis procedure must allow for the generation of large and diverse datasets that can cater to the data needs of modern deep learning models, and (iii) data annotations need to be produced automatically, without (or with minimal) supervision.…”
Section: Introduction (mentioning)
confidence: 99%