2021
DOI: 10.1109/tpami.2020.2992393

Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey

Cited by 1,447 publications (846 citation statements)
References 126 publications
“…Features extracted from pre-trained SSL models are known as Self Supervised Embeddings (SSE). SSL has become a prominent representation learning paradigm in both Natural Language Processing (NLP) and Computer Vision (CV) communities [15], [20], [26]. SSL algorithms have two stages.…”
Section: B. Multimodal Features Extracted From Pre-trained SSL Algorithms (mentioning)
confidence: 99%
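The two-stage pipeline this excerpt refers to (pretext pre-training on unlabeled data, then reusing the frozen encoder's outputs as embeddings) can be sketched roughly as follows. This is a minimal illustration with a made-up encoder and a rotation-prediction pretext task, not the setup of the citing paper or of any specific surveyed method.

```python
# Minimal two-stage SSL sketch (hypothetical encoder and pretext task).
import torch
import torch.nn as nn

# Stage 1: pre-train an encoder on a pretext task using unlabeled images.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                              # -> 32-dim feature vector
)
pretext_head = nn.Linear(32, 4)                # e.g. predict one of 4 rotations
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(pretext_head.parameters()), lr=1e-3)

images = torch.randn(8, 3, 64, 64)             # dummy unlabeled batch
rotation_labels = torch.randint(0, 4, (8,))    # pseudo-labels from the pretext task
loss = criterion(pretext_head(encoder(images)), rotation_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Stage 2: freeze the encoder and reuse its outputs as Self-Supervised
# Embeddings (SSE) for a downstream task.
encoder.eval()
with torch.no_grad():
    sse = encoder(images)                      # (8, 32) embeddings
```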
“…In contrast to previous work, we represent all input modalities (audio, video, and text) with deep features extracted from pre-trained Self-Supervised Learning (SSL) models, a powerful representation learning technique [15]-[17]. Although SSL features give powerful representations of the input modalities, fusing them before the final prediction is extremely challenging for the following reasons: 1) the high dimensionality of SSL embeddings; 2) the long sequence lengths of SSL features; 3) the mismatch in size and sequence length between SSL features extracted from different SSL models for different modalities. Although a simple concatenation seems like a viable option, the additional trainable parameters required to fully connect the high-dimensional SSL embeddings make the network…”
[Figure 1: Overview of the Self Supervised Embedding Fusion Transformer (SSE-FT)]
Section: Introduction (mentioning)
confidence: 99%
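The dimensionality and sequence-length mismatch described in this excerpt is the usual motivation for projecting each modality's SSL features into a shared space before fusing them. The sketch below illustrates that idea only; the layer sizes, mean pooling, and classifier are hypothetical and do not reproduce the SSE-FT architecture.

```python
# Illustrative sketch: SSL features from different modalities have different
# dimensions and sequence lengths, so each is projected to a shared size and
# pooled over its own length before fusion. All sizes here are made up.
import torch
import torch.nn as nn

audio = torch.randn(2, 500, 768)    # (batch, frames, dim) speech SSL features
video = torch.randn(2, 120, 1024)   # video SSL features: fewer steps, larger dim
text = torch.randn(2, 40, 768)      # text SSL features

d_model = 256
proj_audio = nn.Linear(768, d_model)
proj_video = nn.Linear(1024, d_model)
proj_text = nn.Linear(768, d_model)

# Project each modality to a common dimension, then average over its own
# sequence length so the fused vector has a fixed size per example.
fused = torch.cat(
    [proj_audio(audio).mean(dim=1),
     proj_video(video).mean(dim=1),
     proj_text(text).mean(dim=1)],
    dim=-1)                                   # (2, 3 * d_model)

classifier = nn.Linear(3 * d_model, 3)        # e.g. 3 output classes
logits = classifier(fused)
```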
“…Voxelization representation is widely used in 3D modeling tasks and other related 3D vision tasks. Examples include VoxelNet [17], VConv-DAE [18], and LightNet [19,20,21].…”
Section: Related Work (mentioning)
confidence: 99%
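As a rough illustration of the voxelization representation mentioned here, the snippet below quantizes a point cloud into a binary occupancy grid of the kind a 3D network can consume. The resolution and data are arbitrary, and the code is not taken from VoxelNet, VConv-DAE, or LightNet.

```python
# Toy voxelization: map a point cloud to a fixed-resolution occupancy grid.
import numpy as np

points = np.random.rand(1000, 3)              # dummy point cloud in [0, 1)^3
resolution = 32

# Map each (x, y, z) coordinate to a voxel index and mark that voxel occupied.
indices = np.clip((points * resolution).astype(int), 0, resolution - 1)
grid = np.zeros((resolution, resolution, resolution), dtype=np.uint8)
grid[indices[:, 0], indices[:, 1], indices[:, 2]] = 1

print(grid.sum(), "occupied voxels out of", resolution ** 3)
```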
“…Deep learning methods, particularly deep convolutional neural networks (CNNs) [20,21], have quickly become the preferred approach for processing medical images [22,23]. Large-scale datasets are usually required to train deep neural networks [24]. The ChestX-ray14 dataset, released by the National Institutes of Health (NIH) in 2017 [25], is known as one of the largest hospital-scale chest X-ray datasets.…”
Section: Introduction (mentioning)
confidence: 99%