2023
DOI: 10.1145/3577925

Self-Supervised Learning for Videos: A Survey

Abstract: The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos. Moreover, the use of human-generated annotations leads to models with biased learning and poor domain generalization and robustness. As an alternative, self-supervised learning provides a way for representation learning which does not require annotations and has shown promise in…

Cited by 62 publications (22 citation statements)
References 134 publications
“…Alternatively, many works learn to predict temporal transformations such as clip order [19,40,51,79], speed [5,8,82] and their combinations [32,48]. These self-supervised temporal representations are effective for classifying and retrieving coarse-grained actions but are challenged by downstream settings with subtle motions [62,70]. Other works utilize the multimodal nature of videos [1,2,20,23,49,52,57] and learn similarity with audio [1,2,52] and optical flow [20,23,54,77].…”
Section: Related Work (mentioning, confidence: 99%)
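To make the pretext tasks in this statement concrete: clip-order and speed prediction both apply a temporal transformation to an unlabeled clip and train the encoder to recognize which transformation was used. Below is a minimal PyTorch sketch of the speed variant; the speed set, clip length, and the `SpeedPredictor`/`make_speed_example` names are illustrative assumptions, not details from the cited works.

```python
import torch
import torch.nn as nn

SPEEDS = [1, 2, 4, 8]  # assumed candidate sampling rates; these are the classes

class SpeedPredictor(nn.Module):
    """A video encoder plus a linear head that classifies the playback speed."""
    def __init__(self, encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.encoder = encoder                   # any 3D video encoder (assumed)
        self.head = nn.Linear(feat_dim, len(SPEEDS))

    def forward(self, clip):                     # clip: (B, C, T, H, W)
        return self.head(self.encoder(clip))    # logits over the speed classes

def make_speed_example(video, clip_len=16):
    """Subsample a (C, T, H, W) video at a random speed; return clip and label."""
    label = int(torch.randint(len(SPEEDS), ()).item())
    stride = SPEEDS[label]
    max_start = max(video.shape[1] - clip_len * stride, 1)
    start = int(torch.randint(max_start, ()).item())
    idx = (torch.arange(clip_len) * stride + start).clamp(max=video.shape[1] - 1)
    return video[:, idx], label
```

Training then reduces to cross-entropy between the predicted and sampled speed labels; clip-order prediction follows the same recipe, with frame or clip permutations as the classes.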
“…This paper aims to learn self-supervised video representations, useful for distinguishing action classes. In a community effort to reduce the manual, expensive, and hard-to-scale annotations needed for many downstream deployment settings, the topic has witnessed tremendous progress in recent years [19,32,62,79], particularly through contrastive learning [16,56,58,61]. Contrastive approaches learn representations through instance discrimination [55], where the goal is to increase feature similarity between spatially and temporally augmented clips from the same video.…”
Section: Introduction (mentioning, confidence: 99%)
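The instance-discrimination objective described here is typically implemented with an InfoNCE-style loss. A minimal sketch, assuming a SimCLR-style formulation in which two augmented clips of each video in a batch form the positive pair and all other videos in the batch act as negatives:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of two augmentations of the same B videos."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature             # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries pair clips from the same video (positives);
    # off-diagonal entries pair different videos (negatives).
    return F.cross_entropy(logits, targets)
```

Minimizing this cross-entropy pulls the same-video similarities on the diagonal up and pushes the off-diagonal ones down, which is exactly the "increase feature similarity between augmented clips from the same video" behavior the quote describes.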
“…In multimodal learning, models process and integrate data from multiple modalities [5,6,45], with applications in visual and language learning [43], video understanding [46,47], and natural language understanding [29,30,35]. However, expensive human annotations are often required for effective training.…”
Section: Self-supervised Multimodal Learning (mentioning, confidence: 99%)
“…However, expensive human annotations are often required for effective training. Self-supervised learning [2,46,52,64] has addressed this by using one modality as a supervisory signal for another, such as masking elements in images or text and using information from the other modality to reconstruct the masked content [1,3].…”
Section: Self-supervised Multimodal Learning (mentioning, confidence: 99%)
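As a sketch of the masked cross-modal reconstruction idea: tokens from one modality (here video) are masked, and a decoder reconstructs them while attending to the paired second modality (audio or text), which thereby serves as the supervisory signal. The `CrossModalMAE` class, mask ratio, and tiny transformer decoder below are illustrative assumptions, not a specific published model.

```python
import torch
import torch.nn as nn

class CrossModalMAE(nn.Module):
    """Reconstruct masked video tokens conditioned on a second modality."""
    def __init__(self, dim=256, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))  # learned [MASK] token
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, video_tokens, other_tokens):
        # video_tokens: (B, N, D); other_tokens: (B, M, D) from the 2nd modality
        B, N, D = video_tokens.shape
        mask = torch.rand(B, N, device=video_tokens.device) < self.mask_ratio
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token, video_tokens)
        # Cross-attention to the other modality supplies the supervisory signal.
        recon = self.decoder(corrupted, memory=other_tokens)
        # Reconstruction loss is computed only on the masked positions.
        return ((recon - video_tokens) ** 2)[mask].mean()
```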
“…Numerous variants of GANs have been proposed and have demonstrated the ability to learn a disentangled representation [49,50,51,52,53], and were reported to have comparable performance to VAE-based methods [53]. A related field to disentangled representation learning is Self-Supervised Learning (SSL) [54,55,56]. Self-supervised learning provides a way to learn representations from unlabeled data.…”
Section: Introduction (mentioning, confidence: 99%)