Self-Supervised Learning for Videos: A Survey

Schiappa, Madeline C.; Rawat, Yogesh Singh; Shah, Mubarak

doi:10.1145/3577925

Cited by 62 publications

(22 citation statements)

References 134 publications

“…Alternatively, many works learn to predict temporal transformations such as clip order [19,40,51,79], speed [5,8,82] and their combinations [32,48]. These self-supervised temporal representations are effective for classifying and retrieving coarse-grained actions but are challenged by downstream settings with subtle motions [62,70]. Other works utilize the multimodal nature of videos [1,2,20,23,49,52,57] and learn similarity with audio [1,2,52] and optical flow [20,23,54,77].…”

Section: Related Work (mentioning)

confidence: 99%

“…This paper aims to learn self-supervised video representations, useful for distinguishing action classes. In a community effort to reduce the manual, expensive, and hard-to-scale annotations needed for many downstream deployment settings, the topic has witnessed tremendous progress in recent years [19,32,62,79], particularly through contrastive learning [16,56,58,61]. Contrastive approaches learn representations through instance discrimination [55], where the goal is to increase feature similarity between spatially and temporally augmented clips from the same video.…”

Section: Introduction (mentioning)

confidence: 99%

Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization

Thoker, Doughty, Snoek

2023

Preprint

We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos with identical local motion dynamics but an otherwise different appearance. We do so by adding synthetic motion trajectories to videos which we refer to as tubelets. By simulating different tubelet motions and applying transformations, such as scaling and rotation, we introduce motion patterns beyond what is present in the pretraining data. This allows us to learn a video representation that is remarkably data-efficient: our approach maintains performance when using only 25% of the pretraining videos. Experiments on 10 diverse downstream settings demonstrate our competitive performance and generalizability to new domains and fine-grained actions.
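The citation statements above describe contrastive instance discrimination: increasing feature similarity between augmented clips from the same video. The following is a minimal sketch of that idea using an InfoNCE-style loss, which is one common formulation and is assumed here; this page does not state which loss the cited works use. The function names, the toy two-dimensional features, and the temperature of 0.1 are all illustrative assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss over a batch of clip features.

    anchors[i] and positives[i] are assumed to be features of two
    augmented clips from the same video; positives[j] with j != i
    serve as negatives for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity to every candidate clip, scaled by temperature.
        logits = [
            sum(x * y for x, y in zip(anchor, candidate)) / temperature
            for candidate in p
        ]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching clip as the correct class.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy features (hypothetical): two videos, two augmented clips each.
clips_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(clips_a, [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce(clips_a, [[0.1, 0.9], [0.9, 0.1]])
```

In this toy setup the loss is small when each clip is most similar to its own video's second clip (`aligned`) and large when the pairing is swapped (`swapped`), which is the behavior instance discrimination relies on.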

“…In multimodal learning, models process and integrate data from multiple modalities [5,6,45], with applications in visual and language learning [43], video understanding [46,47], and natural language understanding [29,30,35]. However, expensive human annotations are often required for effective training.…”

Section: Self-supervised Multimodal Learning (mentioning)

confidence: 99%

“…However, expensive human annotations are often required for effective training. Self-supervised learning [2,46,52,64] has addressed this by using one modality as a supervisory signal for another, such as masking elements in images or text and using information from the other modality to reconstruct the masked content [1,3].…”

Section: Self-supervised Multimodal Learning (mentioning)

confidence: 99%

Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity and Relation Extraction

Hu, Chen, Liu et al. 2023

Proceedings of the 31st ACM International Conference on Multimedia

How can we better extract entities and relations from text? Using multimodal extraction with images and text obtains more signals for entities and relations, and aligns them through graphs or hierarchical fusion, aiding in extraction. Despite attempts at various fusions, previous works have overlooked many unlabeled image-caption pairs, such as NewsCLIPing. This paper proposes innovative pretraining objectives for entity-object and relation-image alignment, extracting objects from images and aligning them with entity and relation prompts for soft pseudo-labels. These labels are used as self-supervised signals for pre-training, enhancing the ability to extract entities and relations. Experiments on three datasets show an average 3.41% F1 improvement over prior SOTA. Additionally, our method is orthogonal to previous multimodal fusions, and using

“…Numerous variants of GAN have been proposed and demonstrated the ability to learn a disentangled representation [49, 50, 51, 52, 53] and were reported to have comparable performance to VAE-based methods [53]. A relative field with disentangled representation learning is Self-Supervised Learning (SSL) [54, 55, 56]. Self-supervised learning provides a way for learning representation from unlabeled data.…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised Learning of Disentangled Representation via Auto-Encoding: A Survey

Eddahmani, Pham, Napoléon et al. 2023

Sensors

In recent years, the rapid development of deep learning approaches has paved the way to explore the underlying factors that explain the data. In particular, several methods have been proposed to learn to identify and disentangle these underlying explanatory factors in order to improve the learning process and model generalization. However, extracting this representation with little or no supervision remains a key challenge in machine learning. In this paper, we provide a theoretical outlook on recent advances in the field of unsupervised representation learning with a focus on auto-encoding-based approaches and on the most well-known supervised disentanglement metrics. We cover the current state-of-the-art methods for learning disentangled representation in an unsupervised manner while pointing out the connection between each method and its added value on disentanglement. Further, we discuss how to quantify disentanglement and present an in-depth analysis of associated metrics. We conclude by carrying out a comparative evaluation of these metrics according to three criteria, (i) modularity, (ii) compactness and (iii) informativeness. Finally, we show that only the Mutual Information Gap score (MIG) meets all three criteria.
