Learning and Using the Arrow of Time

Wei, Donglai; Lim, Joseph J.; Zisserman, Andrew; Freeman, William T.

doi:10.1109/cvpr.2018.00840

Cited by 346 publications

(253 citation statements)

References 15 publications

“…Self-supervised learning defines a proxy task on unlabeled data and uses the pseudo-labels of that task to provide the model with supervisory signals. It is used in machine vision with proxy tasks such as predicting arrow of time [79], missing pixels [50], position of patches [14], image rotations [23], synthetic artifacts [33], image clusters [9], camera transformation in consecutive frames [3], rearranging shuffled patches [48], video colourization [73], and tracking of image patches [77] and has demonstrated promising results in learning and transferring visual features.…”

Section: Self-supervised Learning (mentioning)

confidence: 99%
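The proxy-task idea in the statement above can be illustrated with a minimal sketch for the arrow-of-time case: pseudo-labels come for free by playing each clip forward or backward. The function name `arrow_of_time_pairs`, the label encoding, and the toy string "frames" are assumptions made for illustration only; this is not the method of the cited paper.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")

# Hypothetical label encoding, chosen only for this sketch.
FORWARD, BACKWARD = 1, 0


def arrow_of_time_pairs(
    clips: Sequence[Sequence[Frame]],
) -> List[Tuple[List[Frame], int]]:
    """Build (clip, pseudo-label) examples from unlabeled clips.

    Each clip yields two examples: the original frame order labeled
    FORWARD and the reversed frame order labeled BACKWARD.
    """
    examples: List[Tuple[List[Frame], int]] = []
    for clip in clips:
        frames = list(clip)
        examples.append((frames, FORWARD))
        examples.append((frames[::-1], BACKWARD))
    return examples


# Toy usage: strings stand in for video frames.
clips = [["f0", "f1", "f2"], ["g0", "g1"]]
examples = arrow_of_time_pairs(clips)
```

A classifier trained on such pairs never sees a human annotation; the supervisory signal is the playback direction alone.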

Unsupervised Multi-Task Feature Learning on Point Clouds

Hassani; Haley

2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

We introduce an unsupervised multi-task model to jointly learn point and shape features on point clouds. We define three unsupervised tasks including clustering, reconstruction, and self-supervised classification to train a multi-scale graph-based encoder. We evaluate our model on shape classification and segmentation benchmarks. The results suggest that it outperforms prior state-of-the-art unsupervised models: in the ModelNet40 classification task, it achieves an accuracy of 89.1%, and in the ShapeNet segmentation task, it achieves an mIoU of 68.2 and an accuracy of 88.6%.

“…Self-supervised learning on video collections. Learning from video [2,10,15,17,21,22,30,31,35,40,42,47,52,62,64] is a powerful paradigm, as unlike with image collections, there is additional temporal and sequential information. The aim of self-supervised learning from video can be to learn to predict future frames [47], or to learn to predict depth [12,14,62].…”

Section: Related Work (mentioning)

confidence: 99%

Self-Supervised Learning of Class Embeddings from Video

Wiles; Koepke; Zisserman

2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

Self Cite

This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information. At train time, two frames of the same video of an object class (e.g. human upper body) are extracted and each encoded to an embedding. Conditioned on these embeddings, the decoder network is tasked to transform one frame into another. To successfully perform long range transformations (e.g. a wrist lowered in one image should be mapped to the same wrist raised in another), we introduce a hierarchical probabilistic network decoder model. Once trained, the embedding can be used for a variety of downstream tasks and domains. We demonstrate our approach quantitatively on three distinct deformable object classes (human full bodies, upper bodies, and faces) and show experimentally that the learned embeddings do indeed generalise. They achieve state-of-the-art performance in comparison to other self-supervised methods trained on the same datasets, and approach the performance of fully supervised methods.

“…Self-supervision for Action Recognition. Self-supervision methods learn representations from the temporal [13,59] and multi-modal structure of video [1,25], leveraging pretraining on a large corpus of unlabelled videos. Methods exploiting the temporal consistency of video have predicted the order of a sequence of frames [13] or the arrow of time [59].…”

Section: Related Work (mentioning)

confidence: 99%

Multi-Modal Domain Adaptation for Fine-Grained Action Recognition

Munro; Damen

2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

Fine-grained action recognition datasets exhibit environmental bias, where multiple video sequences are captured from a limited number of environments. Training a model in one environment and deploying in another results in a drop in performance due to an unavoidable domain shift. Unsupervised Domain Adaptation (UDA) approaches have frequently utilised adversarial training between the source and target domains. However, these approaches have not explored the multi-modal nature of video within each domain. In this work we exploit the correspondence of modalities as a self-supervised alignment approach for UDA in addition to adversarial alignment (Fig. 1). We test our approach on three kitchens from our large-scale dataset, EPIC-Kitchens [8], using two modalities commonly employed for action recognition: RGB and Optical Flow. We show that multi-modal self-supervision alone improves the performance over source-only training by 2.4% on average. We then combine adversarial training with multi-modal self-supervision, showing that our approach outperforms other UDA methods by 3%.
