Cycle-Contrast for Self-Supervised Video Representation Learning

Kong, Qing; Wei, Wenpeng; Deng, Ziwei; Yoshinaga, Tomoaki; Murakami, Tomokazu

doi:10.48550/arxiv.2010.14810

Cited by 7 publications

(8 citation statements)

References 23 publications


“…Also, the VSKD model tested with the left wrist accelerometer data performs better compared to the previous study where accelerometer data from six locations were used [19]. In Table 3, while accelerometer data from the phone is the only modality in the testing phase, the method achieves better F-score performance compared to [11,21] in which either video streams or accelerometer data from phone and watch was used in the testing phase. This validates that the VSKD approach can effectively learn knowledge from the video modality to improve the accuracy performance of sensor-based HAR.…”

Section: Results (mentioning)

confidence: 82%

Cross-modal Knowledge Distillation for Vision-to-Sensor Action Recognition

Ni, Sarbajna, Liu et al. 2021

Preprint


Human activity recognition (HAR) based on a multi-modal approach has recently been shown to improve the accuracy of HAR. However, the restricted computational resources of wearable devices, e.g., smartwatches, cannot directly support such advanced methods. To tackle this issue, this study introduces an end-to-end Vision-to-Sensor Knowledge Distillation (VSKD) framework. In this VSKD framework, only time-series data, i.e., accelerometer data, is needed from wearable devices during the testing phase. Therefore, this framework will not only reduce the computational demands on edge devices, but also produce a learning model that closely matches the performance of the computationally expensive multi-modal approach. In order to retain the local temporal relationship and facilitate visual deep learning models, we first convert time-series data to two-dimensional images by applying the Gramian Angular Field (GAF) based encoding method. We adopted ResNet18 and multi-scale TRN with BN-Inception as teacher and student network in this study, respectively. A novel loss function, named Distance and Angle-wised Semantic Knowledge loss (DASK), is proposed to mitigate the modality variations between the vision and the sensor domain. Extensive experimental results on the UTD-MHAD, MMAct and Berkeley-MHAD datasets demonstrate the effectiveness and competitiveness of the proposed VSKD model, which can be deployed on wearable sensors.
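The GAF encoding mentioned in the abstract above can be illustrated with a short sketch. This is a minimal pure-Python version of the summation variant (GASF), assuming min-max rescaling of the window to [-1, 1]. The function name `gramian_angular_field` and the sample window are hypothetical, and the exact preprocessing used in the VSKD paper is not stated here.

```python
import math


def gramian_angular_field(series):
    """Return the Gramian Angular Summation Field of a 1-D series.

    Assumption: the series is min-max rescaled to [-1, 1] before
    being mapped to angles; other rescaling ranges are possible.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant series carries no shape information; map it to 0.
        scaled = [0.0 for _ in series]
    else:
        scaled = [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]

    # Polar encoding: each rescaled value becomes an angle in [0, pi].
    # Clamp first to guard against floating-point overshoot.
    angles = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]

    # Entry (i, j) is cos(phi_i + phi_j), giving an n x n "image".
    return [[math.cos(a + b) for b in angles] for a in angles]


# Hypothetical accelerometer window of four samples.
window = [0.0, 0.5, 1.0, 0.5]
image = gramian_angular_field(window)
```

Each window of length n becomes an n x n matrix, which is what lets an image backbone consume accelerometer data. The matrix is symmetric, and its diagonal depends only on the individual rescaled samples.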


“…Under the linear probe setting, our method obtains the best results on both datasets. Specifically, our method with S3D and R3D-18 backbones outperforms contrastive learning based approaches, CBT [54] and CCL [34], respectively, by a large margin. Even when compared with MemDPC [23] which leverages two stream information (RGB and flow), with larger resolution, our method still shows significant advantages.…”

Section: Evaluation On Downstream Tasks (mentioning)

confidence: 88%

“…Recently, inspired by the success of contrastive learning in static image, a line of works expanded contrastive learning pipeline to video domain [17,50,44,64,41]. Typically, [22,23] employed InfoNCE loss for dense future prediction, [34,24] performed instance discrimination across different domains to boost video representation. Though contrastive self-supervised learning contributes to better representation, the temporal information in videos is not well leveraged.…”

Section: Self-supervised Video Representation Learning (mentioning)

confidence: 99%


Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization

Qian, Liu et al. 2021

Preprint


The crux of self-supervised video representation learning is to build general features from unlabeled videos. However, most recent works have mainly focused on high-level semantics and neglected lower-level representations and their temporal relationship which are crucial for general video understanding. To address these challenges, this paper proposes a multi-level feature optimization framework to improve the generalization and temporal modeling ability of learned video representations. Concretely, high-level features obtained from naive and prototypical contrastive learning are utilized to build distribution graphs, guiding the process of low-level and mid-level feature learning. We also devise a simple temporal modeling module from multi-level features to enhance motion pattern learning. Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can greatly improve the representation ability in video understanding. Code is available here.


“…The few direct extensions of SimCLR for video (Bai et al 2020; Qian et al 2020; Lorre et al 2020) target action recognition on few seconds short clips. Others integrate contrastive learning by bringing together next-frame feature predictions with actual representations (Kong et al 2020; Lorre et al 2020), using path-object tracks for bringing cycle-consistency (Wang, Zhou, and Li 2020), and considering multiple viewpoints (Sermanet et al 2018) or accompanying modalities like audio (Alwassel et al 2019) or text (Miech et al 2020). We are inspired by these works to develop contrastive learning for long-range segmentation.…”

Section: Introduction (mentioning)

confidence: 99%

Iterative Contrast-Classify For Semi-supervised Temporal Action Segmentation

Singhania, Rahaman, Yao 2021

Preprint


Temporal action segmentation classifies the action of each frame in (long) video sequences. Due to the high cost of framewise labeling, we propose the first semi-supervised method for temporal action segmentation. Our method hinges on unsupervised representation learning, which, for temporal action segmentation, poses unique challenges. Actions in untrimmed videos vary in length and have unknown labels and start/end times. Ordering of actions across videos may also vary. We propose a novel way to learn frame-wise representations from temporal convolutional networks (TCNs) by clustering input features with added time-proximity condition and multiresolution similarity. By merging representation learning with conventional supervised learning, we develop an "Iterative-Contrast-Classify (ICC)" semi-supervised learning scheme. With more labelled data, ICC progressively improves in performance; ICC semi-supervised learning, with 40% labelled videos, performs similar to fully-supervised counterparts. Our ICC improves MoF by {+1.8, +5.6, +2.5}% on Breakfast, 50Salads and GTEA respectively for 100% labelled videos.
