Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition

Ghadiyaram, Deepti; Tran, Du; Mahajan, Dhruv

doi:10.1109/cvpr.2019.01232

Cited by 280 publications

(235 citation statements)

References 67 publications

(121 reference statements)

Supporting

Mentioning

231

Contrasting


“…Analysis of results. Subject specific attributes such as male and bald are evidently more transferable from recognition (left columns of Table 1) than attributes that are related to Although this relationship has been noted by others, previous work used domain knowledge to determine which attributes are more transferable from identity [35], as others have done in other domains [20,38]. By comparison, our work shows how these relationships emerge from our estimation of transferability.…”

Section: Case Study: Identity To Facial Attributes (supporting)

confidence: 54%

Transferability and Hardness of Supervised Classification Tasks

Tran, Nguyen, Hassner (2019)

2019 IEEE/CVF International Conference on Computer Vision (ICCV)


We propose a novel approach for estimating the difficulty and transferability of supervised classification tasks. Unlike previous work, our approach is solution agnostic and does not require or assume trained models. Instead, we estimate these values using an information theoretic approach: treating training labels as random variables and exploring their statistics. When transferring from a source to a target task, we consider the conditional entropy between two such variables (i.e., label assignments of the two tasks). We show analytically and empirically that this value is related to the loss of the transferred model. We further show how to use this value to estimate task hardness. We test our claims extensively on three large scale data sets, CelebA (40 tasks), Animals with Attributes 2 (85 tasks), and Caltech-UCSD Birds 200 (312 tasks), together representing 437 classification tasks. We provide results showing that our hardness and transferability estimates are strongly correlated with empirical hardness and transferability. As a case study, we transfer a learned face recognition model to CelebA attribute classification tasks, showing state of the art accuracy for tasks estimated to be highly transferable.
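The conditional entropy between two label assignments mentioned in this abstract can be illustrated with a minimal sketch. It assumes the two tasks label the same examples and uses the plain empirical estimate of H(target | source) in nats; the function name and the toy labels are hypothetical, and the paper's exact estimator and normalization may differ.

```python
from collections import Counter
from math import log


def conditional_entropy(source_labels, target_labels):
    """Empirical H(target | source) in nats for two aligned label lists.

    Illustrative sketch only: a plug-in estimate from label counts,
    not necessarily the estimator used in the cited paper.
    """
    if len(source_labels) != len(target_labels):
        raise ValueError("label lists must have the same length")
    n = len(source_labels)
    if n == 0:
        raise ValueError("label lists must be non-empty")

    joint = Counter(zip(source_labels, target_labels))  # counts of (s, t) pairs
    source_marginal = Counter(source_labels)            # counts of s

    entropy = 0.0
    for (s, _t), count in joint.items():
        p_joint = count / n                    # P(s, t)
        p_cond = count / source_marginal[s]    # P(t | s)
        entropy -= p_joint * log(p_cond)
    return entropy


# Toy labels (hypothetical): the source fully determines the target ...
determined = conditional_entropy([0, 1, 0, 1], [1, 0, 1, 0])
# ... versus a source that carries no information about a balanced target.
uninformative = conditional_entropy([0, 0, 1, 1], [0, 1, 0, 1])
```

Under this estimate a value near zero means the source labels almost determine the target labels, while a value near the target's own entropy means the source labels tell us little about the target.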


“…Results and Discussion. Our results are presented in Table 6 and compared to state-of-the-art methods [69], [72], [96], [97], [98]. Our final model, using only RGB frames, achieves state-of-the-art results in comparison to all prior work, including those use optical flow [72], object detector [98] or audio data [69].…”

Section: Extension To Epic-kitchens Dataset (mentioning)

confidence: 92%

In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video

Liu, Rehg (2018)

Lecture Notes in Computer Science

244

296


We address the task of jointly determining what a person is doing and where they are looking based on the analysis of video captured by a headworn camera. To facilitate our research, we first introduce the EGTEA Gaze+ dataset. Our dataset comes with videos, gaze tracking data, hand masks and action annotations, thereby providing the most comprehensive benchmark for First Person Vision (FPV). Moving beyond the dataset, we propose a novel deep model for joint gaze estimation and action recognition in FPV. Our method describes the participant's gaze as a probabilistic variable and models its distribution using stochastic units in a deep network. We further sample from these stochastic units, generating an attention map to guide the aggregation of visual features for action recognition. Our method is evaluated on our EGTEA Gaze+ dataset and achieves a performance level that exceeds the state-of-the-art by a significant margin. More importantly, we demonstrate that our model can be applied to a larger scale FPV dataset, EPIC-Kitchens, even without using gaze, offering new state-of-the-art results on FPV action recognition.


“…Our ip-CSN-152 is still 0.6% lower than SlowFast augmented with Non-Local Networks. Finally, recent work [13] has shown that R(2+1)D can achieve strong performance when pre-trained on a large-scale weakly-supervised dataset. We pre-train/finetune ir- and ip-CSN-152 on the same dataset and compare it with R(2+1)D-152 (the last three rows of Table 5).…”

Section: Comparison With the State-of-the-art (mentioning)

confidence: 98%

Video Classification With Channel-Separated Convolutional Networks

Tran, Wang, Feiszli, et al. (2019)

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite

568

343


Group convolution has been shown to offer great computational savings in various 2D convolutional architectures for image classification. It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks. This paper studies the effects of different design choices in 3D group convolutional networks for video classification. We empirically demonstrate that the amount of channel interactions plays an important role in the accuracy of 3D group convolutional networks. Our experiments suggest two main findings. First, it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions as this leads to improved accuracy and lower computational cost. Second, 3D channel-separated convolutions provide a form of regularization, yielding lower training accuracy but higher test accuracy compared to 3D convolutions. These two empirical findings lead us to design an architecture, the Channel-Separated Convolutional Network (CSN), which is simple, efficient, yet accurate. On Sports1M, Kinetics, and Something-Something, our CSNs are comparable with or better than the state-of-the-art while being 2-3 times more efficient.
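To make the cost argument in this abstract concrete, the sketch below compares weight counts for a single layer. It assumes the factorization is realized as a pointwise 1x1x1 convolution (channel interactions) followed by a depthwise k x k x k convolution (spatiotemporal interactions); that layout, the function names, and the channel sizes are illustrative assumptions, and biases are ignored. The per-layer ratio is not the network-level "2-3 times" figure quoted above, since a full network contains many layers that are not factorized this way.

```python
def standard_conv3d_params(c_in, c_out, k):
    """Weights in a dense k x k x k 3D convolution (no bias)."""
    return c_in * c_out * k ** 3


def channel_separated_params(c_in, c_out, k):
    """Weights in an assumed pointwise-then-depthwise factorization (no bias).

    The 1x1x1 convolution mixes channels; the depthwise k x k x k
    convolution applies one spatiotemporal filter per output channel.
    """
    pointwise = c_in * c_out   # channel interactions only
    depthwise = c_out * k ** 3  # spatiotemporal interactions only
    return pointwise + depthwise


# Hypothetical layer size chosen for illustration.
dense = standard_conv3d_params(64, 64, 3)
separated = channel_separated_params(64, 64, 3)
```

For this layer size the dense convolution has 110592 weights and the factorized version has 5824, which shows where the savings come from when channel mixing and spatiotemporal filtering are handled by separate operations.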

