Is Space-Time Attention All You Need for Video Understanding?

Bertasius, Gedas; Wang, Heng; Torresani, Lorenzo

doi:10.48550/arxiv.2102.05095

Cited by 393 publications

(340 citation statements)

References 34 publications

Supporting

Mentioning

337

Contrasting


“…Recently, Transformer-based models [38,67,83,90] have achieved promising performance in various vision tasks, such as image recognition [6,14,21,39,50,51,52,75,90] and image restoration [11,40,89]. Some methods have tried to use Transformer for video modelling by extending the attention mechanism to the temporal dimension [2,3,38,53,60]. However, most of them are designed for visual recognition, which are fundamentally different from restoration tasks.…”

Section: Vision Transformer (mentioning)

confidence: 99%

VRT: A Video Restoration Transformer

Liang, Cao, Yi et al. 2022

Preprint


Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Different from single image restoration, video restoration generally requires to utilize temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle with this by exploiting a sliding window strategy or a recurrent architecture, which either is restricted by frame-by-frame restoration or lacks long-range modelling ability. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips, on which mutual attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. Besides, parallel warping is used to further fuse information from neighboring frames by parallel feature warping. Experimental results on three tasks, including video super-resolution, video deblurring and video denoising, demonstrate that VRT outperforms the state-of-the-art methods by large margins (up to 2.16dB) on nine benchmark datasets.
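The clip partitioning and per-layer shifting that this abstract describes can be pictured with a small sketch. This is only an illustration under stated assumptions: the function name `partition_into_clips`, the shift amount (half a clip), and the cyclic wrap-around are hypothetical choices made here, not details given in the abstract, and no attention is computed.

```python
from typing import List


def partition_into_clips(num_frames: int, clip_size: int, layer: int) -> List[List[int]]:
    """Group frame indices into clips of `clip_size` frames.

    On odd-numbered layers the frame sequence is shifted first, so that
    frames from neighbouring clips of the previous layer share a clip.
    """
    if clip_size <= 0 or num_frames % clip_size != 0:
        raise ValueError("num_frames must be a positive multiple of clip_size")

    # Assumption: shift by half a clip on every other layer.
    shift = clip_size // 2 if layer % 2 == 1 else 0

    frames = list(range(num_frames))
    # Assumption: the shift is cyclic, so the last clip wraps around.
    shifted = frames[shift:] + frames[:shift]

    return [shifted[i:i + clip_size] for i in range(0, num_frames, clip_size)]


if __name__ == "__main__":
    for layer in range(2):
        print(layer, partition_into_clips(6, 2, layer))
```

With six frames and two-frame clips, the even layer pairs frames (0,1), (2,3), (4,5), while the odd layer pairs (1,2), (3,4), (5,0), which is how information can cross the earlier clip boundaries.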


“…Early works on HowTo100M have focused on leveraging this large collection for learning models that can be transferred to other tasks, such as action recognition [4,37,38], video captioning [24,36,66], or text-video retrieval [7,37,61]. The problem of recognizing the task performed in the instructional video has been considered by Bertasius et al [8]. However, their proposed approach does…”

Section: Related Work (mentioning)

confidence: 99%

“…The similarity between two embedding vectors is chosen to be the dot product between the two vectors. We use a total of S = 10,588 steps collected from the T = 1059 tasks used in the evaluation of Bertasius et al [8]. This represents the subset of wikiHow tasks that have at least 100 video samples in the HowTo100M dataset.…”

Section: Implementation Details (mentioning)

confidence: 99%
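The dot-product similarity mentioned in the statement above can be sketched in a few lines. This is a minimal illustration, not the citing paper's implementation: the helper names `dot` and `best_step` and the toy 3-dimensional embeddings are hypothetical, and the models that would produce real embeddings are not shown.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def best_step(segment: Sequence[float], steps: List[Sequence[float]]) -> int:
    """Index of the step embedding most similar to the segment embedding."""
    scores = [dot(segment, s) for s in steps]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (hypothetical values chosen for illustration only).
step_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.7],
]
segment_embedding = [0.1, 0.9, 0.2]

if __name__ == "__main__":
    print(best_step(segment_embedding, step_embeddings))
```

Because the dot product is not normalized, vectors with larger norms score higher; whether the embeddings are normalized beforehand is not stated in the quoted text.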

“…We implement our video model using the code base of TimeSformer [8] and we follow its training configuration for HowTo100M, unless otherwise specified. All methods and baselines based on TimeSformer start from a configuration of ViT initialized with ImageNet-21K ViT pretraining [14].…”

Section: Implementation Details (mentioning)

confidence: 99%

“…4.3). In the ablations, in order to reduce the computational cost, we use a smaller subset corresponding to the collection of 80K long videos defined by Bertasius et al [8]. Classification of Procedural Activities.…”

Section: Datasets and Evaluation Metrics (mentioning)

confidence: 99%


Learning To Recognize Procedural Activities with Distant Supervision

Lin, Petroni, Bertasius et al. 2022

Preprint

Self Cite


In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately categorizing these activities requires not only recognizing the individual steps that compose the task but also capturing their temporal dependencies. This problem is dramatically different from traditional action classification, where models are typically optimized on videos that span only a few seconds and that are manually trimmed to contain simple atomic actions. While step annotations could enable the training of models to recognize the individual steps of procedural activities, existing large-scale datasets in this area do not include such segment labels due to the prohibitive cost of manually annotating temporal boundaries in long videos. To address this issue, we propose to automatically identify steps in instructional videos by leveraging the distant supervision of a textual knowledge base (wikiHow) that includes detailed descriptions of the steps needed for the execution of a wide variety of complex activities. Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base. We demonstrate that video models trained to recognize these automatically-labeled steps (without manual supervision) yield a representation that achieves superior generalization performance on four downstream tasks: recognition of procedural activities, step classification, step forecasting and egocentric video classification.


Masked Video Modeling with Correlation-Aware Contrastive Learning for Breast Cancer Diagnosis in Ultrasound

Lin, Huang et al. 2022

Lecture Notes in Computer Science


Breast cancer is one of the leading causes of cancer deaths in women. As the primary output of breast screening, breast ultrasound (US) video contains exclusive dynamic information for cancer diagnosis. However, training models for video analysis is non-trivial as it requires a voluminous dataset which is also expensive to annotate. Furthermore, the diagnosis of breast lesion faces unique challenges such as inter-class similarity and intra-class variation. In this paper, we propose a pioneering approach that directly utilizes US videos in computer-aided breast cancer diagnosis. It leverages masked video modeling as pretraining to reduce reliance on dataset size and detailed annotations. Moreover, a correlation-aware contrastive loss is developed to facilitate the identifying of the internal and external relationship between benign and malignant lesions. Experimental results show that our proposed approach achieved promising classification performance and can outperform other state-of-the-art methods.
