Intra- and Inter-Action Understanding via Temporal Action Parsing

Shao, Dian; Zhao, Yue; Dai, Bo; Lin, Dahua

doi:10.1109/cvpr42600.2020.00081

Cited by 55 publications

(24 citation statements)

References 36 publications


“…As a result, most temporal localization methods [49-51, 66, 67, 84] contain a temporal proposal module to simply treat video segments that do not belong to pre-defined classes as the background. Temporal segmentation methods [18,43,60] typically divide a video into segments of actions, or sub-actions [62,63]. But still, those methods can only predict boundaries of pre-defined classes, not generic boundaries.…”

Section: Related Work (mentioning)

confidence: 99%

Exploring Temporal Granularity in Self-Supervised Video Representation Learning

Qu, Li, Yuan et al. 2021

Preprint


This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations. In TeG, we sample a long clip from a video and a short clip that lies inside the long clip. We then extract their dense temporal embeddings. The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings in the short clip and the long clip, and a persistent temporal learning objective to pull together global embeddings of the two clips. Our study reveals the impact of temporal granularity with three major findings. 1) Different video tasks may require features of different temporal granularities. 2) Intriguingly, some tasks that are widely considered to require temporal awareness can actually be well addressed by temporally persistent features. 3) The flexibility of TeG gives rise to state-of-the-art results on 8 video benchmarks, outperforming supervised pre-training in most cases.



“…To use the high-level segment information to refine the erroneous predictions in low-level frames, we capture the relations between frames F_0 and its corresponding segments F_1 and F_2 and F_3, respectively. Recently, Transformer (Vaswani et al 2017), a kind of attention layer, shows promising results in learning attentive weights/relationships with applications in images, words, and videos (Vaswani et al 2017; Shao et al 2020; Dosovitskiy et al 2020; Arnab et al 2021). Here, we adopt the transformer to learn the relationships between segments and frames.…”

Section: Segment-frame Attention (SFA) Module (mentioning)

confidence: 99%

Exploring Segment-level Semantics for Online Phase Recognition from Surgical Videos

Ding, Li 2021

Preprint


Automatic surgical phase recognition plays an important role in robot-assisted surgeries. Existing methods ignored a pivotal problem that surgical phases should be classified by learning segment-level semantics instead of solely relying on frame-wise information. In this paper, we present a segment-attentive hierarchical consistency network (SAHC) for surgical phase recognition from videos. The key idea is to extract hierarchical high-level semantic-consistent segments and use them to refine the erroneous predictions caused by ambiguous frames. To achieve it, we design a temporal hierarchical network to generate hierarchical high-level segments. Then, we introduce a hierarchical segment-frame attention (SFA) module to capture relations between the low-level frames and high-level segments. By regularizing the predictions of frames and their corresponding segments via a consistency loss, the network can generate semantic-consistent segments and then rectify the misclassified predictions caused by ambiguous low-level frames. We validate SAHC on two public surgical video datasets, i.e., the M2CAI16 challenge dataset and the Cholec80 dataset. Experimental results show that our method outperforms previous state-of-the-arts by a large margin, notably reaches 4.1% improvements on M2CAI16. Code will be released at GitHub upon acceptance.


“…Traditional fully-supervised deep learning methods typically require large amounts of annotated data, introducing a significant prone-to-ambiguity annotation workload [36,37,47,55]. For this reason, learning with scarce data (i.e., few-shot learning) has received increasing attention, in domains like object detection [8,14,28,42-44], action recognition [1,2,4,13,54,59], and action localization [9,15,51,52].…”

Section: Related Work (mentioning)

confidence: 99%

Few-Shot Action Localization without Knowing Boundaries

Xie, Tzelepis et al. 2021

Proceedings of the 2021 International Conference on Multimedia Retrieval


Learning to localize actions in long, cluttered, and untrimmed videos is a hard task, that in the literature has typically been addressed assuming the availability of large amounts of annotated training samples for each class, either in a fully-supervised setting, where action boundaries are known, or in a weakly-supervised setting, where only class labels are known for each video. In this paper, we go a step further and show that it is possible to learn to localize actions in untrimmed videos when a) only one/few trimmed examples of the target action are available at test time, and b) when a large collection of videos with only class label annotation (some trimmed and some weakly annotated untrimmed ones) are available for training; with no overlap between the classes used during training and testing. To do so, we propose a network that learns to estimate Temporal Similarity Matrices (TSMs) that model a fine-grained similarity pattern between pairs of videos (trimmed or untrimmed), and uses them to generate Temporal Class Activation Maps (TCAMs) for seen or unseen classes. The TCAMs serve as temporal attention mechanisms to extract video-level representations of untrimmed videos, and to temporally localize actions at test time. To the best of our knowledge, we are the first to propose a weakly-supervised, one/few-shot action localization network that can be trained in an end-to-end fashion. Experimental results on THUMOS14 and ActivityNet1.2 datasets, show that our method achieves performance comparable or better to state-of-the-art fully-supervised, few-shot learning methods. CCS Concepts: Computing methodologies → Machine learning.
