Learning to Localize Actions from Moments

Long, Fuchen; Yao, Ting; Qiu, Zhaofan; Tian, Xinmei; Luo, Jiebo; Mei, Tao

doi:10.1007/978-3-030-58580-8_9

Cited by 7 publications

(6 citation statements)

References 45 publications


“…In addition, the supervision signal should not be limited to instance- or video-level labels. For example, some works [211], [212] first employ trimmed videos from video recognition benchmarks to learn action patterns, then localize action instances in untrimmed videos. Furthermore, the exploration of multiple modalities within video data is essential.…”

Section: Further Discussion and Promising Directions (mentioning)

confidence: 99%

Temporal Action Localization in the Deep Learning Era: A Survey

Wang, Zhao, Yang et al. 2024

IEEE Trans. Pattern Anal. Mach. Intell.


The temporal action localization research aims to discover action instances from untrimmed videos, representing a fundamental step in the field of intelligent video understanding. With the advent of deep learning, backbone networks have been instrumental in providing representative spatiotemporal features, while the end-to-end learning paradigm has enabled the development of high-quality models through data-driven training. Both supervised and weakly supervised learning approaches have contributed to the rapid progress of temporal action localization, resulting in a multitude of methods and a large body of literature, making a comprehensive survey a pressing necessity. This paper presents a thorough analysis of existing action localization works, offering a well-organized taxonomy that highlights the strengths and weaknesses of each strategy. In the realm of supervised learning, in addition to the anchor mechanism, we introduce a novel classification mechanism to categorize and summarize existing works. Similarly, for weakly supervised learning, we extend the traditional pre-classification and post-classification mechanisms by providing a fresh perspective on enhancement strategies. Furthermore, we shed light on the bottleneck of confidence estimation, a critical yet overlooked aspect of current works. By conducting detailed analyses, this survey serves as a valuable resource for researchers, providing beneficial guidance to newcomers and inspiring seasoned researchers alike.


“…The datasets commonly used for temporal action detection are mainly THUMOS14 [75], MEX-action2 [76], and ActivityNet [77]. The THUMOS14 dataset includes an action recognition part and a temporal action detection part.…”

Section: Action Detection 31 Action Detection Datasets (mentioning)

confidence: 99%

Action Recognition and Detection Based on Deep Learning: A Comprehensive Summary

Li, Liang, Gan et al. 2023

CMC


Action recognition and detection is an important research topic in computer vision, which can be divided into action recognition and action detection. At present, the distinction between action recognition and action detection is not clear, and the relevant reviews are not comprehensive. Thus, this paper summarizes the action recognition and detection methods and datasets based on deep learning to accurately present the research status in this field. First, according to how temporal and spatial features are extracted, the commonly used action recognition models are divided by architecture into two-stream models, temporal models, spatiotemporal models, and transformer models. The paper briefly analyzes the characteristics of the four model types and reports the accuracy of various algorithms on common datasets. Then, from the perspective of the task to be completed, action detection is further divided into temporal action detection and spatiotemporal action detection, and commonly used datasets are introduced. Temporal action detection algorithms are reviewed from the perspectives of two-stage and one-stage methods, and spatiotemporal action detection algorithms are summarized in detail. Finally, the relationship between the different parts of action recognition and detection is discussed, the difficulties faced by current research are summarized, and future developments are outlined.


“…Early approaches usually rely on hand-crafted features, which detect spatio-temporal interest points and then describe these points with local representations [45,46]. With the tremendous success of deep convolution networks on image-based classification tasks [12,35,38,41], researchers started to explore the application of deep networks on video action recognition task [7,18,29,30,54]. In [37], the famous two-stream architecture is devised by applying two 2D CNN architectures separately on visual frames and stacked optical flows.…”

Section: Related Work (mentioning)

confidence: 99%

Representing Videos as Discriminative Sub-graphs for Action Recognition

Liu, Qiu, Pan et al. 2022

Preprint

Self Cite


Human actions are typically of combinatorial structures or patterns, i.e., subjects, objects, plus spatio-temporal interactions in between. Discovering such structures is therefore a rewarding way to reason about the dynamics of interactions and recognize the actions. In this paper, we introduce a new design of sub-graphs to represent and encode the discriminative patterns of each action in the videos. Specifically, we present MUlti-scale Sub-graph LEarning (MUSLE) framework that novelly builds space-time graphs and clusters the graphs into compact sub-graphs on each scale with respect to the number of nodes. Technically, MUSLE produces 3D bounding boxes, i.e., tubelets, in each video clip, as graph nodes and takes dense connectivity as graph edges between tubelets. For each action category, we execute online clustering to decompose the graph into sub-graphs on each scale through learning Gaussian Mixture Layer and select the discriminative sub-graphs as action prototypes for recognition. Extensive experiments are conducted on both Something-Something V1 & V2 and Kinetics-400 datasets, and superior results are reported when comparing to state-of-the-art methods. More remarkably, our MUSLE achieves to-date the best reported accuracy of 65.0% on Something-Something V2 validation set.
