Fully Convolutional Network for Multiscale Temporal Action Proposals

Guo, Dashan; Li, Wei; Fang, Xiangzhong

doi:10.1109/tmm.2018.2839534

Cited by 28 publications

(3 citation statements)

References 29 publications

“…1) Similarity of inter-categorical actions: Multiple Instance Learning: The idea of extracting pieces of the input as proposals to then in a second stage decide which of these proposals are indeed classified as positive has been widely used in object detection [67], [32], [49], [50], [82], [74] and action detection [11], [30], [90], [19], [34]. The main goal of the proposals extraction is to filter as much as possible the relevant and non-relevant information by identifying the negative parts of the sample (i.e., background in the case of object detection and non-action in the case of action detection), in order to be discarded for the following classification stage.…”

Section: A Generation Of Action Proposals

mentioning

confidence: 99%

A Multi-stage deep architecture for summary generation of soccer videos

Sanabria, Precioso, Mattei et al. 2022

Preprint

Video content is present in an ever-increasing number of fields, both scientific and commercial. Sports, particularly soccer, is one of the industries that has invested the most in the field of video analytics, due to the massive popularity of the game and the emergence of new markets (such as sport betting markets). Previous state-of-the-art methods for soccer match video summarization rely on handcrafted heuristics to generate summaries, which generalize poorly, but these works have nonetheless shown that multiple modalities help detect the best actions of the game. On the other hand, machine learning models with higher generalization potential have entered the field of general-purpose video summarization, offering several deep learning approaches. However, most of them exploit content specificities that are not appropriate for whole-match sports videos. Although video content has for many years been the main source for automating knowledge extraction in soccer, the data that records all the events happening on the field has lately become very important in sports analytics, since this event data provides richer context information and requires less processing. Considering that in automatic sports summarization the goal is not only to show the most important actions of the game, but also to reproduce the storytelling of the whole match with as much emotion as that evoked by human editors, we propose a method to generate the summary of a soccer match video exploiting both the audio and the event metadata of the entire match. The results show that our method can detect the actions of the match, identify which of these actions should belong to the summary and then propose multiple candidate summaries which are similar enough but with relevant variability to provide different options to the final editor. Furthermore, we show the generalization capability of our work since it can transfer knowledge between datasets from different broadcasting companies, from different competitions, acquired in different conditions, and corresponding to summaries of different lengths.

“…Temporal convolution neural network is a common method to model sequential information [30][31][32][33]. Convolution layer is demonstrated implicitly to learn absolute position information from the commonly used padding operation [22].…”
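
The padding effect mentioned in this statement can be illustrated with a minimal sketch. It is not taken from either paper: the helper `conv1d_same`, the all-ones kernel, and the constant input are assumptions chosen for illustration, and the helper assumes an odd kernel length. With zero padding, a constant input yields different outputs at the borders than in the interior, so the output carries a trace of absolute position.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length.

    Hypothetical helper for illustration; assumes an odd kernel length.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


# A constant input carries no positional information by itself.
signal = [1.0] * 8
kernel = [1.0, 1.0, 1.0]

output = conv1d_same(signal, kernel)
print(output)  # border values differ from interior values
```

Here the first and last outputs sum one padded zero and two ones, while every interior output sums three ones, so a later layer could in principle tell border frames from interior frames.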

Section: Position Encoding In Convolution

mentioning

confidence: 99%

PiSLTRc: Position-informed Sign Language Transformer with Content-aware Convolution

Xie, Zhao, Hu 2021

Preprint

Owing to the Transformer's superiority in learning long-term dependencies, the sign language Transformer model has achieved remarkable progress in Sign Language Recognition (SLR) and Translation (SLT). However, there are several issues with the Transformer that prevent it from better sign language understanding. The first issue is that the self-attention mechanism learns sign video representation in a frame-wise manner, neglecting the temporal semantic structure of sign gestures. The second is that the attention mechanism with absolute position encoding is unaware of direction and distance, which limits its ability. To address these issues, we propose a new model architecture, namely PiSLTRc, with two distinctive characteristics: (i) content-aware and position-aware convolution layers. Specifically, we explicitly select relevant features using a novel content-aware neighborhood gathering method. Then we aggregate these features with position-informed temporal convolution layers, thus generating robust neighborhood-enhanced sign representation. (ii) injecting the relative position information into the attention mechanism in the encoder, decoder, and even encoder-decoder cross attention. Compared with the vanilla Transformer model, our model performs consistently better on three large-scale sign language benchmarks: PHOENIX-2014, PHOENIX-2014-T and CSL. Furthermore, extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on translation quality with +1.6 BLEU improvements.

“…It leads to an important yet challenging task for video analysis: Temporal Action Localization (TAL), which requires to not only classify the untrimmed videos into specific categories accurately, but also locate the temporal boundaries of action instances precisely. Although substantial progress has been achieved on this task [41], [26], [39], [16], [6], [18], [10], [9], it is still limited for industrial applications due to the huge amount of temporal annotations used for training such a deep learning based model in a fully-supervised manner, which are labor-intensive to annotate especially for a large-scale dataset. On the contrary, weak labels such as video-level labels are much easier to obtain, hence many current works try to handle this problem under weak supervision.…”

Section: Introduction

mentioning

confidence: 99%

Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021

Su, Zhuang, Li et al. 2021

Preprint

This technical report presents an overview of our solution used in the submission to the 2021 HACS Temporal Action Localization Challenge on both the Supervised Learning Track and the Weakly-Supervised Learning Track. Temporal Action Localization (TAL) requires not only precisely locating the temporal boundaries of action instances, but also accurately classifying the untrimmed videos into specific categories. Weakly-Supervised TAL (WSTAL), by contrast, means locating the action instances using only video-level class labels. In this paper, to train a supervised temporal action localizer, we adopt the Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals through "local and global" temporal context aggregation and complementary as well as progressive boundary refinement. As for WSTAL, a novel framework is proposed to handle the poor quality of the CAS generated by a simple classification network, which can only focus on local discriminative parts rather than locate the entire interval of target actions. Specifically, we propose to utilize convolutional kernels with varied dilation rates to enlarge the receptive fields, which is found to be capable of transferring the discriminative information to surrounding non-discriminative regions. Then we design a cascaded module with the proposed Online Adversarial Erasing (OAE) mechanism to further mine more relevant regions of target actions by feeding the erased feature maps of discovered regions back to the system. Besides, inspired by the transfer learning method, we also adopt an additional module to transfer the knowledge from trimmed videos (HACS Clips dataset) to untrimmed videos (HACS Segments dataset), aiming at promoting the classification performance on untrimmed videos. Finally, we employ a boundary regression module embedded with Outer-Inner-Contrastive (OIC) loss to automatically predict the boundaries based on the enhanced CAS. Our proposed scheme achieves 39.91 and 29.78 average mAP on the challenge testing set of the supervised and weakly-supervised temporal action localization tracks, respectively.
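
How dilation enlarges the receptive field can be sketched with the standard formula for stride-1 convolutions. The abstract does not say how the dilated kernels are arranged or which rates are used, so the stacked arrangement, the kernel size of 3, and the rates below are assumptions for illustration only; `effective_kernel_span` and `receptive_field` are hypothetical helper names.

```python
def effective_kernel_span(kernel_size, dilation):
    """Number of input steps spanned by one dilated kernel (stride 1)."""
    return (kernel_size - 1) * dilation + 1


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 1D convolutions.

    Each layer widens the field by (kernel_size - 1) * dilation.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Assumed settings: three layers with kernel size 3.
plain = receptive_field(3, [1, 1, 1])    # no dilation
dilated = receptive_field(3, [1, 2, 4])  # growing dilation rates
print(plain, dilated)  # 7 15
```

With the same number of layers and parameters, the dilated stack covers 15 time steps instead of 7, which is the sense in which varied dilation rates let information from a discriminative region reach its neighbours.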
