DynamoNet: Dynamic Action and Motion Network

Diba, Ali; Sharma, Vivek; Gool, Luc Van; Stiefelhagen, Rainer

doi:10.1109/iccv.2019.00629

Cited by 115 publications

(54 citation statements)

References 49 publications


“…The superior results in Table 4 show that we achieve the comparable performances: 75.7% on the top-1 accuracy and 93.8% on the top-5 accuracy in validation set, which outperforms our baseline ECO-Lite-EN [17] by 5.7% on top-1 accuracy and by 4.4% on top-5 accuracy. It also outperforms the recent works STM [31] and DynamoNet-32F (ResNext101) [29] by 2.0% and 7.5% on top-1 accuracy and by 2.2% and 5.7% on top-5 accuracy, respectively.…”

Section: Results On Kinetics-400 (mentioning)

confidence: 62%
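The margins in the citation statement above imply the accuracies of the compared methods on the Kinetics-400 validation set. The following is a minimal sketch of that arithmetic, assuming each "by X%" is an absolute difference in percentage points (the quote does not say so explicitly); the variable and function names are illustrative only.

```python
# Accuracies the quote reports for the citing method (SAST-EN), in percent.
sast = {"top1": 75.7, "top5": 93.8}

# Margins by which the quote says SAST-EN outperforms each method.
# Assumption: these are absolute percentage points, not relative gains.
margins = {
    "ECO-Lite-EN": {"top1": 5.7, "top5": 4.4},
    "STM": {"top1": 2.0, "top5": 2.2},
    "DynamoNet-32F (ResNext101)": {"top1": 7.5, "top5": 5.7},
}


def implied_accuracy(reported, margin):
    """Subtract each margin from the reported accuracy, rounded to 0.1."""
    return {key: round(reported[key] - margin[key], 1) for key in reported}


implied = {name: implied_accuracy(sast, m) for name, m in margins.items()}

for name, acc in implied.items():
    print(f"{name}: top-1 {acc['top1']:.1f}%, top-5 {acc['top5']:.1f}%")
```

Under that assumption, the figure implied for DynamoNet-32F (ResNext101) is 68.2% top-1 and 88.1% top-5.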

“…Specifically, our SAST-EN significantly outperforms our baseline ECO-Lite-EN [17] by 1.6% on UCF101 and by 2.7% on HMDB51. It also outperforms the recent works STM [31], DistInit [28] and DynamoNet-32F (ResNext101) [29] by 0.2%, 10.6% and 3.3% on UCF101, and by 2.9%, 20.3% and 6.6% on HMDB51, respectively. Note that SAST-EN represents the average scores obtained from an ensemble of SAST network with the {16, 20, 24, 32} number of input frames similar to ECO-Lite-EN [17].…”

Section: Performance Comparison 1) Results On UCF101 and HMDB51 (mentioning)

confidence: 66%

“…Girdhar et al [28] propose an approach DistInit to transfer image models to video, which takes the years of effort in collecting and labeling large and clean still-image datasets. Reference [29] presents a dynamic action and motion network (DynamoNet) using […] over time and randomly select one frame from each segment and feed total of N frames into the weight shared 2D Deformable Convolutional network (2DDC) to extract semantic action-aware spatial features. Secondly, these discriminative spatial features are concatenated with time series to yield a preliminary video-level representation.…”

Section: Related Work, A. Action Recognition With Deep Learning (mentioning)

confidence: 99%
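The sampling step mentioned in the statement above (one randomly chosen frame from each segment, N frames in total) can be sketched as follows. This is an illustration only: the quoted text is incomplete where the segmentation is described, so equal-length segments are an assumption, and `sample_segment_frames` is a hypothetical helper name, not taken from either paper.

```python
import random


def sample_segment_frames(num_frames, num_segments, rng=None):
    """Return one random frame index from each of `num_segments` segments.

    Assumption: the video is split into contiguous, roughly equal segments.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    # Segment i covers frame indices [bounds[i], bounds[i + 1]).
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


# Example: pick N = 16 frames from a 300-frame video.
indices = sample_segment_frames(300, 16, random.Random(0))
print(indices)
```

Because each index is drawn from its own segment, the result is always in temporal order and spans the whole video, whatever the random seed.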


SAST: Learning Semantic Action-Aware Spatial-Temporal Features for Efficient Action Recognition

Wang, Huang, et al. 2019

IEEE Access


The state-of-the-arts in action recognition are suffering from three challenges: (1) How to model spatial transformations of action since it is always geometric variation over time in videos. (2) How to develop the semantic action-aware temporal features from one video with a large proportion of irrelevant frames to the labeled action class, which hurt the final performance. (3) The action recognition speed of most existing models is too slow to be applied to actual scenes. In this paper, to address these three challenges, we propose a novel CNN-based action recognition method called SAST including three important modules, which can effectively learn semantic action-aware spatial-temporal features with a faster speed. Firstly, to learn action-aware spatial features (spatial transformations), we design a weight shared 2D Deformable Convolutional network named 2DDC with deformable convolutions whose receptive fields can be adaptively adjusted according to the complex geometric structure of actions. Then, we propose a light Temporal Attention model called TA to develop the action-aware temporal features that are discriminative for the labeled action category. Finally, we apply an effective 3D network to learn the temporal context between frames for building the final video-level representation. To improve the efficiency, we only utilize the raw RGB rather than optical flow and RGB as the input to our model. Experimental results on four challenging video recognition datasets Kinetics-400, Something-Something-V1, UCF101 and HMDB51 demonstrate that our proposed method can not only achieve comparable performances but be 10x to 50x faster than most of state-of-the-art action recognition methods.


“…Condconv [30] improves the model capacity by increasing the size and complexity of the kernel-generating function. Due to their advantages, dynamic filter networks have been applied in many areas, like human action recognition [7], super-resolution [29].…”

Section: Dynamic Filter Network (mentioning)

confidence: 99%

Local-enhanced Interaction for Temporal Moment Localization

Liang, Zhang 2021

Proceedings of the 2021 International Conference on Multimedia Retrieval


Temporal moment localization via language aims to localize a video span in an untrimmed video which best matches the given natural language query. In most previous works, they try to match the whole query feature with multiple moment proposals, or match a global video embedding with phrase or word level query features. However, these coarse interaction models will become insufficient when the query-video contains more complex relationship. To address this issue, we propose a multi-branches interaction model for temporal moment localization. Specifically, the query sentence and video are encoded into multiple feature embeddings over several semantic sub-spaces. Then, each phrase embedding filters on a video feature to generate an attention sequence, which is used to reweight the video features. Moreover, a dynamic pointer decoder is developed to iteratively regress the temporal boundary, which can prevent our model from falling into a local optimum. To validate the proposed method, we have conducted extensive experiments on two popular benchmark datasets Charade-STA and TACoS. The experimental performance surpasses other state-of-the-arts methods, which demonstrates the effectiveness of our proposed model.

CCS Concepts: Information systems → Novelty in information retrieval.
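The reweighting step in this abstract (a phrase embedding applied to the video features to produce an attention sequence, which then rescales those features) can be illustrated with a small sketch. The abstract does not give the scoring or normalization functions, so the dot-product score and the softmax used here are assumptions, and `attention_reweight` is a hypothetical name.

```python
import math


def attention_reweight(phrase, video):
    """Reweight per-frame video features by their match to a phrase embedding.

    phrase: list of D floats; video: list of T frame features, each D floats.
    Assumptions: dot-product scoring and softmax normalization over time.
    """
    # One score per frame: how strongly the phrase "filter" responds to it.
    scores = [sum(p * v for p, v in zip(phrase, frame)) for frame in video]
    # Numerically stable softmax over the temporal axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    # Scale every frame feature by its attention weight.
    reweighted = [[a * v for v in frame] for a, frame in zip(attention, video)]
    return attention, reweighted


phrase = [1.0, 0.0]
video = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
attention, reweighted = attention_reweight(phrase, video)
print(attention)
```

In this toy input the second frame aligns best with the phrase, so it receives the largest weight and its features are attenuated the least.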


“…Recently, there were few works, which focused on exploiting the temporal information via concatenation of multiple frames at input, such as Sun et al (2015) and Diba et al (2019). The problem of these approaches lies in inability to scale well on long sequences.…”

Section: Exploiting Previous Frames Information (mentioning)

confidence: 99%

Online supervised attention-based recurrent depth estimation from monocular video

Maslov, Makarov 2020

PeerJ Computer Science


Autonomous driving highly depends on depth information for safe driving. Recently, major improvements have been taken towards improving both supervised and self-supervised methods for depth reconstruction. However, most of the current approaches focus on single frame depth estimation, where quality limit is hard to beat due to limitations of supervised learning of deep neural networks in general. One of the way to improve quality of existing methods is to utilize temporal information from frame sequences. In this paper, we study intelligent ways of integrating recurrent block in common supervised depth estimation pipeline. We propose a novel method, which takes advantage of the convolutional gated recurrent unit (convGRU) and convolutional long short-term memory (convLSTM). We compare use of convGRU and convLSTM blocks and determine the best model for real-time depth estimation task. We carefully study training strategy and provide new deep neural networks architectures for the task of depth estimation from monocular video using information from past frames based on attention mechanism. We demonstrate the efficiency of exploiting temporal information by comparing our best recurrent method with existing image-based and video-based solutions for monocular depth reconstruction.
