Recurrent Tubelet Proposal and Recognition Networks for Action Detection

Li, Dong; Qiu, Zhaofan; Dai, Qi; Yao, Ting; Mei, Tao

doi:10.1007/978-3-030-01231-1_19

Cited by 113 publications

(80 citation statements)

References 32 publications

“…More recently the method proposed by Khan and Borji [27] used a fine-tuned version of RefineNet [16] in conjunction with Conditional Random Fields to achieve pixel-level hand segmentation and used the segmentation masks later with AlexNet for ego hand activity detection. Li et al [14] proposed the concept of recurrent tubelets proposal and recognition. In this approach the current area related to hand is extracted based on its previous location recurrently, and features are calculated on this extracted area. These features are then fed into a separate network for recognising gestures.…”

Section: Previous Work (mentioning)

confidence: 99%

“…These features are then fed into a separate network for recognising gestures. In all the above approaches [1,33,27,14], features were calculated on the extracted ego hand masks and then provided as an input to a different recognition system. Instead in our approach, we calculate features which can be both used for ego hand mask generation and ego gesture recognition simultaneously and also giving our network architecture ability to train end-to-end which wasn't possible in earlier approaches.…”

Section: Previous Work (mentioning)

confidence: 99%

Simultaneous Segmentation and Recognition: Towards More Accurate Ego Gesture Recognition

Chalasani

Smolić

2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

Ego hand gestures can be used as an interface in AR and VR environments. While the context of an image is important for tasks like scene understanding, object recognition, image caption generation and activity recognition, it plays a minimal role in ego hand gesture recognition. An ego hand gesture used for AR and VR environments conveys the same information regardless of the background. With this idea in mind, we present our work on ego hand gesture recognition that produces embeddings from RGB images with ego hands, which are simultaneously used for ego hand segmentation and ego gesture recognition. To this extent, we achieved better recognition accuracy (96.9%) compared to the state of the art (92.2%) on the biggest ego hand gesture dataset available publicly. We present a gesture recognition deep neural network which recognises ego hand gestures from videos (videos containing a single gesture) by generating and recognising embeddings of ego hands from image sequences of varying length. We introduce the concept of simultaneous segmentation and recognition applied to ego hand gestures, present the network architecture, the training procedure and the results compared to the state of the art on the EgoGesture dataset [31].

“…T-CNN [12] and ACT [17] improve [28,35] by modeling short-term temporal information. RTPR [24] exploring long-term temporal dynamics with LSTM further boosts up the performance. [40] leads to video-mAP gain by modeling the relation between human and global context, but still yields inferior performance to our LSTR.…”

Section: Comparison With State-of-the-art (mentioning)

confidence: 99%

Long Short-Term Relation Networks for Video Action Detection

Yao

Qiu

et al. 2019

Proceedings of the 27th ACM International Conference on Multimedia

Self Cite

It has been well recognized that modeling human-object or object-object relations would be helpful for detection task. Nevertheless, the problem is not trivial especially when exploring the interactions between human actor, object and scene (collectively as human-context) to boost video action detectors. The difficulty originates from the aspect that reliable relations in a video should depend on not only short-term human-context relation in the present clip but also the temporal dynamics distilled over a long-range span of the video. This motivates us to capture both short-term and long-term relations in a video. In this paper, we present a new Long Short-Term Relation Networks, dubbed as LSTR, that novelly aggregates and propagates relation to augment features for video action detection. Technically, Region Proposal Networks (RPN) is remoulded to first produce 3D bounding boxes, i.e., tubelets, in each video clip. LSTR then models short-term human-context interactions within each clip through spatio-temporal attention mechanism and reasons long-term temporal dynamics across video clips via Graph Convolutional Networks (GCN) in a cascaded manner. Extensive experiments are conducted on four benchmark datasets, and superior results are reported when comparing to state-of-the-art methods. CCS Concepts: Computing methodologies → Activity recognition and understanding.

“…The first kind is video summarization methods [32,58], which generate a short synopsis for a long video. The second kind of methods [7,8,13,14,19,22,31,37,41] try to trim the video segment of interest. Using natural language as a query, [14,19] retrieve a specific temporal segment in a video, which shares the same semantic meaning as the query.…”

Section: Introduction (mentioning)

confidence: 99%

Spatio-Temporal Video Re-Localization by Warp LSTM

Feng

Liu

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

The need for efficiently finding the video content a user wants is increasing because of the erupting of user-generated videos on the Web. Existing keyword-based or content-based video retrieval methods usually determine what occurs in a video but not when and where. In this paper, we make an answer to the question of when and where by formulating a new task, namely spatio-temporal video re-localization. Specifically, given a query video and a reference video, spatio-temporal video re-localization aims to localize tubelets in the reference video such that the tubelets semantically correspond to the query. To accurately localize the desired tubelets in the reference video, we propose a novel warp LSTM network, which propagates the spatiotemporal information for a long period and thereby captures the corresponding long-term dependencies. Another issue for spatio-temporal video re-localization is the lack of properly labeled video datasets. Therefore, we reorganize the videos in the AVA dataset to form a new dataset for spatio-temporal video re-localization research. Extensive experimental results show that the proposed model achieves superior performances over the designed baselines on the spatio-temporal video re-localization task.
