Am I Done? Predicting Action Progress in Videos

Becattini, Federico; Uricchio, Tiberio; Seidenari, Lorenzo; Ballan, Lamberto; Del Bimbo, Alberto

doi:10.1145/3402447

Cited by 23 publications

(17 citation statements)

References 55 publications

“…Aliakbarian et al. [5] proposed a two-stage LSTM architecture which models context and action to perform early action recognition. Becattini et al. [50] designed ProgressNet, an approach capable of estimating the progress of actions and localizing them in space and time. De Geest and Tuytelaars [51] addressed early action recognition proposing a "feedback network" which uses two LSTM streams to interpret feature representations and model the temporal structure of subsequent observations.…”

Section: Early Action Recognition In Third Person Vision (mentioning)

confidence: 99%

Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video

Furnari

Farinella

2021

IEEE Trans. Pattern Anal. Mach. Intell.

In this paper, we tackle the problem of egocentric action anticipation, i.e., predicting what actions the camera wearer will perform in the near future and which objects they will interact with. Specifically, we contribute Rolling-Unrolling LSTM, a learning architecture to anticipate actions from egocentric videos. The method is based on three components: 1) an architecture comprised of two LSTMs to model the sub-tasks of summarizing the past and inferring the future, 2) a Sequence Completion Pre-Training technique which encourages the LSTMs to focus on the different sub-tasks, and 3) a Modality ATTention (MATT) mechanism to efficiently fuse multi-modal predictions performed by processing RGB frames, optical flow fields and object-based features. The proposed approach is validated on EPIC-Kitchens, EGTEA Gaze+ and ActivityNet. The experiments show that the proposed architecture is state-of-the-art in the domain of egocentric videos, achieving top performances in the 2019 EPIC-Kitchens egocentric action anticipation challenge. The approach also achieves competitive performance on ActivityNet with respect to methods not based on unsupervised pre-training and generalizes to the tasks of early action recognition and action recognition. To encourage research on this challenging topic, we made our code, trained models, and pre-extracted features available at our web page: http://iplab.dmi.unict.it/rulstm.

“…The early action recognition task [22], [23], [24], [1] is to recognize the ongoing action as early as possible from partial observations. In this task, the model is only allowed to observe a part of the action videos, and predict the action based on the video segment [25], [26].…”

Section: B Early Action Recognition (mentioning)

confidence: 99%

Learning to Anticipate Egocentric Actions by Imagination

Wu

Zhu

Wang

et al. 2021

Preprint

“…Multi-stream architectures have been widely employed for action [20,22,23,32-34] and gesture recognition [12-16,27,35]. This technique consists of processing different versions of the same video in parallel with two or more CNNs.…”

Section: Multi-stream Gesture Recognition (mentioning)

confidence: 99%

“…However, gesture spotting is needed for practical applications, since the duration and temporal boundaries of gestures are commonly unknown in practice [17,18]. It is worth noting that temporal action proposal generation (TAPG) is similar to gesture spotting, and receives more attention from the research community [19-26]. TAPG generates video segment proposals (candidates) that may contain human action instances from untrimmed videos.…”

Section: Introduction (mentioning)

confidence: 99%

Finger Gesture Spotting from Long Sequences Based on Multi-Stream Recurrent Neural Networks

Benitez-Garcia

Haris

Tsuda

et al. 2020

Sensors

Gesture spotting is an essential task for recognizing finger gestures used to control in-car touchless interfaces. Automated methods to achieve this task require to detect video segments where gestures are observed, to discard natural behaviors of users’ hands that may look as target gestures, and be able to work online. In this paper, we address these challenges with a recurrent neural architecture for online finger gesture spotting. We propose a multi-stream network merging hand and hand-location features, which help to discriminate target gestures from natural movements of the hand, since these may not happen in the same 3D spatial location. Our multi-stream recurrent neural network (RNN) recurrently learns semantic information, allowing to spot gestures online in long untrimmed video sequences. In order to validate our method, we collect a finger gesture dataset in an in-vehicle scenario of an autonomous car. 226 videos with more than 2100 continuous instances were captured with a depth sensor. On this dataset, our gesture spotting approach outperforms state-of-the-art methods with an improvement of about 10% and 15% of recall and precision, respectively. Furthermore, we demonstrated that by combining with an existing gesture classifier (a 3D Convolutional Neural Network), our proposal achieves better performance than previous hand gesture recognition methods.
