Deep Keyframe Detection in Human Action Videos

Yan, Xiang; Gilani, Syed Zulqarnain; Qin, Hanlin; Feng, Mingtao; Zhang, Liang; Mian, Ajmal

doi:10.48550/arxiv.1804.10021

Cited by 6 publications

(5 citation statements)

References 33 publications

“…Directing the model to attend to features from the most important frames prevents the model from overfitting to less important frames which may contain irrelevant information. Hard attention-based methods such as [29,37] detect a specific set of video frames that maximally contribute to the final prediction. Rather than selecting a small number of frames to keep for further analysis and discarding the rest, we propose a soft self-attention mechanism which assigns an importance weight to every frame.…”

Section: Temporal Attention Mechanism (mentioning)

confidence: 99%

Pose is all you need: the pose only group activity recognition system (POGARS)

Thilakarathne

Nibali

et al. 2022

Machine Vision and Applications

We introduce a novel deep learning-based group activity recognition approach called the Pose Only Group Activity Recognition System (POGARS), designed to use only tracked poses of people to predict the performed group activity. In contrast to existing approaches for group activity recognition, POGARS uses 1D CNNs to learn spatiotemporal dynamics of individuals involved in a group activity and forgo learning features from pixel data. The proposed model uses a spatial and temporal attention mechanism to infer person-wise importance and multi-task learning for simultaneously performing group and individual action classification. Experimental results confirm that POGARS achieves highly competitive results compared to state-of-the-art methods on a widely used public volleyball dataset despite only using tracked pose as input. Further, our experiments show by using pose only as input, POGARS has better generalization capabilities compared to methods that use RGB as input.
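
The contrast drawn in the citation statement above, between hard attention that keeps a few frames and soft attention that weights every frame, can be illustrated with a minimal sketch. It is not taken from POGARS or from the cited keyframe paper: the function names, the per-frame scores, and the toy 2-D features are all hypothetical, and a plain softmax stands in for whatever scoring network a real model would learn.

```python
import math

def soft_attention_weights(scores):
    """Softmax over per-frame scores: every frame keeps a non-zero weight."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_attention_indices(scores, k):
    """Indices of the k highest-scoring frames; all other frames are dropped."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def weighted_pool(features, weights):
    """Weighted sum of per-frame feature vectors."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

# Hypothetical importance scores and 2-D features for a 4-frame clip.
frame_scores = [0.1, 2.0, 0.3, 1.5]
frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

weights = soft_attention_weights(frame_scores)    # soft: one weight per frame
pooled = weighted_pool(frame_features, weights)   # clip-level feature
keyframes = hard_attention_indices(frame_scores, k=2)  # hard: keep 2 frames
```

In this toy setup the soft variant lets low-scoring frames still contribute a little to the pooled feature, whereas the hard variant discards them outright.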

“…Directing the model to attend to features from the most important frames prevents the model from overfitting to less important frames which may contain irrelevant information. Hard attention based methods such as [38,45] detect a specific set of video frames that maximally contribute to the final prediction. Rather than selecting a small number of frames to keep for further analysis and discarding the rest, we propose a soft self attention mechanism which assigns an importance weight to every frame.…”

Section: Temporal Attention Mechanism (mentioning)

confidence: 99%

Pose is all you need: The pose only group activity recognition system (POGARS)

Thilakarathne

Nibali

He

et al. 2021

Preprint

We introduce a novel deep learning based group activity recognition approach called the Pose Only Group Activity Recognition System (POGARS), designed to use only tracked poses of people to predict the performed group activity. In contrast to existing approaches for group activity recognition, POGARS uses 1D CNNs to learn spatiotemporal dynamics of individuals involved in a group activity and forgo learning features from pixel data. The proposed model uses a spatial and temporal attention mechanism to infer person-wise importance and multi-task learning for simultaneously performing group and individual action classification. Experimental results confirm that POGARS achieves highly competitive results compared to state-of-the-art methods on a widely used public volleyball dataset despite only using tracked pose as input. Further our experiments show by using pose only as input, POGARS has better generalization capabilities compared to methods that use RGB as input.

“…Paradigm II) Solving the key frame video object detection in two steps, II-A) a temporal model (e.g., attention RNN, 3D/(2+1)D CNN, transformer) [15], [17], [39], [41], [69] is trained to detect the indices of the key frames, II-B) followed by object detection at the recognized key frames. In order to compare U-LanD framework against paradigm II, we consider a semi-automatic approach, where the ground-truth indices of the key frames are suggested by the cardiologist, followed Fig.…”

Section: Evaluations (mentioning)

confidence: 99%

“…As a result, the available training video datasets suffer from two limitations: 1) videos are sparsely labelled, i.e., a small portion of frames in each video have ground-truth landmark labels; and 2) the labelled frames are extensively biased towards specific points in time, i.e., only key frames in each training video are labelled. Previous work mainly divides the problem of video object detection on key frames into sub-problems of key frame recognition [13]- [17] and object detection. They propose techniques such as self-supervised learning [18], semi-supervised learning [19], label propagation [20], registration [21], and temporal cycle-consistency [22].…”

Section: Introduction (mentioning)

confidence: 99%

U-LanD: Uncertainty-Driven Video Landmark Detection

Jafari

Luong

Tsang

et al. 2021

Preprint

This paper presents U-LanD, a framework for joint detection of key frames and landmarks in videos. We tackle a specifically challenging problem, where training labels are noisy and highly sparse. U-LanD builds upon a pivotal observation: a deep Bayesian landmark detector solely trained on key video frames, has significantly lower predictive uncertainty on those frames vs. other frames in videos. We use this observation as an unsupervised signal to automatically recognize key frames on which we detect landmarks. As a test-bed for our framework, we use ultrasound imaging videos of the heart, where sparse and noisy clinical labels are only available for a single frame in each video. Using data from 4,493 patients, we demonstrate that U-LanD can exceedingly outperform the state-of-the-art non-Bayesian counterpart by a noticeable absolute margin of 42% in R² score, with almost no overhead imposed on the model size. Our approach is generic and can be potentially applied to other challenging data with noisy and sparse training labels.
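
The observation in the abstract above, that predictive uncertainty is lowest on key frames, suggests a simple selection rule, sketched here under stated assumptions. The sketch assumes each frame comes with several stochastic landmark predictions (for example from repeated sampling of a Bayesian detector) and uses the mean per-coordinate variance as the uncertainty measure. The function names, the variance-based measure, and the toy data are hypothetical and are not U-LanD's actual procedure.

```python
from statistics import pvariance

def frame_uncertainty(samples):
    """Mean per-coordinate variance across the sampled predictions of one frame."""
    coords = list(zip(*samples))  # group the samples coordinate by coordinate
    return sum(pvariance(c) for c in coords) / len(coords)

def pick_key_frame(per_frame_samples):
    """Return the index of the least uncertain frame and all uncertainties."""
    uncertainties = [frame_uncertainty(s) for s in per_frame_samples]
    best = min(range(len(uncertainties)), key=uncertainties.__getitem__)
    return best, uncertainties

# Hypothetical clip: 3 frames, each with 3 sampled (x, y) landmark predictions.
video = [
    [[10.0, 20.0], [14.0, 25.0], [7.0, 16.0]],    # predictions disagree
    [[12.0, 22.0], [12.1, 21.9], [11.9, 22.1]],   # predictions agree closely
    [[30.0, 5.0], [22.0, 9.0], [27.0, 1.0]],      # predictions disagree
]

key_index, uncertainties = pick_key_frame(video)
```

Here the middle frame is selected because its sampled predictions agree most closely; a real system would likely use a calibrated uncertainty estimate rather than raw sample variance.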
