2015
DOI: 10.1109/tcsvt.2014.2333151

STAP: Spatial-Temporal Attention-Aware Pooling for Action Recognition

Cited by 72 publications (34 citation statements)
References 37 publications
“…To further improve activity recognition, recent works have focused on exploiting context [10,16,50], which represents and harnesses information in the temporal and/or spatial neighborhood, or on attention [51], which learns an adaptive confidence score to leverage this surrounding information. In this realm, Caba Heilbron et al [10] develop a semantic context encoder that exploits evidence of objects and scenes within video segments to improve the effectiveness and efficiency of activity detection.…”
Section: Activity Recognition
confidence: 99%
“…More recently, several works use temporal context to localize activities in videos [16] or to generate proposals [28]. Furthermore, Nguyen et al [51] present a pooling method that uses spatio-temporal attention for enhanced action recognition, while Pei et al [53] use temporal attention to gate neighboring observations in an RNN framework. Note that attention is also widely used in video captioning [34,44,48].…”
Section: Activity Recognition
confidence: 99%
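The attention mechanism these statements describe (learning an adaptive confidence score per frame or region and using it to weight features before pooling) can be summarized in a few lines. The sketch below is a generic, illustrative example only, not the STAP method or the implementation of Nguyen et al; the feature dimensions, the linear scoring function, and all variable names are assumptions.

    # Minimal sketch of attention-weighted pooling over per-frame features.
    # Illustrative only; dimensions, scoring function, and names are assumed.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attention_pool(frame_feats, w):
        """Pool a (T, D) matrix of per-frame features into one D-dim descriptor.

        frame_feats: (T, D) array of per-frame (or per-region) features.
        w:           (D,) scoring vector standing in for the learned attention
                     model that assigns a confidence to each frame.
        """
        scores = frame_feats @ w      # one relevance score per frame
        alpha = softmax(scores)       # normalized attention weights
        return alpha @ frame_feats    # weighted average replaces uniform mean pooling

    # Toy usage: 10 frames with 128-dim features and a random scoring vector.
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((10, 128))
    w = rng.standard_normal(128)
    clip_descriptor = attention_pool(feats, w)
    print(clip_descriptor.shape)  # (128,)

Compared with uniform average pooling, frames with higher attention scores contribute more to the clip descriptor, which is the "adaptive confidence score" idea the cited works exploit.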
“…Since the current solution is specific to image parsing, we are also interested in generalizing the proposed method to other recognition tasks, such as image retrieval and general k-NN classification. We also plan to extend our work to the video domain, e.g., action recognition [31] and human fixation prediction [32].…”
Section: Discussion
confidence: 99%
“…There have been some recent attempts [5,7] to extend objectness to actionness. They often measure actionness by fusing different feature channels such as space-time saliency [24], optical flow [7], body configuration [16], and deep-learning features [9], sometimes with human input such as eye fixation [23]. However, compared to objectness, actionness in videos is still not sufficiently explored, owing to the computational cost of the video space and the subtlety of human actions.…”
Section: Related Work
confidence: 99%