2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.387

DeepCAMP: Deep Convolutional Action & Attribute Mid-Level Patterns

Abstract: The recognition of human actions and the determination of human attributes are two tasks that call for fine-grained classification. Indeed, often rather small and inconspicuous objects and features have to be detected to tell their classes apart. In order to deal with this challenge, we propose a novel convolutional neural network that mines mid-level image patches that are sufficiently dedicated to resolve the corresponding subtleties. In particular, we train a newly designed CNN (DeepPattern) that learns dis…

Cited by 32 publications (14 citation statements)
References 33 publications
“…1). This is in contrast to existing methods that typically extract either global representations for the entire image [6,45,46,7] or video sequence [38,16], thus not focusing on the action itself, or localize the feature extraction process to the action itself via dense trajectories [43,42,9], optical flow [8,45,20] or actionness [44,3,14,48,39,24], thus failing to exploit contextual information. To the best of our knowledge, only two-stream networks [30,8,4,22] have attempted to jointly leverage both information types by making use of RGB frames in conjunction with optical flow to localize the action.…”
Section: Introduction
Confidence: 90%
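Several of the citing works above build on the two-stream idea of fusing appearance (RGB) and motion (optical flow). As a purely illustrative aid, here is a minimal PyTorch sketch of late-fusion two-stream classification in the spirit of Simonyan and Zisserman [30]; the layer sizes, the `TwoStreamNet` name, and score averaging are assumptions made for brevity, not the exact configurations of the cited papers.

```python
# Hypothetical minimal sketch of a two-stream action recognition model:
# one CNN over an RGB frame, one CNN over stacked optical-flow fields,
# fused by averaging class scores. Real two-stream nets use much deeper CNNs.
import torch
import torch.nn as nn

def make_stream(in_channels: int, num_classes: int) -> nn.Module:
    """A small convolutional stream; purely illustrative."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(64, num_classes),
    )

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes: int = 101, flow_stack: int = 10):
        super().__init__()
        self.rgb_stream = make_stream(3, num_classes)                 # appearance
        self.flow_stream = make_stream(2 * flow_stack, num_classes)   # motion

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # Late fusion: average the per-stream class scores.
        return 0.5 * (self.rgb_stream(rgb) + self.flow_stream(flow))

if __name__ == "__main__":
    model = TwoStreamNet(num_classes=101, flow_stack=10)
    rgb = torch.randn(1, 3, 224, 224)    # one RGB frame
    flow = torch.randn(1, 20, 224, 224)  # 10 stacked (x, y) flow fields
    print(model(rgb, flow).shape)        # torch.Size([1, 101])
```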
“…Most recent action approaches extract global representations for the entire image [6,46,7] or video sequence [38,16]. As such, these methods do not truly focus on the actions of interest, but rather compute a context-aware representation.…”
Section: Action Modeling
Confidence: 99%
“…[119] focuses on the changes that an action brings into the environment and proposes a siamese CNN architecture to fuse precondition and effect information from the environment. [20] proposes a CNN which uses mid-level discriminative visual elements. The method, called DeepPattern, is able to learn discriminative patches by exploring human body parts as well as scene context.…”
Section: Deep Learning With Fusion Strategies
Confidence: 99%
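The DeepPattern description above centers on mining discriminative mid-level patches. As a loose, generic illustration of that idea (not the actual DeepCAMP/DeepPattern training procedure), the sketch below clusters patch features and ranks clusters by class purity; the CNN feature extractor is stubbed with random vectors, and all sizes are arbitrary assumptions.

```python
# Loose illustration of mid-level discriminative patch mining: cluster patch
# features, then rank clusters by how strongly they prefer a single class.
# NOT the cited papers' algorithm; CNN features are faked for self-containment.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_patches, dim, n_classes = 500, 64, 5
features = rng.normal(size=(n_patches, dim))          # stand-in for CNN patch features
labels = rng.integers(0, n_classes, size=n_patches)   # image-level class of each patch

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(features)

# Score each cluster by class purity: a discriminative mid-level pattern
# should be dominated by patches from a single action/attribute class.
for c in range(kmeans.n_clusters):
    member_labels = labels[kmeans.labels_ == c]
    if len(member_labels) == 0:
        continue
    counts = np.bincount(member_labels, minlength=n_classes)
    purity = counts.max() / counts.sum()
    print(f"cluster {c:2d}: size={counts.sum():3d} purity={purity:.2f}")
```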
“…Caffe has also been used to implement 3D-CNN for action recognition (Tran et al, 2015; Poleg et al, 2016; Shou et al, 2016b; Wang et al, 2016d; Singh et al, 2016b), and motion-based approaches for both action (Simonyan and Zisserman, 2014; Singh et al, 2016a; Gkioxari and Malik, 2015) and gesture recognition (Wu et al, 2016b; Wang et al, 2017). Caffe is preferred to other frameworks for its speed and efficiency, especially in “fused” architectures for action recognition (Singh et al, 2016b; Deng et al, 2015; Diba et al, 2016; Peng and Schmid, 2016). Popular network types like FNN, CNN, LSTM, and RNN are fully supported by CNTK (Yu et al, 2014), which was started by speech processing researchers.…”
Section: Platforms
Confidence: 99%