2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.291
Actions ~ Transformations

Abstract: What defines an action like "kicking ball"? We argue that the true meaning of an action lies in the change, or transformation, an action brings to the environment. In this paper, we propose a novel representation for actions by modeling an action as a transformation which changes the state of the environment before the action happens (precondition) to the state after the action (effect). Motivated by recent advances in video representation using deep learning, we design a Siamese network which models the act…
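The precondition/effect idea above can be sketched in a few lines: embed the frames before and after the action, apply a learned per-action transformation to the precondition embedding, and score how well the result matches the observed effect. This is a minimal illustrative sketch, not the paper's Siamese implementation; `embed`, the matrices `W`, and the cosine score are simplifying assumptions standing in for the learned CNN towers.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(frames):
    """Stand-in for a CNN tower: average per-frame features (hypothetical)."""
    return frames.mean(axis=0)

# Hypothetical per-action transformation matrices W[a] (D x D), each mapping
# a precondition embedding toward the corresponding effect embedding.
D, num_actions = 8, 3
W = rng.standard_normal((num_actions, D, D)) * 0.1

def score(pre_frames, post_frames, a):
    """Cosine similarity between the transformed precondition and the effect."""
    p = embed(pre_frames)   # state before the action (precondition)
    e = embed(post_frames)  # state after the action (effect)
    t = W[a] @ p            # predicted effect under action a
    return float(t @ e / (np.linalg.norm(t) * np.linalg.norm(e) + 1e-8))

def classify(pre_frames, post_frames):
    """Pick the action whose transformation best explains the observed change."""
    return int(np.argmax([score(pre_frames, post_frames, a)
                          for a in range(num_actions)]))
```

At recognition time the action label is simply the transformation that best maps the observed precondition onto the observed effect; in the paper this matching is learned end to end rather than with fixed matrices.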


Cited by 187 publications (139 citation statements) · References 53 publications
"…Table 7. Performance comparison with the state-of-the-art:

Method                      UCF101   HMDB51
Slow Fusion CNN [12]        65.4%    -
LRCN [5]                    82.9%    -
C3D [28]                    85.2%    -
Two-Stream (AlexNet) [22]   88.0%    59.4%
Two-Stream + LSTM [37]      88.6%    -
Two-Stream + Pooling [37]   88.2%    -
Transformation [33]         92.4%    62.0%
Two-Stream (VGG-16) [6]     90.6%    58.2%
Two-Stream + Fusion [6]     92.5%    65.4%
TSN (BN-Inception) [32]     94.0%    68.5%
Ours (VGG-16)               93.2%    66.1%
Ours (ResNet-50)            93.8%    66.5%
Ours (BN-Inception)         94.6%    68.9%…"
Section: Final Results
Confidence: 99%
"…Since the optical flow data brings a significant performance gain, it has recently been employed in many other action recognition methods [2,5,25,29,34,37,26,33,6]. However, the original two-stream method [22] has two main drawbacks: first, it only incorporates 10 consecutive optical-flow frames, so it cannot capture long-term temporal cues.…"
Section: Related Work
Confidence: 99%
"…(c) Recurrent Spatial Networks [15][22], which apply Recurrent Neural Networks, such as LSTM or GRU, to model temporal information in videos. (d) Other approaches [23][24][25][26], which use other solutions to generate compact features for video representation and classification.…"
Section: Deep Neural Networks for Large-scale Video Classification
Confidence: 99%
"…Without rules for logical reasoning, many approaches employ hand-crafted [19,24,34,43] or deep-learned [8,9,23,36,44,45] features of appearance and motion for action recognition. Recently, researchers have attempted to use semantic-level state changes [1,7,10,25,49,50] for video analysis. For example, Liu et al. [25] adopted unary fluents to represent attributes of a single object and binary fluents for pairs of objects in egocentric videos, and then used an LSTM [11] to recognize which action is performed.…"
Section: Related Work
Confidence: 99%