Proceedings of the 22nd ACM International Conference on Multimedia 2014
DOI: 10.1145/2647868.2654931

Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification

Abstract: Videos contain very rich semantics and are intrinsically multimodal. In this paper, we study the challenging task of classifying videos according to their high-level semantics such as human actions or complex events. Although extensive efforts have been paid to study this problem, most existing works combined multiple features using simple fusion strategies and neglected the exploration of inter-class semantic relationships. In this paper, we propose a novel unified framework that jointly learns feature relati…

Cited by 134 publications (79 citation statements)
References 44 publications
“…The detailed analyses of these results are presented as follows.

  Method                    Hollywood-2   HMDB51
  Jain et al. [18]          62.5%         52.1%
  Oneata et al. [14]        63.3%         54.8%
  Wang et al. [12]          64.3%         57.2%
  Wu et al. [19]            64.5%         -
  Simonyan et al. [20]      -             59.4%
  STFV+CombFV               66.72%        61.50%
  STFV+CombFV+ScaleFV       66.96%        61.07%

This may be because STFV has a larger dimension than ScaleFV, i.e., the dimension of STFV is (2 +1+7 ) , while the dimension of ScaleFV is (2 + 1 + 5 ) . Such a comparison shows that it may not be sufficient to consider the scale information alone for human action recognition, which inspires us to consider a more powerful encoding method that takes both scale and spatial-temporal position information into account.…”
Section: Results
Confidence: 99%
“…Recently, deep learning technologies have been utilized in Web video classification and achieved significant performance improvement [25][26][27]. In [26], a video classification system was presented using regularizations in deep neural networks.…”
Section: Web Video Classification
Confidence: 99%
“…In [26], a video classification system was presented using regularizations in deep neural networks. Both feature and class relationships were explored to obtain better video classification performance.…”
Section: Web Video Classification
Confidence: 99%
“…The equation can decrease the magnitude of the weights and is commonly used to prevent overfitting [47].…”
Section: The Framework
Confidence: 99%
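The weight-magnitude regularizer referenced in the statement above is, in standard practice, an L2 penalty (weight decay). As a minimal sketch only — the function name, learning rate, and penalty strength below are illustrative assumptions, not the cited paper's actual formulation — adding 0.5 * lam * ||w||^2 to the loss contributes lam * w to the gradient, shrinking the weights at every update:

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, lam=0.01):
    """One SGD step with an L2 (weight decay) penalty.

    The penalty 0.5 * lam * ||w||^2 adds lam * w to the data gradient,
    so each step multiplies w toward zero and discourages large weights.
    (Hypothetical sketch; lr and lam are illustrative values.)
    """
    return w - lr * (grad + lam * w)

w = np.array([1.0, -2.0, 3.0])
# With a zero data gradient, the update shrinks every weight by the
# factor (1 - lr * lam) = 0.999, which is the overfitting control
# the quoted statement describes.
w_next = sgd_step_with_weight_decay(w, np.zeros_like(w))
```

With lam = 0 the step reduces to plain SGD, which makes the decay term easy to isolate when checking its effect.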