2013
DOI: 10.1007/s00138-013-0567-0

Discovering joint audio–visual codewords for video event detection

Cited by 45 publications (42 citation statements)
References 18 publications

“…Xu et al. [48] and Ye et al. [50] adopted late fusion with specially designed methods to remove the noise of individually trained classifiers, and Jhuo et al. used a joint audio-visual codebook for classification [16]. Our approach is fundamentally different from these state-of-the-art methods in its design and produces significantly higher performance.…”
Section: Results of the Entire Framework (mentioning)
confidence: 94%
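
For context, a minimal sketch of the score-level late fusion these statements refer to: one classifier is trained per modality, and their probability outputs are combined by a weighted average. The SVC classifiers, the equal weights, and all variable names are illustrative assumptions, not the specific noise-removal schemes of [48] or [50].

```python
# Hedged sketch of score-level late fusion for audio-visual event
# detection. Classifier choice and weights are assumptions.
from sklearn.svm import SVC

def late_fusion_scores(X_audio, X_visual, y_train,
                       X_audio_test, X_visual_test,
                       w_audio=0.5, w_visual=0.5):
    """Train one classifier per modality, then average their scores."""
    clf_a = SVC(probability=True).fit(X_audio, y_train)
    clf_v = SVC(probability=True).fit(X_visual, y_train)
    # Per-modality event probabilities on the test clips.
    p_a = clf_a.predict_proba(X_audio_test)
    p_v = clf_v.predict_proba(X_visual_test)
    # Late fusion = weighted combination of independent decisions.
    return w_audio * p_a + w_visual * p_v
```
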
“…The approach was further enhanced in [18], where the temporal interaction of audio-visual features was investigated. Jhuo et al. [16] improved the speed of training the audio-visual joint codebook by using standard local visual features such as SIFT, instead of segmentation-based region features.…”
Section: Fusing Multiple Features (mentioning)
confidence: 99%
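
As a rough illustration of the codebook training being discussed, here is a hedged sketch of the standard bag-of-words pipeline over local descriptors such as SIFT: quantize training descriptors with k-means, then histogram each clip's codeword assignments. The codebook size and function names are assumptions, not the accelerated procedure of [16].

```python
# Illustrative sketch of building a visual codebook by k-means
# quantization of local descriptors (e.g., 128-D SIFT vectors pooled
# from training videos). Sizes and names are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, n_words=1000, seed=0):
    """descriptors: (N, 128) array of local features from training frames."""
    return KMeans(n_clusters=n_words, random_state=seed,
                  n_init=10).fit(descriptors)

def bag_of_words(km, clip_descriptors):
    """L1-normalized histogram of codeword assignments for one clip."""
    words = km.predict(clip_descriptors)
    hist = np.bincount(words, minlength=km.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```
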
“…Jiang et al. [10] grouped visual and audio features together with their temporal relationship and computed combined features from these groups. Similarly, Jhuo et al. [13] constructed a bigraph with temporal concurrency between audio and visual words and employed a k-way segmentation algorithm to combine visual and audio features. In this work, we propose to construct a spatio-temporal bigraph and use the k-way segmentation algorithm to combine multiple features.…”
Section: Related Work (mentioning)
confidence: 99%
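
To make the bigraph idea concrete, a hedged sketch: count temporal co-occurrences between audio and visual codewords to weight the edges of a bipartite graph, then partition it into k joint audio-visual codewords. scikit-learn's SpectralCoclustering (Dhillon-style spectral bipartite partitioning) stands in for the k-way segmentation algorithm here; the windowing scheme and all names are assumptions rather than the exact method of [13]/[16].

```python
# Hedged sketch of joint audio-visual codeword discovery via bipartite
# graph partitioning. SpectralCoclustering is an illustrative stand-in
# for the k-way segmentation used in [13]/[16].
import numpy as np
from sklearn.cluster import SpectralCoclustering

def cooccurrence_matrix(audio_words, visual_words, n_audio, n_visual):
    """audio_words, visual_words: lists of per-window codeword-index arrays.
    Edge (a, v) is weighted by how often audio word a and visual word v
    fire within the same temporal window."""
    C = np.zeros((n_audio, n_visual))
    for a_win, v_win in zip(audio_words, visual_words):
        for a in a_win:
            for v in v_win:
                C[a, v] += 1.0
    return C

def joint_codewords(C, k=50, seed=0):
    """Partition the audio-visual bipartite graph into k joint codewords."""
    # Small epsilon keeps row/column sums positive for the normalization.
    model = SpectralCoclustering(n_clusters=k, random_state=seed).fit(C + 1e-9)
    # row_labels_[a] / column_labels_[v] give the joint codeword that
    # audio word a / visual word v belongs to.
    return model.row_labels_, model.column_labels_
```
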
“…Fernando et al. [12] captured video-wide temporal information for action recognition. Jhuo et al. [13] proposed to use concurrent statistical information to construct a bipartite graph for feature fusion. In fact, these methods use the temporal relationship between audio and visual features for early fusion.…”
Section: Introduction (mentioning)
confidence: 99%