Proceedings of the 17th ACM International Conference on Multimedia 2009
DOI: 10.1145/1631272.1631277

Short-term audio-visual atoms for generic video concept classification

Abstract: We investigate the challenging issue of joint audio-visual analysis of generic videos targeting semantic concept detection. We propose to extract a novel representation, the Short-term Audio-Visual Atom (S-AVA), for improved concept detection. An S-AVA is defined as a short-term region track associated with regional visual features and background audio features. An effective algorithm, named Short-Term Region tracking with joint Point Tracking and Region Segmentation (STR-PTRS), is developed to extract S-AVA…
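To make the abstract's definition concrete, here is a minimal sketch of what an S-AVA could look like as a data structure: a short-term region track paired with regional visual features and the window's background audio features. All names, fields, and values below are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ShortTermAudioVisualAtom:
    """Hypothetical sketch of an S-AVA: a short-term region track plus
    per-region visual features and window-level background audio features."""
    # (frame_index, bounding_box) pairs for the tracked region
    region_track: List[Tuple[int, Tuple[int, int, int, int]]]
    visual_features: List[float]   # e.g. color/texture/motion descriptors
    audio_features: List[float]    # e.g. background audio descriptors

    def joint_feature(self) -> List[float]:
        """Concatenate visual and audio descriptors into one joint vector."""
        return self.visual_features + self.audio_features

# Toy example: a region tracked over two frames with small feature vectors.
atom = ShortTermAudioVisualAtom(
    region_track=[(0, (10, 10, 40, 40)), (1, (12, 11, 42, 41))],
    visual_features=[0.2, 0.5, 0.1],
    audio_features=[0.7, 0.3],
)
assert atom.joint_feature() == [0.2, 0.5, 0.1, 0.7, 0.3]
```

The key design point captured here is that the atom ties audio and visual evidence to the same short temporal window, so a downstream classifier sees them jointly rather than fused late.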

Cited by 52 publications (38 citation statements)
References 30 publications
“…Zhou et al [64] use Gaussian mixture models (GMM) instead of the standard BoW approach to describe an event as a SIFT-Bag. Jiang et al [19] follow the same direction by defining an event as a combination of short-term audio-visual atoms. Nevertheless, valuable spatial context is neglected due to the spatial structure-free BoW representation, which limits the potential of these methods.…”
Section: Related Work
confidence: 99%
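The contrast drawn above between the standard BoW approach and a GMM-based description comes down to hard versus soft assignment of local descriptors. The toy 1-D codebook and Gaussian components below are illustrative assumptions, not values from either cited paper.

```python
import math

# Toy 1-D codebook (BoW) and matching Gaussian components (GMM-style).
codebook = [0.0, 1.0, 2.0]                          # codeword centers
components = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)]   # (mean, std) pairs

def bow_assign(x):
    """Hard assignment: index of the nearest codeword (ties go to the first)."""
    return min(range(len(codebook)), key=lambda i: abs(x - codebook[i]))

def gmm_posteriors(x):
    """Soft assignment: normalized Gaussian likelihoods (equal priors assumed)."""
    lik = [math.exp(-((x - m) ** 2) / (2 * s * s)) / s for m, s in components]
    z = sum(lik)
    return [l / z for l in lik]

# A descriptor halfway between two codewords: BoW must commit to one codeword,
# while the GMM splits its responsibility between the two nearest components.
p = gmm_posteriors(0.5)
assert bow_assign(0.5) == 0
assert abs(p[0] - p[1]) < 1e-9 and p[2] < 0.05
```

The soft posteriors retain information about ambiguous descriptors that hard quantization discards, which is the advantage the SIFT-Bag line of work exploits.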
“…Jiang et al [10] grouped visual and audio features together with their temporal relationship and computed combined features from these groups. Similarly, Jhou et al [13] constructed a bigraph with temporal concurrency between visual words and employed a k-way segmentation algorithm to combine visual and audio features.…”
Section: Related Work
confidence: 99%
“…Researchers have proposed several approaches to make further use of the different kinds of information in videos. Jiang et al [10] introduced an audio-visual atom as a joint audio-visual feature for video concept recognition. Jiang and Loui [11] used the temporal relationship between audio and visual features to form groups, and then constructed new features from the groups.…”
Section: Introduction
confidence: 99%
“…Early Fusion: In [1], an audio-visual representation named the short-term audio-visual atom is proposed. It is a concatenation of color/texture, motion, and auditory features.…”
Section: Introduction
confidence: 99%
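Early fusion, as described in the statement above, simply concatenates per-modality descriptors into one vector before classification. The sketch below is a minimal illustration under one common assumption (not stated in [1]): each modality is L2-normalized first so that no modality dominates purely by scale. Function names and dimensions are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a descriptor to unit L2 norm so no modality dominates by scale."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def early_fusion(color_texture, motion, audio):
    """Concatenate per-modality descriptors (each L2-normalized) into one vector."""
    return l2_normalize(color_texture) + l2_normalize(motion) + l2_normalize(audio)

# Toy descriptors: 2-D color/texture, 1-D motion, 2-D audio -> 5-D joint vector.
joint = early_fusion([3.0, 4.0], [1.0], [0.0, 2.0])
assert len(joint) == 5
assert joint[:2] == [0.6, 0.8]
```

Because fusion happens before any modeling, a single classifier is trained on the joint vector; the trade-off versus late fusion is that modality-specific statistics are mixed early and cannot be weighted per modality afterward.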