2012 IEEE Conference on Computer Vision and Pattern Recognition 2012
DOI: 10.1109/cvpr.2012.6247814

Multimodal feature fusion for robust event detection in web videos

Abstract: Combining multiple low-level visual features is a proven and effective strategy for a range of computer vision tasks. However, limited attention has been paid to combining such features with information from other modalities, such as audio and videotext, for large scale analysis of web videos. In our work, we rigorously analyze and combine a large set of low-level features that capture appearance, color, motion, audio and audio-visual co-occurrence patterns in videos. We also evaluate the utility of high-level…

Cited by 146 publications (127 citation statements)
References 32 publications
“…For video retrieval, significant research effort has been devoted in the form of event detection applied to the TRECVID MED collection. We refer the readers to an excellent state-of-the-art work of Natarajan et al [11] for more details. Video recognition has been an active research area in computer vision.…”
Section: Previous Work (mentioning)
confidence: 99%
“…Distance from threshold This is a weighted averaging method [3] that dynamically adjusts the weights of each data type for each video clip based on how far the score is from its decision threshold. If the detection score is near the threshold, the correct decision is presumed to be somewhat uncertain, and a lower weight is assigned.…”
Section: Sparse Mixture Model (SMM) (mentioning)
confidence: 99%
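
The weighting idea in the statement above can be illustrated with a minimal sketch. The quoted text does not give the exact formula, so this is an assumption rather than the cited method: each data type's weight is taken to be the absolute distance of its score from its decision threshold, and the weights are normalized per clip. The function name `distance_from_threshold_fusion` and the `eps` parameter are hypothetical.

```python
from typing import Sequence


def distance_from_threshold_fusion(
    scores: Sequence[float],
    thresholds: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """Fuse per-data-type detection scores for one video clip.

    Assumption (not stated in the source): the weight of each data type is
    |score - threshold|, so scores near their threshold count for less.
    """
    if len(scores) != len(thresholds):
        raise ValueError("scores and thresholds must have the same length")
    if not scores:
        raise ValueError("at least one score is required")

    # eps keeps the normalization defined when every score sits on its threshold.
    weights = [abs(s - t) + eps for s, t in zip(scores, thresholds)]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total


# One confident data type (0.9) and one uncertain one (0.52), both with threshold 0.5.
fused = distance_from_threshold_fusion([0.9, 0.52], [0.5, 0.5])
```

In this example the fused score lies closer to 0.9 than the plain average of 0.71 does, because the score near its threshold receives a much smaller weight.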
“…In our approach, an event is modeled as a set of multiple bags-of-words, each based on a single data type. Partitioning the representation by data type permits the descriptors for each data type to be optimized independently (specific multimodal combinations of features, such as bimodal audiovisual features [3], can be considered a single data type within this architecture). The data types we used included both low-level features (visual appearance, motion, and audio) and higher-level semantic concepts (visual concepts).…”
Section: Introduction (mentioning)
confidence: 99%