Multimedia event detection with multimodal feature fusion and temporal concept localization

Oh, Sung Jong; McCloskey, Scott; Kim, Ilseo; Vahdat, Arash; Cannons, Kevin; Hajimirsadeghi, Hossein; Mori, Greg; Perera, A. G. Amitha; Pandey, Megha; Corso, Jason J.

doi:10.1007/s00138-013-0525-x

Cited by 47 publications

(22 citation statements)

References 45 publications

“…When combined with a linear SVM, excellent results on the leading NIST TRECVID event detection benchmarks [54] are reported for scenarios where many and few examples are available. The CNN video representation outperforms more traditional video encodings such as improved dense trajectories [71], [72] and multimedia representations combining appearance, motion and audio features [49], [52], [81]. However, both the learned and engineered representations are incapable, nor intended to, recognize events when examples are completely absent.…”

Section: Introduction (mentioning)

confidence: 99%

Video2vec Embeddings Recognize Events When Examples Are Scarce

Habibian, Mensink, Snoek. 2017. IEEE Trans. Pattern Anal. Mach. Intell.

Abstract: This paper aims for event recognition when video examples are scarce or even completely absent. The key in such a challenging setting is a semantic video representation. Rather than building the representation from individual attribute detectors and their annotations, we propose to learn the entire representation from freely available web videos and their descriptions using an embedding between video features and term vectors. In our proposed embedding, which we call Video2vec, the correlations between the words are utilized to learn a more effective representation by optimizing a joint objective balancing descriptiveness and predictability. We show how learning the Video2vec using a multimodal predictability loss, including appearance, motion and audio features, results in a better predictable representation. We also propose an event specific variant of Video2vec to learn a more accurate representation for the words, which are indicative of the event, by introducing a term sensitive descriptiveness loss. Our experiments on three challenging collections of web videos from the NIST TRECVID Multimedia Event Detection and Columbia Consumer Videos datasets demonstrate: i) the advantages of Video2vec over representations using attributes or alternative embeddings, ii) the benefit of fusing video modalities by an embedding over common strategies, iii) the complementarity of term sensitive descriptiveness and multimodal predictability for event recognition. By its ability to improve predictability of present day audio-visual video features, while at the same time maximizing their semantic descriptiveness, Video2vec leads to state-of-the-art accuracy for both few- and zero-example recognition of events in video.

“…from the text search [9,15,23,6] or the visual search [28,5]. Due to the challenge of multimedia retrieval, features from multiple modalities are usually used to achieve better performance [20,8,24]. However, performing PRF on multimodal tasks such as event search is an important yet unaddressed problem.…”

Section: Introduction (mentioning)

confidence: 99%

Zero-Example Event Search using MultiModal Pseudo Relevance Feedback

Jiang, Mitamura, et al. 2014. Proceedings of International Conference on Multimedia Retrieval

We propose a novel method MultiModal Pseudo Relevance Feedback (MMPRF) for event search in video, which requires no search examples from the user. Pseudo Relevance Feedback has shown great potential in retrieval tasks, but previous works are limited to unimodal tasks with only a single ranked list. To tackle the event search task which is inherently multimodal, our proposed MMPRF takes advantage of multiple modalities and multiple ranked lists to enhance event search performance in a principled way. The approach is unique in that it leverages not only semantic features, but also non-semantic low-level features for event search in the absence of training data. Evaluated on the TRECVID MEDTest dataset, the approach improves the baseline by up to 158% in terms of the mean average precision. It also significantly contributes to CMU Team's final submission in TRECVID-13 Multimedia Event Detection.

“…Wang et al [38] discussed a notable system in TRECVID 2012 that is characterized by applying feature selection over so-called motion relativity features. Oh et al [31] presented a latent SVM event detector that enables for temporal evidence localization. Jiang et al [19] presented an efficient method to learn "optimal" spatial event representations from data.…”

Section: Related Work (mentioning)

confidence: 99%

“…A number of studies have been proposed to tackle this problem on using several training examples (typically 10 or 100 examples) [14,9,38,11,31,19,34,3,36]. Generally, in a state-of-the-art system, the event classifiers are trained by low-level and high-level features, and the final decision is derived from the fusion of the individual classification results.…”

Section: Related Work (mentioning)

confidence: 99%

Bridging the Ultimate Semantic Gap

Jiang, Meng, et al. 2015. Proceedings of the 5th ACM on International Conference on Multimedia Retrieval

Semantic search in video is a novel and challenging problem in information and multimedia retrieval. Existing solutions are mainly limited to text matching, in which the query words are matched against the textual metadata generated by users. This paper presents a state-of-the-art system for event search without any textual metadata or example videos. The system relies on substantial video content understanding and allows for semantic search over a large collection of videos. The novelty and practicality is demonstrated by the evaluation in NIST TRECVID 2014, where the proposed system achieves the best performance. We share our observations and lessons in building such a state-of-the-art system, which may be instrumental in guiding the design of the future system for semantic search in video.
