Multimodal fusion using learned text concepts for image categorization

Zhu, Qiaoming; Yeh, Mei-Chen; Cheng, Kwang-Ting

doi:10.1145/1180639.1180698

Cited by 40 publications

(20 citation statements)

References 21 publications


“…In the area of image classification, Zhu et al [156] have reported a multimodal fusion framework to classify the images that have embedded text within their spatial coordinates. The fusion process followed two steps.…”

Section: Support Vector Machine (mentioning)

confidence: 99%

“…Bayesian inference method has been successfully used to fuse multimodal information (at the feature level and at the decision level) for performing various multimedia analysis tasks. An example of Bayesian inference fusion at the feature level is the work by Pitsikalis et al [102] for audio-visual speech recognition.…”

(Table text interleaved in the quoted passage: "Multimodal fusion using visual and text cues for image classification based on pair-wise SVM classifier [156]")

Section: Bayesian Inference (mentioning)

confidence: 99%

“…Similarly, for the video shot retrieval some researchers use mean average precision [83]. While performing image categorization, the accuracy of the classification is measured in terms of image category detection rate [156].…”

Section: Evaluation Measures (mentioning)

confidence: 99%


Multimodal fusion for multimedia analysis: a survey

et al. 2010


This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia analysis tasks. The existing literature on multimodal fusion research is presented through several classifications based on the fusion methodology and the level of fusion (feature, decision, and hybrid). The fusion methods are described from the perspective of the basic concept, advantages, weaknesses, and their usage in various analysis tasks as reported in the literature. Moreover, several distinctive issues that influence a multimodal fusion process such as, the use of correlation and independence, confidence level, contextual information, synchronization between different modalities, and the optimal modality selection are also highlighted. Finally, we present the open issues for further research in the area of multimodal fusion.


“…To classify images, Wang et al [54] propose to combine visual cues with corresponding user tags. In [66], the authors exploit textual information by extracting visual features around the text regions and combining them with global visual features. Different from these works, in this paper, textual information is obtained from detected and recognized text in natural scene images.…”

Section: Related Work (mentioning)

confidence: 99%

Words Matter: Scene Text for Image Classification and Retrieval

Karaoğlu, Tao, Gevers, et al. 2017

IEEE Trans. Multimedia

110


Abstract: Text in natural images typically adds meaning to an object or scene. In particular, text specifies which business places serve drinks (e.g. cafe, teahouse) or food (e.g. restaurant, pizzeria), and what kind of service is provided (e.g. massage, repair). The mere presence of text, its words and meaning are closely related to the semantics of the object or scene. This paper exploits textual contents in images for fine-grained business place classification and logo retrieval. There are four main contributions. First, we show that the textual cues extracted by the proposed method are effective for the two tasks. Combining the proposed textual and visual cues outperforms visual only classification and retrieval by a large margin. Second, to extract the textual cues, a generic and fully unsupervised word box proposal method is introduced. The method reaches state-of-the-art word detection recall with a limited number of proposals. Third, contrary to what is widely acknowledged in text detection literature, we demonstrate that high recall in word detection is more important than high f-score at least for both tasks considered in this work. Last, this paper provides a large annotated text detection dataset with 10K images and 27601 word boxes.


“…The increasing availability of online digital video has rekindled interest in the problems of how to index multimedia information sources automatically and how to browse and manipulate them efficiently (David, 1998) (Snoek & Worring, 2005) (Zhu et al, 2006). The need for efficient content-based video indexing and retrieval has increased due to the rapid growth of video data available to consumers.…”

Section: Text Extraction For Video Indexing (mentioning)

confidence: 99%