Proceedings of the 14th ACM International Conference on Multimedia 2006
DOI: 10.1145/1180639.1180698

Multimodal fusion using learned text concepts for image categorization

Abstract: Conventional image categorization techniques primarily rely on low-level visual cues. In this paper, we describe a multimodal fusion scheme which improves the image classification accuracy by incorporating the information derived from the embedded texts detected in the image under classification. Specific to each image category, a text concept is first learned from a set of labeled texts in images of the target category using Multiple Instance Learning [1]. For an image under classification which contains mult…
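
The abstract is truncated, so the exact aggregation and fusion rules are not visible here. As a rough illustration only, the sketch below assumes a noisy-OR rule (a common choice in Multiple Instance Learning) for turning per-text scores into one image-level text score, followed by a weighted linear late fusion with a visual score. The names `bag_probability`, `fuse`, and `text_weight` are hypothetical and do not come from the paper.

```python
from math import prod


def bag_probability(instance_probs):
    """Noisy-OR aggregation (assumed, not taken from the paper).

    Treats the detected texts in one image as a bag of instances and
    returns the probability that at least one of them matches the
    learned text concept. An empty bag yields 0.0.
    """
    return 1.0 - prod(1.0 - p for p in instance_probs)


def fuse(visual_prob, text_prob, text_weight=0.5):
    """Weighted linear late fusion of two scores (illustrative only)."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    return (1.0 - text_weight) * visual_prob + text_weight * text_prob


if __name__ == "__main__":
    # Two detected text regions, each matching the concept with p = 0.5.
    text_score = bag_probability([0.5, 0.5])
    print(text_score)
    print(fuse(visual_prob=0.4, text_prob=text_score, text_weight=0.5))
```

With noisy-OR, a single strongly matching text region is enough to push the image-level text score close to 1, which fits the Multiple Instance Learning view that a positive bag needs only one positive instance.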

Cited by 40 publications (20 citation statements)
References 21 publications

“…In the area of image classification, Zhu et al [156] have reported a multimodal fusion framework to classify the images that have embedded text within their spatial coordinates. The fusion process followed two steps.…”
Section: Support Vector Machine (mentioning)
confidence: 99%
“…Bayesian inference method has been successfully used to fuse multimodal information (at the feature level and at the decision level) for performing various multimedia analysis tasks. [Table entry: Multimodal fusion using visual and text cues for image classification based on pair-wise SVM classifier [156]] An example of Bayesian inference fusion at the feature level is the work by Pitsikalis et al [102] for audio-visual speech recognition.…”
Section: Bayesian Inference (mentioning)
confidence: 99%
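
To make the decision-level case in the statement above concrete, here is a minimal sketch of Bayesian fusion of per-modality class posteriors. It assumes the modalities are conditionally independent given the class, so that P(c | x_1..x_M) is proportional to P(c)^(1-M) times the product of the P(c | x_m). The function name `bayes_fuse` and the example class labels are hypothetical; this is not the formulation of any cited work.

```python
def bayes_fuse(posteriors, prior):
    """Fuse per-modality posteriors under a conditional-independence assumption.

    posteriors: list of dicts mapping class -> P(class | modality input)
    prior:      dict mapping class -> P(class); every value must be > 0
    Returns a dict of normalized fused posteriors.
    """
    m = len(posteriors)
    scores = {}
    for cls, p_cls in prior.items():
        # Dividing out the prior (m - 1) times avoids counting it once per modality.
        score = p_cls ** (1 - m)
        for post in posteriors:
            score *= post[cls]
        scores[cls] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all fused scores are zero; cannot normalize")
    return {cls: s / total for cls, s in scores.items()}


if __name__ == "__main__":
    prior = {"indoor": 0.5, "outdoor": 0.5}
    visual = {"indoor": 0.6, "outdoor": 0.4}
    text = {"indoor": 0.7, "outdoor": 0.3}
    print(bayes_fuse([visual, text], prior))
```

When two modalities agree, the fused posterior is sharper than either input, which is the usual motivation for combining decisions rather than picking a single classifier.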
“…To classify images, Wang et al [54] propose to combine visual cues with corresponding user tags. In [66], the authors exploit textual information by extracting visual features around the text regions and combining them with global visual features. Different from these works, in this paper, textual information is obtained from detected and recognized text in natural scene images.…”
Section: Related Work (mentioning)
confidence: 99%
“…The increasing availability of online digital video has rekindled interest in the problems of how to index multimedia information sources automatically and how to browse and manipulate them efficiently (David, 1998) (Snoek & Worring, 2005) (Zhu et al, 2006). The need for efficient content-based video indexing and retrieval has increased due to the rapid growth of video data available to consumers.…”
Section: Text Extraction For Video Indexing (mentioning)
confidence: 99%