When textual and visual information join forces for multimedia retrieval

Safadi, Bahjat; Sahuguet, Mathilde; Huet, Benoît

doi:10.1145/2578726.2578760

Cited by 44 publications

(18 citation statements)

References 16 publications


“…Secondly, we use the cross-media fusion [5] of three modalities and thirdly the random-walk approach of [12]. Fourth baseline method is the non-linear fusion [22] of all modalities and finally we compare our framework with the extension of the unifying fusion framework of [2] in the case of three modalities [9] in two cases: first with the SIFT visual descriptors and second with the state-of-the-art DCNN visual features. Our proposed framework combines SIFT with DCNN using PLS Regression, using non-linear graph-based fusion of all three modalities.…”

Section: Results (mentioning)

confidence: 99%

“…The aforementioned combination process is known as multimodal fusion. An example of a study investigating multimodal fusion is the work of [22], in which a framework for video retrieval is presented. This framework extends conventional text-based search by fusing textual and visual similarity scores in a simple non-linear way.…”

Section: Related Work (mentioning)

confidence: 99%

“…Alternative to the linear fusion method of Equation (1), the non-linear analogue has been considered in multimedia retrieval tasks [22]:…”

Section: Graph-based Fusion In Multimedia Retrieval (mentioning)

confidence: 99%


Multimedia retrieval based on non-linear graph-based fusion and partial least squares regression

Gialampoukidis, Moumtzidou, Liparas, et al. 2017

Multimed Tools Appl


Heterogeneous sources of information, such as images, videos, text and metadata are often used to describe different or complementary views of the same multimedia object, especially in the online news domain and in large annotated image collections. The retrieval of multimedia objects, given a multimodal query, requires the combination of several sources of information in an efficient and scalable way. Towards this direction, we provide a novel unsupervised framework for multimodal fusion of visual and textual similarities, which are based on visual features, visual concepts and textual metadata, integrating non-linear graph-based fusion and Partial Least Squares Regression. The fusion strategy is based on the construction of a multimodal contextual similarity matrix and the non-linear combination of relevance scores from query-based similarity vectors. Our framework can employ more than two modalities and high-level information, without increase in memory complexity, when compared to state-of-the-art baseline methods. The experimental comparison is done in three public multimedia collections in the multimedia retrieval task. The results have shown that the proposed method outperforms the baseline methods, in terms of Mean Average Precision and Precision@20.
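As a rough illustration of the score-level fusion this abstract describes, the sketch below contrasts a linear weighted sum of textual and visual similarity scores with one possible non-linear combination. The weighted-product form, the min-max normalisation, the function names, and the toy scores are all assumptions chosen for illustration; they are not the formulas or data of the cited papers, and the graph-based and PLS Regression components are not shown.

```python
import numpy as np


def min_max(scores):
    """Rescale a score vector to [0, 1]; a constant vector maps to zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


def linear_fusion(text_sim, visual_sim, alpha=0.5):
    """Weighted sum of the normalised per-modality scores."""
    return alpha * min_max(text_sim) + (1 - alpha) * min_max(visual_sim)


def nonlinear_fusion(text_sim, visual_sim, alpha=0.5):
    """Weighted product (geometric mean) of the normalised scores.

    Illustrative assumption only: an item must score well in both
    modalities to rank highly, unlike in the linear sum.
    """
    return min_max(text_sim) ** alpha * min_max(visual_sim) ** (1 - alpha)


# Hypothetical query-to-item similarities for four items.
text_sim = [0.9, 0.2, 0.6, 0.0]
visual_sim = [0.1, 0.8, 0.7, 0.0]

linear_scores = linear_fusion(text_sim, visual_sim)
nonlinear_scores = nonlinear_fusion(text_sim, visual_sim)

# Item indices sorted from most to least relevant under each scheme.
linear_rank = np.argsort(-linear_scores)
nonlinear_rank = np.argsort(-nonlinear_scores)
print("linear ranking:", linear_rank)
print("non-linear ranking:", nonlinear_rank)
```

The weight `alpha` controls how much the textual modality contributes; in practice it would be tuned on held-out queries rather than fixed at 0.5.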


“…A number of studies have been proposed to tackle this problem on using several training examples (typically 10 or 100 examples) [14,9,38,11,31,19,34,3,36]. Generally, in a state-of-the-art system, the event classifiers are trained by low-level and high-level features, and the final decision is derived from the fusion of the individual classification results.…”

Section: Related Work (mentioning)

confidence: 99%

Bridging the Ultimate Semantic Gap

Jiang, Meng, et al. 2015

Proceedings of the 5th ACM on International Conference on Multimedia Retrieval


Semantic search in video is a novel and challenging problem in information and multimedia retrieval. Existing solutions are mainly limited to text matching, in which the query words are matched against the textual metadata generated by users. This paper presents a state-of-the-art system for event search without any textual metadata or example videos. The system relies on substantial video content understanding and allows for semantic search over a large collection of videos. The novelty and practicality is demonstrated by the evaluation in NIST TRECVID 2014, where the proposed system achieves the best performance. We share our observations and lessons in building such a state-of-the-art system, which may be instrumental in guiding the design of the future system for semantic search in video.


“…Scene [8] (e.g., text, graphics drawings or images) and to use it for various applications [9,10] (e.g., multimedia search, retrieval or recommendation). An image is a visual representation of things, which is more intuitive than text.…”

Section: Comparison Results For Outlier Detection (60% Outliers) On U (mentioning)

confidence: 99%

On handbag recognition and recommendation

Wang
