Efficient multi-modal retrieval in conceptual space

Imura, Jun; Fujisawa, Teppei; Harada, Tatsuya; Kuniyoshi, Yasuo

doi:10.1145/2072298.2071944

Cited by 7 publications

(4 citation statements)

References 5 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The sounds such as clothe chafing and footsteps emitted by people are used for human identification under the view of camera. CCA also benefits cross-modal and multi-modal retrieval for video and audio [41], in which the queries can be single-modal (image) or multi-modal (combination of image, audio, and location). Perhaps the closest work to our problem is [14], proposed by Zhang et al, which investigates the cross-modal relationship between an animal's image and its corresponding sounds.…”

Section: Related Workmentioning

confidence: 99%

Bridging Music and Image via Cross-Modal Ranking Analysis

Qiao

Wang

et al. 2016

IEEE Trans. Multimedia

View full text Add to dashboard Cite

Human perceptions of music and image are closely related to each other, since both can inspire similar human sensations, such as emotion, motion, and power. This paper aims to explore whether and how music and image can be automatically matched by machines. The main contributions are three aspects. First, we construct a benchmark dataset composed of more than 45, 000 music-image pairs. Human labelers are recruited to annotate whether these pairs are well-matched or not. The results show that they generally agree with each other on the matching degree of music-image pairs. Secondly, we investigate suitable semantic representations of music and image for this cross-modal matching task. In particular, we adopt lyric as a middle-media to connect music and image, and design a set of lyric-based attributes for image representation. Thirdly, we propose cross-modal ranking analysis (CMRA) to learn the semantic similarity between music and image with ranking labeling information. CMRA aims to find the optimal embedding spaces for both music and image in the sense of maximizing the ordinal margin between music-image pairs. The proposed method is able to learn the non-linear relationship between music and image, and to integrate heterogeneous ranking data from different modalities into a unified space. Experimental results demonstrate that the proposed method outperforms state-of-theart cross-modal methods in the music-image matching task, and achieves a consistency rate of 91.5% with human labelers.

show abstract

Section: Related Workmentioning

confidence: 99%

Bridging Music and Image via Cross-Modal Ranking Analysis

Qiao

Wang

et al. 2016

IEEE Trans. Multimedia

View full text Add to dashboard Cite

show abstract

“…However, text-based retrieval methods only let users describe the past in concrete language. Another direction is using images or videos as user queries (Imura et al 2011;Chandrasekhar et al 2014), which involves a content-based image retrieval method (Smeulders et al 2000). However, content-based image retrieval requires images or videos including objects and specific locations about what we want to remember as queries.…”

Section: Related Workmentioning

confidence: 99%

Egocentric Video Search via Physical Interactions

Miyanishi

Hirayama

Kong

et al. 2016

AAAI

View full text Add to dashboard Cite

Retrieving past egocentric videos about personal daily life is important to support and augment human memory. Most previous retrieval approaches have ignored the crucial feature of human-physical world interactions, which is greatly related to our memory and experience of daily activities. In this paper, we propose a gesture-based egocentric video retrieval framework, which retrieves past visual experience using body gestures as non-verbal queries. We use a probabilistic framework based on a canonical correlation analysis that models physical interactions through a latent space and uses them for egocentric video retrieval and re-ranking search results. By incorporating physical interactions into the retrieval models, we address the problems resulting from the variability of human motions. We evaluate our proposed method on motion and egocentric video datasets about daily activities in household settings and demonstrate that our egocentric video retrieval framework robustly improves retrieval performance when retrieving past videos from personal and even other persons' video archives.

show abstract

“…The key problem in cross-modal retrieval is how to measure the similarity among different media modalities, the existing methods usually focus on a common space in which the classical measure can be directly applied. The common spaces include correlative subspace [6,7,8], semantic space [8] and hash space [12].…”

Section: Introduction • Yansheng Lumentioning

confidence: 99%

“…Rasiwasia et al [8] apply CCA to learn the subspace that maximizes the correlation between image and text. Lmura et al [7] use GCCA to simultaneously focus on the correlation among image, sound and location information. Li et al [6] introduce CFA to seek transformations that best represent the association between two different modalities.…”

Section: Introduction • Yansheng Lumentioning

confidence: 99%

A Graph Model for Cross-modal Retrieval

Wang

Pan

Lu³

2013

Proceedings of 3rd International Conference on Multimedia Technology(ICMT-13)

View full text Add to dashboard Cite

Abstr act With the rapid growth of multimedia document on the web, cross-modal retrieval has become an important issue. The modality of a query is different from that of the retrieved results in cross-modal retrieval. In this paper, we propose a novel graph model, which not only combines content and semantics similarities through two Markov chains, but also utilizes the interaction between different modalities to attain the whole semantics information of a multimedia document. Content similarity focuses on the original features within each modality, while semantics similarity focuses on the semantic vectors in a common space. Both of them are very significant. Random forests method is used to map the original features into a semantic space. The ranked list for a query is achieved by highlighting an optimal path across the corresponding chain. Experiments on the Wikipedia dataset show that the performance of our model significantly outperforms those of existing approaches for cross-modal retrieval.

show abstract

Efficient multi-modal retrieval in conceptual space

Cited by 7 publications

References 5 publications

Bridging Music and Image via Cross-Modal Ranking Analysis

Bridging Music and Image via Cross-Modal Ranking Analysis

Egocentric Video Search via Physical Interactions

A Graph Model for Cross-modal Retrieval

Contact Info

Product

Resources

About