Human perceptions of music and image are closely related, since both can inspire similar human sensations, such as emotion, motion, and power. This paper explores whether and how music and image can be automatically matched by machines. The main contributions are threefold. First, we construct a benchmark dataset of more than 45,000 music-image pairs. Human labelers are recruited to annotate whether these pairs are well matched, and the results show that labelers generally agree with each other on the matching degree of music-image pairs. Second, we investigate suitable semantic representations of music and image for this cross-modal matching task. In particular, we adopt lyrics as an intermediate medium connecting music and image, and design a set of lyric-based attributes for image representation. Third, we propose cross-modal ranking analysis (CMRA) to learn the semantic similarity between music and image from ranking labels. CMRA finds optimal embedding spaces for both music and image by maximizing the ordinal margin between music-image pairs. The proposed method can learn the non-linear relationship between music and image and integrate heterogeneous ranking data from different modalities into a unified space. Experimental results demonstrate that the proposed method outperforms state-of-the-art cross-modal methods on the music-image matching task and achieves a consistency rate of 91.5% with human labelers.
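The abstract does not spell out CMRA's formulation, so the following is only a minimal sketch of the general idea it describes: two nonlinear encoders project music and image into a shared space, trained so that a well-matched pair scores higher than a mismatched pair by an ordinal margin. It assumes PyTorch; the encoder architecture, feature dimensions, and margin value are illustrative rather than the paper's actual CMRA model.

```python
# Sketch: ranking-based cross-modal embedding (not the paper's exact CMRA).
import torch
import torch.nn as nn

MUSIC_DIM, IMAGE_DIM, EMBED_DIM = 128, 512, 64  # hypothetical feature sizes

class Encoder(nn.Module):
    """Nonlinear projection of one modality into the shared embedding space."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, x):
        # Unit-normalize so the dot product below is a cosine similarity.
        return nn.functional.normalize(self.net(x), dim=-1)

music_enc = Encoder(MUSIC_DIM, EMBED_DIM)
image_enc = Encoder(IMAGE_DIM, EMBED_DIM)
rank_loss = nn.MarginRankingLoss(margin=0.2)  # enforces the ordinal margin
optimizer = torch.optim.Adam(
    list(music_enc.parameters()) + list(image_enc.parameters()), lr=1e-3)

# One training step on a toy batch: each music clip comes with a well-matched
# image and a poorly matched image (as judged by the human labelers).
music = torch.randn(32, MUSIC_DIM)
img_pos = torch.randn(32, IMAGE_DIM)
img_neg = torch.randn(32, IMAGE_DIM)

m = music_enc(music)
s_pos = (m * image_enc(img_pos)).sum(dim=-1)
s_neg = (m * image_enc(img_neg)).sum(dim=-1)
# target = 1 means the matched pair should score higher than the mismatched one.
loss = rank_loss(s_pos, s_neg, torch.ones(32))

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In this sketch the ranking labels enter only through which image is treated as the positive of a pair; a fuller implementation would use the graded matching annotations directly.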
Human perceptions of music and image are highly correlated; both can inspire human sensations such as emotion and power. This paper investigates how to model the relationship between music and image using 47,888 music-image pairs extracted from music videos. We make two basic observations about this relationship: 1) the music space exhibits a simpler cluster structure than the image space, and 2) the relationship between the two spaces is complex and nonlinear. Based on these observations, we develop Multiple Ranking Canonical Correlation Analysis (MR-CCA) to learn this relationship. MR-CCA clusters the music-image pairs according to their music parts and then conducts Ranking CCA (R-CCA) within each cluster. Compared with classical CCA, R-CCA takes into account the pairwise ranking information available in our dataset. MR-CCA both improves performance and significantly reduces computational cost. Experimental results show that R-CCA outperforms CCA, and that MR-CCA performs best, with a consistency score of 84.52% with human labeling. The proposed method generalizes to modeling cross-media relationships and has potential applications in video generation, background music recommendation, and joint retrieval of music and image.
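To make the cluster-then-fit structure of MR-CCA concrete, here is a minimal sketch assuming scikit-learn and NumPy. Plain CCA stands in for the paper's Ranking CCA (which additionally uses the pairwise ranking labels), and the feature dimensions, cluster count, and component count are illustrative assumptions, not the paper's settings.

```python
# Sketch: cluster pairs by their music features, then fit one CCA per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
music = rng.normal(size=(1600, 20))   # hypothetical music features
image = rng.normal(size=(1600, 40))   # hypothetical image features

# 1) Cluster the pairs by their music part, since the music space is observed
#    to have the simpler cluster structure.
n_clusters = 8
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(music)

# 2) Fit one CCA per cluster; the collection of per-cluster linear maps yields
#    a piecewise-linear (overall nonlinear) music-image relationship.
models = {}
for k in range(n_clusters):
    idx = labels == k
    models[k] = CCA(n_components=8).fit(music[idx], image[idx])

# At query time, a new music clip would be routed to its nearest cluster and
# projected with that cluster's model before scoring candidate images.
```

Splitting the data this way also explains the reported drop in computational cost: each CCA is fit on a small subset rather than on all pairs at once.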