2018 IEEE International Symposium on Multimedia (ISM) 2018
DOI: 10.1109/ism.2018.00-21

Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA

Abstract: Deep learning has successfully shown excellent performance in learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where temporal structures of different data modalities, such as audio and video, should be taken into account. Music video retrieval by a given musical audio is a natural way to search and interact with music contents. In this work, we study cross-modal music video retrieval in terms of emotion similarity. Part…

Cited by 37 publications (23 citation statements)
References 21 publications
“…Cross-modal recognition: Cross-modal recognition approaches using embedding have attracted much attention as a technique that can perform effective bidirectional recognition between different modalities (e.g., image, text and audio). Related to audio processing, some researchers explored cross-modal recognition between audio and image [42], [43] and the one between audio and text (lyrics) [44]. But, to the best of our knowledge, no existing work addresses crossmodal recognition between audio and emotion except our previous study [45], where MultiLayer Perceptrons (MLPs) based on CCA loss are used to compute music and emotion embeddings.…”
Section: Related Work (mentioning)
confidence: 99%
“…Supervised approaches: In the case of supervised learning, the matching criterion that associates the audio and video modalities is deduced from additional sources of information. Typically, mood tags [26] [33] or projections into the valence-arousal plane [23] can be used to recommend musics and videos that have a similar emotional content. The use of mood information accelerates the training, and allows the systems to reach promising retrieval performances.…”
Section: A. Music-Video Embeddings (mentioning)
confidence: 99%
“…Examples of systems for music recommendation given video as input are [11], [12], [23], [26], [27]. In a symmetrical way, examples of systems for video recommendation given audio as input are [13], [33].…”
Section: B. Usages of Music-Video Embeddings (mentioning)
confidence: 99%
“…Zheng et al implemented a cross-modal of an audio-video embedding algorithm through Supervised Deep Canonical Correlation Analysis (S-DCCA) [101]. In this model, audio and video are projected into a shared area to address the semantic distance between audio and video.…”
mentioning
confidence: 99%