2020
DOI: 10.1609/aaai.v34i01.5339

Semi-Supervised Multi-Modal Learning with Balanced Spectral Decomposition

Abstract: Cross-modal retrieval aims to retrieve relevant samples across different modalities; the key problem is how to model the correlations among the modalities while narrowing the large heterogeneous gap between them. In this paper, we propose a Semi-supervised Multimodal Learning Network (SMLN) method, which correlates different modalities by capturing the intrinsic structure and discriminative correlation of the multimedia data. To be specific, the labeled and unlabeled data are used to construct a similarity…
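The abstract is truncated above. As a rough illustration of the general idea only (a minimal sketch, not the authors' implementation), the code below assumes a cross-modal similarity matrix built from both labeled pairs and unlabeled feature similarities, followed by a spectral decomposition of its normalized graph Laplacian to obtain a shared low-dimensional embedding. All function names, the Gaussian-kernel choice for unlabeled pairs, and the -1 convention for unlabeled samples are assumptions made here for illustration.

```python
import numpy as np

def build_similarity(img_feats, txt_feats, labels=None, sigma=1.0):
    """Construct a cross-modal similarity matrix W.

    Labeled pairs (label >= 0) sharing a class get similarity 1;
    otherwise a Gaussian kernel on the cross-modal feature distance
    is used. This is an illustrative construction, not the paper's.
    """
    n = img_feats.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if labels is not None and labels[i] >= 0 and labels[j] >= 0:
                W[i, j] = 1.0 if labels[i] == labels[j] else 0.0
            else:
                d = np.linalg.norm(img_feats[i] - txt_feats[j])
                W[i, j] = np.exp(-d ** 2 / (2 * sigma ** 2))
    return W

def spectral_embedding(W, dim=16):
    """Shared latent space from the eigenvectors of the normalized
    graph Laplacian of W (standard spectral decomposition)."""
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L_sym = np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt
    _, eigvecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    return eigvecs[:, :dim]              # smallest eigenvalues first
```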

Cited by 17 publications (1 citation statement) | References 30 publications
“…Similarly, to combine audio and visual modalities for unsupervised learning, existing works exploit the natural audio-visual correspondence in videos to formulate various self-supervised signals, which predict the cross-modal correspondence [314], [315], align the temporally corresponding representations [309], [316], [317], [318], or cluster their representations in a shared audio-visual latent space [208], [319]. Several works further explore audio, vision and language together for unsupervised representation learning by aligning different modalities in a shared multi-modal latent space [310], [320] or in a hierarchical latent space for audio-vision and vision-language [308]. Open Challenges…”
Section: Multi-modal Learning From Unlabeled Data
Citation type: mentioning; confidence: 99%
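The correspondence-prediction and alignment objectives mentioned in this excerpt are commonly instantiated as a contrastive loss between paired audio and visual embeddings. The sketch below is a generic InfoNCE-style formulation of such an objective; the function name, temperature value, and exact loss are assumptions for illustration, not taken from the cited works.

```python
import torch
import torch.nn.functional as F

def audio_visual_nce(audio_emb, video_emb, temperature=0.07):
    """Contrastive alignment of temporally corresponding audio/visual clips:
    matching (audio_i, video_i) pairs are pulled together and mismatched
    pairs pushed apart. A generic formulation, not the cited works' loss."""
    a = F.normalize(audio_emb, dim=-1)   # (batch, dim)
    v = F.normalize(video_emb, dim=-1)   # (batch, dim)
    logits = a @ v.t() / temperature     # pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric loss: audio-to-video and video-to-audio retrieval
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```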