2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021
DOI: 10.1109/cvpr46437.2021.01608

Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association

Abstract: Nowadays, we have witnessed the early progress on learning the association between voice and face automatically, which brings a new wave of studies to the computer vision community. However, most of the prior arts along this line (a) merely adopt local information to perform modality alignment and (b) ignore the diversity of learning difficulty across different subjects. In this paper, we propose a novel framework to jointly address the above-mentioned issues. Targeting at (a), we propose a two-level modality …

Cited by 27 publications (9 citation statements)
References 27 publications
“…Later studies take into consideration the interactions among different modalities. For instance, Wen et al. [47] devised a two-level loss, which leverages both local and global features on modality alignment. ADSM [6] adopts an adversarial matching network to extract the high-level semantical features.…”
Section: Cross-modal Biometric Matching (mentioning)
confidence: 99%
“…Cross-Modal Face Matching [25,30,34,59,67] covers tasks where voices are used as queries to retrieve faces or vice versa. These tasks are inherently selection problems in which the best fit of a voice-face pair from the dataset is desired.…”
Section: Audio-visual Learning (mentioning)
confidence: 99%
“…This work studies to what extent voice can hint face geometry motivated by recent studies on voice-face matching and cross-modal learning [30,59,67]. Many physiological attributes are embedded in voices.…”
Section: Introduction (mentioning)
confidence: 99%
“…In unsupervised representation learning task with imbalanced data, [33], [34] identified hard-to-memory samples from tail classes by different outputs between the network and its pruned version. Some works used a threshold of the loss to distinguish atypical data [35] or not well-learned data [36] in voice-face mapping and semi-supervised learning, respectively. Here, we mainly focus on the heterogeneous data created by DA and follow the idea of noise distribution modeling to separate the ID and DAOOD samples.…”
Section: B. Heterogeneous Data Separating (mentioning)
confidence: 99%