2018
DOI: 10.48550/arxiv.1807.04836
Preprint

Disjoint Mapping Network for Cross-modal Matching of Voices and Faces

Abstract: We propose a novel framework, called Disjoint Mapping Network (DIMNet), for cross-modal biometric matching, in particular of voices and faces. Unlike existing methods, DIMNet does not explicitly learn the joint relationship between the modalities. Instead, DIMNet learns a shared representation for the different modalities by mapping them individually to their common covariates. These shared representations can then be used to find the correspondences between the modalities. We show empirically that DIMNet…
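The mapping described in the abstract is straightforward to sketch. Below is a minimal, hedged PyTorch sketch of the DIMNet idea as stated above: each modality gets its own encoder into a shared embedding space, and supervision comes only from classifiers over common covariates, never from explicit voice-face pairs. All names, layer sizes, and covariate choices here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the DIMNet idea described in the abstract (assumed names,
# layer sizes, and shapes; NOT the authors' implementation). Each modality has
# its own encoder into a shared embedding space, and supervision comes only
# from classifiers over common covariates (e.g., identity, gender,
# nationality), never from explicit voice-face pairs.
import torch
import torch.nn as nn

EMBED_DIM = 128  # assumed embedding size


class Encoder(nn.Module):
    """Stand-in encoder; the paper uses modality-specific networks."""

    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, EMBED_DIM),
        )

    def forward(self, x):
        return self.net(x)


class DIMNetSketch(nn.Module):
    def __init__(self, voice_dim: int, face_dim: int, covariate_sizes):
        super().__init__()
        self.voice_enc = Encoder(voice_dim)  # disjoint mapping for voices
        self.face_enc = Encoder(face_dim)    # disjoint mapping for faces
        # One classifier per covariate, shared across both modalities.
        self.heads = nn.ModuleList(nn.Linear(EMBED_DIM, n) for n in covariate_sizes)

    def forward(self, x, modality: str):
        z = self.voice_enc(x) if modality == "voice" else self.face_enc(x)
        return z, [head(z) for head in self.heads]
```

Each modality is trained independently against its covariate labels; at test time only the embeddings z are compared across modalities.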

Cited by 10 publications (7 citation statements) · References 19 publications
Citation types: 0 supporting, 7 mentioning, 0 contrasting · Citing years: 2018–2024
“…As the sample size increases, the accuracy decreases excessively.
Wen et al [42]. Strengths: the correlation between modes is utilized. Weaknesses: dataset acquisition is difficult.
Voice-Face Matching:
Wang et al [55]. Strengths: can deal with multiple samples; can change the size of the input. Weaknesses: static image only; model complexity.
Hoover et al [45]. Strengths: easy to implement; robust; efficient…”
Section: Methods, Ideas and Strengths/Weaknesses (mentioning, confidence: 99%)
“…However, the correlation between the two modalities was not fully utilized in the above methods. Therefore, Wen et al [42] proposed the Disjoint Mapping Network (DIMNet) to fully use covariates (e.g., gender and nationality) [43,44] to bridge the relation between voice and face information. The intuitive assumption is that, for a given voice and face pair, the more covariates are shared between the two modalities, the higher the probability of a match.…”
Section: Voice-Face Matching (mentioning, confidence: 99%)
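The covariate supervision this statement describes reduces to per-modality cross-entropy. As a hedged illustration, reusing the hypothetical DIMNetSketch from the sketch above (the step function and its argument names are assumptions), one training step might look like:

```python
# Hedged sketch of the covariate-supervised training step this statement
# describes: plain cross-entropy per covariate, applied to one modality at a
# time. Reuses the hypothetical DIMNetSketch from the sketch above; argument
# names are illustrative assumptions.
import torch.nn.functional as F


def covariate_step(model, x, covariate_labels, modality, optimizer):
    """covariate_labels: list of LongTensors, one per covariate head
    (e.g., [identity_ids, gender_ids, nationality_ids])."""
    _, logits_per_covariate = model(x, modality)
    loss = sum(F.cross_entropy(logits, y)
               for logits, y in zip(logits_per_covariate, covariate_labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```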
“…Kim et al [25] introduce a triplet loss to learn overlapping information between faces and voices, using VGG16 [26] and SoundNet [27] for the visual and auditory modalities, respectively. Wen et al [28] propose DIMNet, which leverages identity-sensitive factors, such as nationality and gender, as supervision signals to learn a shared representation for the different modalities. Based on the strong association between faces and voices, we propose to utilize face embeddings to guide models in tracking the desired auditory output.…”
Section: Related Work (mentioning, confidence: 99%)
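For contrast with DIMNet's covariate supervision, the metric-learning alternative mentioned here (the triplet loss of Kim et al [25]) can be sketched as follows; the cosine distance, margin value, and tensor shapes are assumptions for illustration, not the cited paper's exact formulation.

```python
# Illustrative cross-modal triplet objective in the spirit of Kim et al [25];
# the cosine distance, margin value, and tensor shapes (batch, dim) are
# assumptions for illustration, not the cited paper's exact formulation.
import torch
import torch.nn.functional as F


def cross_modal_triplet_loss(face_anchor, voice_pos, voice_neg, margin=0.2):
    """Pull the matching voice toward its face; push a mismatched voice away."""
    d_pos = 1.0 - F.cosine_similarity(face_anchor, voice_pos)  # matched pair
    d_neg = 1.0 - F.cosine_similarity(face_anchor, voice_neg)  # mismatched pair
    return F.relu(d_pos - d_neg + margin).mean()
```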
“…The associations between faces and speech have been widely studied in recent years. Cross-modal matching methods based on classification [21,22,23,24] and metric learning [25,26] have been adopted for identity verification and retrieval. Cross-modal features extracted from faces and speech have been applied to disambiguate voiced and unvoiced consonants [27,28], to track the active speakers in a video [29,30], and to predict emotion [31] and lip motions [28,32] from speech.…”
Section: Related Work (mentioning, confidence: 99%)
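Whichever objective is used, the verification and retrieval protocols mentioned in these statements ultimately compare embeddings across modalities. A minimal sketch of the common 1:2 forced-choice matching test follows; the tensor shapes and names are assumptions, and any of the models sketched above could produce the embeddings.

```python
# Sketch of the common 1:2 forced-choice matching protocol used to evaluate
# these models: given a voice embedding, pick which of two face embeddings
# belongs to the same person. Tensors are assumed (batch, dim) embeddings
# produced by any of the models above; names are illustrative.
import torch
import torch.nn.functional as F


def match_one_of_two(voice, face_a, face_b):
    """Return 0 where face_a matches the voice better, 1 where face_b does."""
    sim_a = F.cosine_similarity(voice, face_a, dim=-1)
    sim_b = F.cosine_similarity(voice, face_b, dim=-1)
    return (sim_b > sim_a).long()  # accuracy = (prediction == truth).float().mean()
```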