2018
DOI: 10.48550/arxiv.1805.00833
Preprint
Learnable PINs: Cross-Modal Embeddings for Person Identity

Abstract: We propose and investigate an identity sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining targeted to this task that is essential for learning to proc…
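The abstract above outlines a joint face-voice embedding trained from talking-face videos with cross-modal self-supervision and a curriculum of hard negative mining. The sketch below is not the authors' code: it is a minimal, hypothetical PyTorch illustration in which a face frame and a voice segment from the same video form a positive pair, other videos in the batch supply negatives, and a "hardness" knob stands in for the curriculum schedule; all architectures, sizes, and names are assumptions for illustration only.

# Hypothetical sketch of cross-modal self-supervised face/voice embedding
# training with curriculum-style hard negative mining (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
    def forward(self, x):
        # L2-normalised embedding so cosine similarity is a dot product.
        return F.normalize(self.net(x), dim=-1)

class VoiceEncoder(nn.Module):
    def __init__(self, n_mels=40, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, dim),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_step(face_enc, voice_enc, faces, voices, hardness=0.0, margin=0.6):
    """One training step. faces[i] and voices[i] come from the same video
    (positive pair, no identity labels needed); every other index is a
    candidate negative. hardness in [0, 1] moves from a random in-batch
    negative (easy) towards the hardest in-batch negative, standing in for
    a curriculum over negative difficulty."""
    f = face_enc(faces)            # (B, D)
    v = voice_enc(voices)          # (B, D)
    sim = f @ v.t()                # cosine similarities, (B, B)
    pos = sim.diag()

    # Mask out the positive pair, then mix a random negative with the
    # hardest (most similar) one according to the curriculum knob.
    B = sim.size(0)
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)
    neg_sims = sim.masked_fill(mask, float('-inf'))
    hard_neg = neg_sims.max(dim=1).values
    rand_idx = torch.randint(0, B - 1, (B,), device=sim.device)
    rand_idx = rand_idx + (rand_idx >= torch.arange(B, device=sim.device)).long()
    rand_neg = sim[torch.arange(B), rand_idx]
    neg = hardness * hard_neg + (1.0 - hardness) * rand_neg

    # Margin loss: push positives above negatives by at least `margin`.
    return F.relu(margin - pos + neg).mean()

if __name__ == "__main__":
    face_enc, voice_enc = FaceEncoder(), VoiceEncoder()
    faces = torch.randn(8, 3, 112, 112)   # toy face crops
    voices = torch.randn(8, 40, 300)       # toy log-mel voice segments
    for step, hardness in enumerate([0.0, 0.5, 1.0]):  # toy curriculum
        loss = contrastive_step(face_enc, voice_enc, faces, voices, hardness)
        loss.backward()
        print(f"step {step}: loss={loss.item():.3f}")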

Cited by 3 publications (5 citation statements)
References 46 publications
“…We find that in all of our tests for which similar results have been reported by other researchers [14,13,6], our embeddings achieve comparable or better performance than those previously reported. We find that of all the covariates, ID provides the strongest supervision.…”
Section: Introduction (supporting)
confidence: 92%
“…This problem has seen significant research interest, in particular since the recent introduction of the VoxCeleb corpus [15], which comprises collections of video and audio recordings of a large number of celebrities. The existing approaches [14,13,6] have generally attempted to directly relate subjects' voice recordings and their face images, in order to find the correspondences between the two. Nagrani et al. [14] formulate the mapping as a binary selection task: given a voice recording, one must successfully select the speaker's face from a pair of face images (or the reverse: given a face image, one must correctly select the subject's voice from a pair of voice recordings).…”
Section: Introduction (mentioning)
confidence: 99%
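As context for the binary selection formulation quoted above, here is a hedged sketch, assuming face and voice embeddings already live in the learned joint space; names and dimensions are illustrative, not taken from the paper. Given a voice embedding and two candidate face embeddings, the face with the higher cosine similarity is selected, and accuracy is the fraction of test pairs for which the true face wins.

# Hypothetical sketch of the voice-to-face binary selection evaluation.
import torch
import torch.nn.functional as F

def select_face(voice_emb, face_a, face_b):
    """Return 0 if face_a matches the voice better, 1 otherwise."""
    sims = torch.stack([
        F.cosine_similarity(voice_emb, face_a, dim=-1),
        F.cosine_similarity(voice_emb, face_b, dim=-1),
    ])
    return int(sims.argmax())

# Toy usage with random embeddings in place of real model outputs.
v, fa, fb = torch.randn(3, 256).unbind(0)
print(select_face(v, fa, fb))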
“…Subsequently, these ideas were extended to learning with either image+text or image+sound pairs [68]. Audio-visual representation learning has seen applications in several tasks such as event classification [40], audiovisual localization [40,2], biometric matching [35], sound localization [37,2], person identification [34], action recognition [37], on/off-screen audio separation [12,37], and video captioning/description [64,36,20].…”
Section: Related Work (mentioning)
confidence: 99%
“…[25] formulates this task as an N-way classification problem. [24,14,38] propose to learn common embeddings for the cross-modal inputs, such that the matching can be performed using the learned embeddings. All of these, however, are essentially selection problems.…”
Section: Related Work (mentioning)
confidence: 99%