“…However, because face and voice representations are usually not aligned, in prior work the query face cannot be directly compared to the audio track, necessitating complex fusion systems to combine information from both modalities. For example, [9] apply clustering to face-tracks and diarised speaker segments after a round of human annotation of both, [36] use confidence labels from one modality to provide supervision for the other, and [47] fuse the outputs of a face recognition model and a clothing model with a GMM-based speaker model. With a joint embedding, however, the query face image can be compared directly to the audio track, leading to an extremely simple solution which we describe below.…”
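To make concrete why the joint embedding yields such a simple solution, the sketch below scores diarised audio segments against a query face by cosine similarity in the shared space. The function and variable names are hypothetical, and it assumes that face and voice sub-networks (not shown) have already mapped both modalities into the same embedding space.

```python
import numpy as np

def cosine_scores(query_face_emb, audio_segment_embs):
    """Score each audio-segment embedding against a query face embedding.

    Assumes both modalities live in the same joint space, so after
    L2-normalisation the dot product equals the cosine similarity.
    """
    q = query_face_emb / np.linalg.norm(query_face_emb)
    A = audio_segment_embs / np.linalg.norm(audio_segment_embs, axis=1, keepdims=True)
    return A @ q

# Hypothetical usage: rank diarised speech segments by similarity to the query face.
d = 256                              # embedding dimensionality (illustrative)
query_face = np.random.randn(d)      # stand-in for a face-subnetwork output
segments = np.random.randn(10, d)    # stand-ins for voice-subnetwork outputs
scores = cosine_scores(query_face, segments)
ranking = np.argsort(-scores)        # best-matching segments first
```

Because retrieval reduces to a nearest-neighbour search over a single score, no cross-modal fusion machinery of the kind used in the prior work above is required.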