“…However, because face and voice representations are usually not aligned, in prior work the query face cannot be directly compared to the audio track, necessitating complex fusion systems to combine information from both modalities. For example, [9] apply clustering to face-tracks and diarised speaker segments after a round of human annotation of both, [36] use confidence labels from one modality to provide supervision for the other, and [47] fuse the outputs of a face recognition model and a clothing model with a GMM-based speaker model. With a joint embedding, however, the query face image can be compared directly to the audio track, leading to an extremely simple solution which we describe below.…”
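To make concrete why the joint embedding yields such a simple solution, the sketch below scores diarised audio segments against a query face by cosine similarity in the shared space. The function and variable names are hypothetical, and it assumes that face and voice sub-networks (not shown) have already mapped both modalities into the same embedding space.

```python
import numpy as np

def cosine_scores(query_face_emb, audio_segment_embs):
    """Score each audio-segment embedding against a query face embedding.

    Assumes both modalities live in the same joint space, so after
    L2-normalisation the dot product equals the cosine similarity.
    """
    q = query_face_emb / np.linalg.norm(query_face_emb)
    A = audio_segment_embs / np.linalg.norm(audio_segment_embs, axis=1, keepdims=True)
    return A @ q

# Hypothetical usage: rank diarised speech segments by similarity to the query face.
d = 256                              # embedding dimensionality (illustrative)
query_face = np.random.randn(d)      # stand-in for a face-subnetwork output
segments = np.random.randn(10, d)    # stand-ins for voice-subnetwork outputs
scores = cosine_scores(query_face, segments)
ranking = np.argsort(-scores)        # best-matching segments first
```

Because retrieval reduces to a nearest-neighbour search over a single score, no cross-modal fusion machinery of the kind used in the prior work above is required.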