2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI)
DOI: 10.1109/cbmi.2014.6849849
Automatic propagation of manual annotations for multimodal person identification in TV shows

Abstract: This paper proposes an approach to propagating human annotations for person identification in a multimodal context. The system combines speaker diarization and face clustering to produce multimodal clusters. Whole multimodal clusters, rather than single tracks, are then annotated, which is done by propagation. An optical character recognition system provides the initial annotations. Four different strategies for selecting annotation candidates are tested. The initial results …
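A minimal sketch of how such cluster-level propagation could work, assuming hypothetical data structures (multimodal clusters of face/speaker tracks, OCR name hypotheses per track); the function and variable names below are illustrative and not taken from the paper.

# Hedged sketch of cluster-level annotation propagation; names are assumptions.
from collections import Counter

def propagate_annotations(clusters, ocr_labels):
    """Assign one label per multimodal cluster and propagate it to all tracks.

    clusters   -- dict mapping cluster_id -> list of track ids
                  (tracks come from speaker diarization and face clustering)
    ocr_labels -- dict mapping track_id -> name read by OCR from on-screen text
    """
    track_labels = {}
    for cluster_id, tracks in clusters.items():
        # Collect the OCR hypotheses available for any track in the cluster.
        votes = Counter(ocr_labels[t] for t in tracks if t in ocr_labels)
        if not votes:
            continue  # no initial annotation; a human would label this cluster
        label, _ = votes.most_common(1)[0]
        # One annotation covers the whole cluster rather than a single track.
        for t in tracks:
            track_labels[t] = label
    return track_labels

Annotating at the cluster level means a single decision, whether OCR-derived or human, covers every track in the cluster, which is what makes propagation cheaper than per-track labeling.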

Cited by 4 publications (3 citation statements) · References 11 publications
“…This is difficult because of the significant visual variation of character appearances in a TV show caused by pose, illumination, size, expression and occlusion, which can often exceed the variation due to identity. Recently there has been growing interest in using the audio track, which comes for free with multimedia videos, to aid identification [9, 36, 47]. However, because face and voice representations are usually not aligned, in prior work the query face cannot be directly compared to the audio track, necessitating complex fusion systems to combine information from both modalities.…”
Section: One-shot Learning for TV Show Character Retrieval (mentioning)
Confidence: 99%
“…However, because face and voice representations are usually not aligned, in prior work the query face cannot be directly compared to the audio track, necessitating complex fusion systems to combine information from both modalities. For example, [9] cluster face-tracks and diarised speaker segments after a round of human annotation for both, [36] use confidence labels from one modality to provide supervision for the other modality, and [47] fuse the outputs of a face recognition model and a clothing model with a GMM-based speaker model. With a joint embedding, however, the query face image can be compared directly to the audio track, leading to an extremely simple solution which we describe below.…”
Section: One-shot Learning for TV Show Character Retrieval (mentioning)
Confidence: 99%
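A minimal sketch of the joint-embedding retrieval idea this statement describes, assuming hypothetical pretrained face_encoder and voice_encoder callables that map both modalities into a shared space (these are assumptions, not the cited models):

# Hedged sketch of cross-modal retrieval in a joint face-voice embedding space.
import numpy as np

def rank_speech_segments(query_face, segments, face_encoder, voice_encoder):
    """Rank diarized speech segments by cosine similarity to a query face image."""
    q = face_encoder(query_face)
    q = q / np.linalg.norm(q)
    scores = []
    for seg in segments:
        v = voice_encoder(seg)
        v = v / np.linalg.norm(v)
        scores.append(float(q @ v))  # direct face-to-audio comparison, no fusion stage
    # Highest similarity first.
    return sorted(range(len(segments)), key=lambda i: -scores[i])

Because both encoders target the same space, a single dot product replaces the multi-stage fusion systems used in the earlier work cited above.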
“…In Bazillon et al. (2008), the authors showed that correcting the output of an automatic speech transcription system reduces annotation time. An active learning method proposed in Budnik et al. (2014), used in conjunction with automatic speaker and face recognition systems, further reduces the number of human-machine interactions. Recently, in Broux et al. (2016), we proposed a system that assists speaker diarization (SRL) and reduces the number of human interventions.…”
Section: Previous Work (unclassified)
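The four candidate-selection strategies are not spelled out in this excerpt; as one hedged illustration only, an uncertainty-sampling strategy would hand the annotator the cluster whose automatic label is least confident:

# Hedged sketch of one possible active-learning selection strategy
# (uncertainty sampling); the strategies actually tested in Budnik et al.
# (2014) are not specified here, so this is illustrative only.

def select_cluster_to_annotate(clusters, confidence):
    """Pick the cluster whose current automatic label is least confident.

    clusters   -- iterable of cluster ids
    confidence -- dict mapping cluster_id -> confidence of its automatic label
    """
    return min(clusters, key=lambda c: confidence.get(c, 0.0))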