2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019)
DOI: 10.1109/fg.2019.8756609

Self-Supervised Learning of Face Representations for Video Face Clustering

Abstract: Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address increasing diversity in facial appearance. In this paper, we address video face clustering using unsupervised methods. Our emphasis is on distilling the essential information, identity, from the representations obtained us…

Cited by 33 publications (52 citation statements) | References 48 publications

“…Finally, it is worth noting the self-supervised learning works on "harvesting" training data from unlabeled sources for action recognition. Fernando et al [12] and Mishra et al [28] shuffle the video frames and treat them as positive/negative training data; Sharma et al [34] mines labels using a distance matrix based on similarity although for video face clustering; Wei et al [51] divides a single clip into non-overlapping 10-frame chunks, and then predict the ordering task; Ng et al [29] estimates optical flow while recognizing actions. We compare all these methods against our unsupervised future frame prediction based ConvNet training in the experimental section.…”
Section: Background and Related Work (mentioning)
confidence: 99%
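
The distance-matrix-based label mining attributed to Sharma et al. [34] above can be illustrated with a rough sketch: rank pairwise distances between face descriptors and take the nearest pairs as putative positives and the farthest as putative negatives. The function name, the neighbourhood size k, and the use of plain Euclidean distance below are assumptions made for illustration, not the cited method.

```python
import numpy as np

def mine_pairs_from_distances(features, k=5):
    # Hypothetical helper (name and parameters are assumptions): rank the
    # pairwise distance matrix over face descriptors and treat the nearest
    # pairs as putative positives, the farthest as putative negatives.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-pairs from the ranking

    positives, negatives = [], []
    for i in range(len(feats)):
        order = np.argsort(dists[i])
        positives.extend((i, j) for j in order[:k])           # closest -> positive pairs
        negatives.extend((i, j) for j in order[::-1][1:k + 1])  # farthest -> negative pairs
    return positives, negatives
```

Such mined pairs would then typically feed a metric-learning objective; the exact losses and thresholds used by the cited works are not reproduced here.
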
“…Most of these studies attempt to utilize redundant information of face sequences/sets to improve recognition performance, but not to learn discriminative features from sequence data. Recently, some approaches [34, 36-39] aim to learn deep video features for video face recognition. In [37], large-scale unlabeled face sequences are employed as the training data, but these sequence data are only utilized to learn transformations between image and video domains.…”
Section: Sequences In Face Recognition (mentioning)
confidence: 99%
“…In [37], large‐scale unlabeled face sequences are employed as the training data, but these sequence data are only utilized to learn transformations between image and video domains. [38] trained a self‐supervised Siamese network to obtain the features of a face cluster instead of a single face. [39] also aims to cluster faces in videos more accurately.…”
Section: Related Work (mentioning)
confidence: 99%
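
As a rough illustration of the Siamese-style training described for [38], the sketch below passes two face descriptors through a shared embedding network and applies a contrastive loss. The class name, layer sizes, and margin are placeholder assumptions for illustration, not the cited architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEmbedder(nn.Module):
    # Placeholder network (dimensions are assumptions, not the cited model):
    # both branches share these weights and map a face descriptor to an embedding.
    def __init__(self, in_dim=256, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(z1, z2, same_identity, margin=1.0):
    # same_identity is a 0/1 tensor: pull matching pairs together,
    # push non-matching pairs apart up to the margin.
    d = (z1 - z2).pow(2).sum(dim=-1).sqrt()
    pos = same_identity * d.pow(2)
    neg = (1.0 - same_identity) * torch.clamp(margin - d, min=0).pow(2)
    return (pos + neg).mean()
```
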
“…Ignoring tracks, metrics are learned by ranking a batch of frames and creating hard positive and negative pairs [39]. However, all of the above methods require knowledge of the number of clusters K that is difficult to estimate beforehand; and only consider primary characters (tracks for background characters are ignored).…”
Section: Related Work (mentioning)
confidence: 99%
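
The batch ranking with hard positive and negative pairs mentioned in the last statement can be sketched as a batch-hard selection step followed by a triplet-style margin loss. The sketch below assumes pseudo-labels (e.g. face-track IDs) are available; the function names and margin value are illustrative, not taken from [39].

```python
import torch

def batch_hard_pairs(embeddings, labels):
    # Sketch of batch-hard selection: for each anchor pick the farthest
    # same-label sample (hard positive) and the closest different-label
    # sample (hard negative). 'labels' are pseudo-labels such as track IDs.
    d = torch.cdist(embeddings, embeddings)  # pairwise distances in the batch
    same = labels[:, None].eq(labels[None, :])
    pos_d = d.masked_fill(~same, float('-inf')).max(dim=1).values  # hardest positive
    neg_d = d.masked_fill(same, float('inf')).min(dim=1).values    # hardest negative
    # (the diagonal self-pair has distance 0 and never wins the positive max)
    return pos_d, neg_d

def triplet_margin_loss(pos_d, neg_d, margin=0.2):
    # Penalize anchors whose hardest positive is not at least `margin`
    # closer than their hardest negative.
    return torch.clamp(pos_d - neg_d + margin, min=0).mean()
```
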