2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00513

Video Face Clustering With Unknown Number of Clusters

Abstract: Understanding videos such as TV series and movies requires analyzing who the characters are and what they are doing. We address the challenging problem of clustering face tracks based on their identity. Different from previous work in this area, we choose to operate in a realistic and difficult setting where: (i) the number of characters is not known a priori; and (ii) face tracks belonging to minor or background characters are not discarded. To this end, we propose Ball Cluster Learning (BCL), a supervised app…

Cited by 51 publications (43 citation statements)
References 49 publications
“…Thus, they can generalize well on the unseen videos. Based on the pretrained models, many interesting downstream tasks have been explored, such as deep clustering [30,33], face clustering [34,61], person search [67], person clustering [4], as well as speaker diarization [10,41]. The above verification models are all uni-modal.…”
Section: Related Work (mentioning)
confidence: 99%
“…Backbone. Pretrained on large-scale datasets, speaker verification, and face recognition models have strong generalization ability and are directly applied in various downstream tasks [4,10,30,33,34,41,61,67]. We follow these works and utilize the backbones of off-the-shelf models [8,17] to encode voice and face features respectively.…”
Section: Audio-visual Relation Network (mentioning)
confidence: 99%