We show quite good face clustering is possible for a dataset of inaccurately and ambiguously labelled face images. Our dataset is 44,773 face images, obtained by applying a face finder to approximately half a million captioned news images. This dataset is more realistic than usual face recognition datasets, because it contains faces captured "in the wild" in a variety of configurations with respect to the camera, taking a variety of expressions, and under illumination of widely varying color. Each face image is associated with a set of names, automatically extracted from the associated caption. Many, but not all such sets contain the correct name.We cluster face images in appropriate discriminant coordinates. We use a clustering procedure to break ambiguities in labelling and identify incorrectly labelled faces. A merging procedure then identifies variants of names that refer to the same individual. The resulting representation can be used to label faces in news images or to organize news pictures by individuals present.An alternative view of our procedure is as a process that cleans up noisy supervised data. We demonstrate how to use entropy measures to evaluate such procedures.
We discuss the properties of a collection of news photos and captions, collected from the Associated Press and Reuters. Captions have a vocabulary dominated by proper names. We have implemented various text clustering algorithms to organize these items by topic, as well as an iconic matcher that identifies articles that share a picture. We have found that the special structure of captions allows us to extract some names of people actually portrayed in the image quite reliably, using a simple syntactic analysis. We have been able to build a directory of face images of individuals from this collection.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.