Analyzing frame sequences in talk show videos, which is necessary for media mining and television production, requires significant manual effort and is very time-consuming. Given the vast number of unlabeled face frames in talk show videos, we address the problem of recognizing and clustering faces. In this paper, we propose a TV media mining system based on a deep convolutional neural network trained by minimizing a triplet loss. The system indexes and clusters video data to support effective media production analysis of individuals in talk show videos and to rapidly identify a specific individual in video data in real time. Our system uses several face datasets from Labeled Faces in the Wild (LFW), which is a collection of unlabeled web face images, as well as the YouTube Faces and talk show faces datasets. In the recognition (person spotting) task, our system achieves an F-measure of 0.996 on the collection of unlabeled web face images and an F-measure of 0.972 on the talk show faces dataset. In the clustering task, our system achieves F-measures of 0.764 and 0.935 on the YouTube Faces database and the LFW dataset, respectively, and an F-measure of 0.832 on the talk show faces dataset, improvements of 5.4%, 6.5%, and 8.2% over previous methods.
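The abstract does not spell out the exact loss formulation, so the following is only a minimal sketch of the commonly used margin-based triplet loss on face embeddings, assuming squared Euclidean distance. The function names and the margin value of 0.2 are illustrative assumptions, not the authors' settings.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss for a single triplet (illustrative sketch).

    anchor and positive are embeddings of the same identity; negative is an
    embedding of a different identity. The loss is zero once the negative is
    farther from the anchor than the positive by at least `margin`.
    The default margin is an assumed value, not taken from the paper.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# A well-separated triplet contributes no loss; a violated one is penalized.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Minimizing this quantity over many triplets pulls embeddings of the same person together and pushes different people apart, which is what makes both nearest-neighbor recognition and clustering of the embeddings feasible.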
In this paper, we propose a novel layout of cameras atop a moving robot to obtain its ego-motion. In particular, we use three cameras in a perpendicular setting. Compared with, for example, collinear settings, this layout offers a better opportunity to study the trade-off between the accuracy of the tracked features and a larger field of view. We show, using both real experiments and synthetic data, that treating the three cameras as a triple is more advantageous when the fields of view of the cameras change slowly. In this case, the triple not only provides more accurate features to track but also leads to a more accurate estimate of their 3D construction. In contrast, for pure rotations the fields of view change rapidly, which makes it advantageous to treat the three cameras as two stereo pairs with a larger field of view. The extended Kalman filter (EKF) is our real-time estimator of the robot pose.
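The abstract names the EKF as the pose estimator but gives no state or measurement model. As a rough illustration of the predict/update cycle only, here is a scalar EKF sketch under assumed models: a robot moving along a line with position `x`, an additive motion input `u`, and a range measurement to a hypothetical landmark at lateral offset `d`. None of these names or models come from the paper, whose filter estimates a full pose from camera features.

```python
import math


def ekf_step(x, P, u, z, Q, R, d=1.0):
    """One predict/update cycle of a scalar EKF (toy model, assumptions only).

    x, P : prior state estimate and its variance
    u    : commanded displacement (assumed motion model: x' = x + u)
    z    : measured range to a landmark at lateral offset d (hypothetical)
    Q, R : process and measurement noise variances
    """
    # Predict: the motion model is linear, so its Jacobian is 1.
    x_pred = x + u
    P_pred = P + Q

    # Nonlinear measurement model h(x) = sqrt(x^2 + d^2), linearized at x_pred.
    h = math.hypot(x_pred, d)
    H = x_pred / h  # dh/dx evaluated at the predicted state

    # Update: Kalman gain, state correction, variance reduction.
    S = H * P_pred * H + R
    K = P_pred * H / S
    x_new = x_pred + K * (z - h)
    P_new = (1.0 - K * H) * P_pred
    return x_new, P_new
```

For instance, `ekf_step(3.0, 1.0, 1.0, 5.0, Q=0.1, R=0.5, d=3.0)` predicts a position of 4, for which the expected range is exactly 5, so the state is left at the prediction while the variance shrinks.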