<p>Facial kinship verification refers to automatically determining whether
two people have a kin relation from their faces. It has become a popular
research topic due to potential practical applications, such as finding missing
children, family photo organization, or criminal investigations. Over the past
decade, most efforts have focused on improving verification performance from
faces alone, without using other biometric information such as the speaking voice. In this paper, to
interpret and benefit from multiple modalities, we propose for the first time
to combine human faces and voices to verify kinship, which we refer to as audio-visual
kinship verification. Since there is still no standard, public audio-visual kinship
dataset, we first establish a comprehensive audio-visual kinship dataset, called
TALKIN-Family, that consists of videos of family members' talking faces recorded under various scenarios. Based on this dataset, we present
an extensive evaluation of kinship verification from faces and voices. In
particular, we propose a deep learning-based fusion method, named Unified
Adaptive Adversarial Multimodal Learning (UAAML). It consists of an
adversarial network and an attention module built on unified
multimodal features. The modality adversarial learning eliminates cross-modality variations by confusing
the discriminator, while the attention module quantifies the importance of
kinship-relevant features. The overall multimodal fusion network is trained in a
Siamese fashion to encourage compactness of kin pairs and separation of
non-kin pairs. Experiments show that audio (voice) information is complementary
to facial features and useful for the kinship verification problem. Further,
the proposed fusion method outperforms the baseline methods. In addition, we
evaluate human kinship verification ability on a subset of TALKIN-Family.
The results indicate that humans achieve higher accuracy when they have access to both faces
and voices. The machine learning methods can nevertheless outperform humans
effectively and efficiently. Finally, we discuss future work and research
opportunities with the TALKIN-Family dataset.</p>