Social behavioral biometrics investigates social interactions to determine a person's identity. Within this discipline, recognition of individuals based on their aesthetic preferences is an emerging research direction. Human aesthetics are a soft behavioral biometric trait reflecting a person's attitudes towards particular subject material. Recent developments in aesthetic-based biometric systems have shown that an individual's visual and audio aesthetic preferences carry considerably distinctive features. This paper introduces a novel three-stage audio-aesthetic system that can uniquely identify a user from the set of their favorite songs. The system utilizes a Residual Network (ResNet) for high-level feature extraction. A hybrid meta-heuristic feature selection algorithm based on Cuckoo Search and Whale Optimization is proposed to optimize the extracted features, yielding a low-dimensional feature set. The selected subset of features is fed into an XGBoost classifier to establish a person's identity. The proposed method outperformed the handcrafted feature-based method, achieving 99.54% accuracy on a proprietary dataset (Free Music Archive) and 99.79% accuracy on a publicly available dataset (Million Playlists Dataset).

INDEX TERMS Social behavioral biometrics, deep learning, biometric authentication, audio aesthetics, transfer learning, meta-heuristic, feature selection.
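To make the three-stage pipeline concrete, the following is a minimal Python sketch under stated assumptions: ResNet-50 is used as the backbone (the abstract does not specify the variant), a fixed random binary mask stands in for the hybrid Cuckoo Search/Whale Optimization selector, and random tensors stand in for song spectrograms. It is an illustration of the overall structure, not the authors' implementation.

    # Minimal sketch of the three-stage audio-aesthetic pipeline.
    # Assumptions (not from the paper): ResNet-50 backbone, a random
    # binary mask as a placeholder for the Cuckoo Search / Whale
    # Optimization selector, random spectrogram-like inputs.
    import numpy as np
    import torch
    import torchvision.models as models
    import xgboost as xgb

    # Stage 1: high-level feature extraction with a pre-trained ResNet.
    resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    resnet.fc = torch.nn.Identity()      # keep the 2048-d penultimate features
    resnet.eval()

    def extract_features(batch):         # batch: (N, 3, 224, 224) spectrogram images
        with torch.no_grad():
            return resnet(batch).numpy() # (N, 2048)

    # Placeholder data: 200 "songs" for 10 users (20 favorite songs each).
    X = extract_features(torch.randn(200, 3, 224, 224))
    y = np.repeat(np.arange(10), 20)     # user identity labels

    # Stage 2: meta-heuristic feature selection. A real implementation would
    # evolve binary masks with the hybrid Cuckoo Search / Whale Optimization
    # algorithm; a fixed random mask here only illustrates the dimensionality cut.
    rng = np.random.default_rng(0)
    mask = rng.random(X.shape[1]) < 0.25 # keep roughly 25% of the 2048 features
    X_sel = X[:, mask]

    # Stage 3: establish the user's identity with an XGBoost classifier.
    clf = xgb.XGBClassifier(n_estimators=100, max_depth=4)
    clf.fit(X_sel, y)
    print("train accuracy:", (clf.predict(X_sel) == y).mean())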
Human aesthetics play a significant role in video game development, emotion-aware robot design, online recommender systems, digital humans, and other research domains focusing on human-computer interaction. Social network user recognition based on aesthetic preferences is an emerging research domain. In this paper, a novel deep learning architecture is proposed for multi-modal audio-visual person identification that combines audio and visual aesthetic features. A pre-trained ResNet architecture is utilized to extract high-level features from a set of user-preferred audio and image samples. A novel deep learning-based fusion technique called residual-aided intermediate fusion (RAIF) is introduced to effectively merge the audio and visual features. The proposed RAIF method achieved an accuracy of 98% and a loss of 0.01 on a proprietary multi-modal dataset, indicating its effectiveness in fusing audio and visual information.
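As an illustration of the fusion idea, below is a hedged PyTorch sketch of what a residual-aided intermediate fusion block could look like: per-modality ResNet features are projected to a shared space, concatenated, and passed through a fusion block with a residual skip connection. The layer widths, the placement of the skip, and the feature/class dimensions are assumptions for illustration; the paper's exact RAIF architecture may differ.

    # Hedged sketch of residual-aided intermediate fusion (RAIF).
    # Layer sizes and skip placement are assumptions, not from the paper.
    import torch
    import torch.nn as nn

    class RAIF(nn.Module):
        def __init__(self, feat_dim=2048, hidden=512, num_users=10):
            super().__init__()
            # project each modality's ResNet features to a shared space
            self.audio_proj = nn.Linear(feat_dim, hidden)
            self.visual_proj = nn.Linear(feat_dim, hidden)
            # fusion block operating on the concatenated projections
            self.fuse = nn.Sequential(
                nn.Linear(2 * hidden, 2 * hidden),
                nn.ReLU(),
                nn.Linear(2 * hidden, 2 * hidden),
            )
            self.classifier = nn.Linear(2 * hidden, num_users)

        def forward(self, audio_feat, visual_feat):
            a = torch.relu(self.audio_proj(audio_feat))
            v = torch.relu(self.visual_proj(visual_feat))
            joint = torch.cat([a, v], dim=1)
            # residual connection: the fused representation is added back to
            # the raw concatenation, easing gradient flow through the fusion
            fused = joint + self.fuse(joint)
            return self.classifier(fused)

    # usage with placeholder 2048-d ResNet features for a batch of 4 samples
    model = RAIF()
    logits = model(torch.randn(4, 2048), torch.randn(4, 2048))
    print(logits.shape)  # torch.Size([4, 10])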