This paper introduces a novel hybrid deep learning model for audio-visual source separation, focusing on the precise isolation of a target speaker's voice from video content. By leveraging both audio and visual cues, our model achieves accurate separation of the targeted speech signal. In particular, it incorporates the speaker's facial expressions as an auxiliary cue to enhance the extraction of their distinctive vocal qualities. Through unsupervised learning on unannotated video data, the model jointly learns audio-visual speech separation and latent representations of distinctive speaker attributes, known as speaker embeddings. The model is speaker-independent: it first extracts features from both the audio and visual inputs before performing deep multimodal analysis. Facial attribute features act as an identifying code that helps the network locate the speaker's frequency range and other audio properties. We evaluate the model on the widely used AVSpeech dataset, where it yields an improvement of 7.7 dB in source-to-distortion ratio (SDR).
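The pipeline outlined above can be sketched roughly as follows. This is a minimal illustration, assuming a spectrogram-masking formulation with a PyTorch-style audio encoder, a visual encoder over per-frame face-attribute embeddings, and a recurrent fusion module; all module names, dimensions, and the mask-based output are assumptions for exposition, not the paper's actual architecture.

```python
# Minimal sketch of an audio-visual target-speaker separation pipeline.
# All module names, dimensions, and the mask-based formulation are
# illustrative assumptions, not the architecture proposed in the paper.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes a magnitude spectrogram (B, 1, F, T) into frame-level features."""
    def __init__(self, freq_bins=257, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(64 * freq_bins, hidden)

    def forward(self, spec):                      # spec: (B, 1, F, T)
        x = self.conv(spec)                       # (B, 64, F, T)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        return self.proj(x)                       # (B, T, hidden)

class VisualEncoder(nn.Module):
    """Maps per-frame face-attribute embeddings (B, T, D_face) into the fusion space."""
    def __init__(self, face_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(face_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, face_feats):                # (B, T, D_face)
        return self.mlp(face_feats)               # (B, T, hidden)

class AVSeparator(nn.Module):
    """Fuses the audio and visual streams and predicts a spectrogram mask
    that isolates the target speaker."""
    def __init__(self, freq_bins=257, hidden=256):
        super().__init__()
        self.audio_enc = AudioEncoder(freq_bins, hidden)
        self.visual_enc = VisualEncoder(hidden=hidden)
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.mask_head = nn.Linear(2 * hidden, freq_bins)

    def forward(self, spec, face_feats):
        a = self.audio_enc(spec)                  # (B, T, H)
        v = self.visual_enc(face_feats)           # (B, T, H), assumed time-aligned with audio frames
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        mask = torch.sigmoid(self.mask_head(fused))        # (B, T, F)
        mask = mask.permute(0, 2, 1).unsqueeze(1)          # (B, 1, F, T)
        return mask * spec                                  # masked target-speaker spectrogram
```

In this sketch the visual stream plays the role of the identifying code described above: the face-attribute embedding conditions the mask estimation so that the network attends to the target speaker's time-frequency content rather than that of interfering speakers.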