Lip reading, also known as visual speech recognition, has recently received considerable attention. Although advanced feature engineering and powerful deep neural network architectures have been proposed for this task, its performance is still not competitive with that of speech recognition using the audio modality as input. This is mainly because, compared with audio, visual features carry less information relevant to word recognition. For example, the voiced sound made while the vocal cords vibrate is captured by audio but is not reflected in mouth or lip movement. In this paper, we map the sequence of mouth movement images directly to a mel-spectrogram to reconstruct the speech-relevant information. Our proposed architecture consists of two components: (a) a mel-spectrogram reconstruction front-end, an encoder-decoder architecture with an attention mechanism that predicts mel-spectrograms from video; and (b) a lip reading back-end consisting of convolutional layers, bi-directional gated recurrent units, and a connectionist temporal classification (CTC) loss, which consumes the generated mel-spectrogram representation to predict text transcriptions. Speaker-dependent evaluation results demonstrate that our proposed model not only generates high-quality mel-spectrograms but also outperforms state-of-the-art models on the GRID benchmark lip reading dataset, with a 0.843% character error rate and a 2.525% word error rate.
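The sketch below is a minimal PyTorch rendering of the two-stage pipeline this abstract describes: a video-to-mel front-end followed by a conv + bi-GRU + CTC lip reading back-end. The input shapes (75 grayscale mouth crops of 64x128 pixels, 80-bin mel-spectrograms, a 28-symbol character set), the layer sizes, and the use of self-attention in place of the paper's attention mechanism are all assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class MelReconstructionFrontEnd(nn.Module):
    """Encoder-decoder with attention: video frames -> mel-spectrogram frames."""

    def __init__(self, mel_bins=80, hidden=256):
        super().__init__()
        # 3D convolution over the frame sequence, then shrink the spatial dims
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 8)),   # keep the time axis intact
        )
        self.encoder_rnn = nn.GRU(32 * 4 * 8, hidden, batch_first=True,
                                  bidirectional=True)
        self.attention = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                               batch_first=True)
        self.decoder = nn.Linear(2 * hidden, mel_bins)

    def forward(self, video):                     # video: (B, 1, T, 64, 128)
        feats = self.conv(video)                  # (B, 32, T, 4, 8)
        B, C, T, H, W = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(B, T, C * H * W)
        enc, _ = self.encoder_rnn(feats)          # (B, T, 2*hidden)
        ctx, _ = self.attention(enc, enc, enc)    # attention over encoder states
        return self.decoder(ctx)                  # (B, T, mel_bins)


class LipReadingBackEnd(nn.Module):
    """Conv + bi-GRU + CTC head consuming the generated mel-spectrogram."""

    def __init__(self, mel_bins=80, vocab_size=28, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(mel_bins, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(hidden, hidden, num_layers=2, batch_first=True,
                          bidirectional=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)   # characters + CTC blank

    def forward(self, mel):                       # mel: (B, T, mel_bins)
        x = self.conv(mel.transpose(1, 2)).transpose(1, 2)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(-1)         # per-frame log-probabilities


if __name__ == "__main__":
    video = torch.randn(2, 1, 75, 64, 128)        # batch of mouth-crop clips
    mel = MelReconstructionFrontEnd()(video)      # (2, 75, 80)
    log_probs = LipReadingBackEnd()(mel)          # (2, 75, 28)
    ctc_loss = nn.CTCLoss(blank=0)(
        log_probs.permute(1, 0, 2),               # CTC expects (T, B, V)
        torch.randint(1, 28, (2, 20)),            # dummy character targets
        torch.full((2,), 75, dtype=torch.long),   # input lengths
        torch.full((2,), 20, dtype=torch.long),   # target lengths
    )
    print(mel.shape, log_probs.shape, ctc_loss.item())
```

In this reading, the front-end is trained to regress mel-spectrogram frames from the video, and the back-end treats those generated frames as if they were acoustic features, decoding characters with CTC.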
Target speech separation refers to isolating target speech from a multi-speaker mixture signal by conditioning on auxiliary information about the target speaker. Unlike mainstream audio-visual approaches, which usually require a simultaneous visual stream as additional input (e.g., the corresponding lip movement sequence), we propose the novel use of a single face profile of the target speaker to separate the expected clean speech. We exploit the fact that the image of a face contains information about the person's speech sound. Compared with a simultaneous visual sequence, a face image is easier to obtain through pre-enrollment or from websites, which enables the system to generalize to devices without cameras. To this end, we incorporate face embeddings extracted from a pretrained face recognition model into the speech separation system, where they guide the prediction of a target speaker mask in the time-frequency domain. The experimental results show that a pre-enrolled face image is able to benefit the separation of the expected speech signal. Additionally, face information is complementary to a voice reference, and we show that further improvement can be achieved when combining both face and voice embeddings.
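Below is a hedged PyTorch sketch of the face-conditioned mask estimation idea: a precomputed face embedding (assumed here to be 512-dimensional, e.g. from a pretrained face recognition network) is projected, broadcast over time, and fused with the mixture spectrogram to predict a time-frequency mask. The fusion scheme, layer sizes, and STFT dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class FaceConditionedSeparator(nn.Module):
    """Predicts a time-frequency mask for the target speaker, conditioned on a
    single face embedding of that speaker."""

    def __init__(self, freq_bins=257, face_dim=512, hidden=256):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, hidden)
        self.rnn = nn.LSTM(freq_bins + hidden, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, freq_bins),
                                       nn.Sigmoid())

    def forward(self, mix_mag, face_emb):
        # mix_mag:  (B, T, F) magnitude spectrogram of the mixture
        # face_emb: (B, face_dim) embedding of the target speaker's face image
        cond = self.face_proj(face_emb).unsqueeze(1)          # (B, 1, hidden)
        cond = cond.expand(-1, mix_mag.size(1), -1)           # repeat over frames
        x, _ = self.rnn(torch.cat([mix_mag, cond], dim=-1))   # fuse per frame
        mask = self.mask_head(x)                              # (B, T, F) in [0, 1]
        return mask * mix_mag                                 # estimated target magnitude


if __name__ == "__main__":
    mix = torch.rand(2, 200, 257)      # |STFT| of two 2-speaker mixtures
    face = torch.randn(2, 512)         # pre-enrolled face embeddings
    est = FaceConditionedSeparator()(mix, face)
    print(est.shape)                   # torch.Size([2, 200, 257])
```

Combining face and voice references, as the abstract suggests, could be sketched by concatenating a speaker (voice) embedding alongside the projected face embedding before the recurrent layers.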