Visual Speech Recognition (VSR) is an active area of computer vision research, attracting interest for its ability to analyze lip motion and convert it into text. VSR systems leverage visual features to augment automatic speech understanding and predict text. VSR has a wide range of applications, including enhancing speech recognition under degraded acoustic conditions, aiding individuals with hearing impairments, bolstering security by reducing reliance on text-based passwords, enabling liveness detection in biometric authentication, and supporting underwater communication. Despite the many techniques proposed to improve the resilience and precision of automatic speech recognition, VSR still faces challenges such as homophones (words with visually indistinguishable lip movements), optimization difficulties caused by varying sequence lengths, and the need to model both short- and long-range correlations between consecutive video frames. We propose a hybrid network (HNet) built on a multilayered dilated three-dimensional convolutional neural network (3D-CNN). The dilated 3D-CNN performs spatio-temporal feature extraction. HNet then integrates two bidirectional recurrent neural networks (BiGRU and BiLSTM) that process the feature sequences in both directions to capture temporal relationships. Fusing the complementary strengths of BiGRU and BiLSTM allows the model to process feature sequences more comprehensively and effectively. The proposed work targets face-based biometric authentication with liveness detection, using the VSR model to strengthen security against face spoofing. Existing face-based biometric systems are widely used for individual authentication and verification but remain vulnerable to 3D masks and adversarial attacks. The VSR system can be added to an existing face-based verification system as a second-level authentication step that verifies both identity and liveness. The VSR system follows a challenge-response protocol: the user silently pronounces a passcode displayed on the screen, and effectiveness is assessed by the word error rate (WER) between the pronounced passcode and the displayed one. Overall, the proposed work aims to improve the accuracy of VSR so that it can be combined with existing face-based authentication systems. The proposed system outperforms existing VSR systems, achieving a WER of 1.3%. The significance of the proposed hybrid model is that it efficiently captures temporal dependencies, enhances context embedding, improves robustness to input variability, reduces information loss, and increases accuracy in modeling and analyzing passcode pronunciation patterns.
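To make the described pipeline concrete, the following is a minimal, illustrative PyTorch sketch of the HNet topology outlined above: a dilated 3D-CNN front-end feeding a BiGRU and a BiLSTM, with per-frame logits suitable for a CTC-style loss. All layer widths, kernel sizes, and the 28-class vocabulary are hypothetical placeholders chosen for the example, not the authors' exact configuration.

    import torch
    import torch.nn as nn

    class HNetSketch(nn.Module):
        """Illustrative sketch only: dilated 3D-CNN front-end followed by
        BiGRU and BiLSTM back-ends. All sizes below are assumptions."""
        def __init__(self, vocab_size=28, hidden=256):
            super().__init__()
            # Spatio-temporal front-end: stacked 3D convolutions with dilation,
            # padded so the time axis is preserved.
            self.frontend = nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=3, padding=2, dilation=2),
                nn.ReLU(),
                nn.MaxPool3d((1, 2, 2)),                 # pool space, keep time
                nn.Conv3d(32, 64, kernel_size=3, padding=2, dilation=2),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d((None, 4, 4)),      # keep time, fix spatial size
            )
            # Bidirectional recurrent back-end: BiGRU first, then BiLSTM,
            # so each frame's features carry context from both directions.
            self.bigru = nn.GRU(64 * 4 * 4, hidden, bidirectional=True, batch_first=True)
            self.bilstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
            self.classifier = nn.Linear(2 * hidden, vocab_size)

        def forward(self, x):            # x: (batch, channels, time, height, width)
            f = self.frontend(x)         # (batch, 64, time, 4, 4)
            b, c, t, h, w = f.shape
            f = f.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)  # (batch, time, features)
            f, _ = self.bigru(f)
            f, _ = self.bilstm(f)
            return self.classifier(f)    # per-frame logits, e.g. for a CTC loss

    # Example usage on a dummy 75-frame RGB lip-region clip:
    # model = HNetSketch()
    # logits = model(torch.randn(2, 3, 75, 64, 64))   # -> (2, 75, 28)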
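For reference, WER is the standard edit-distance metric used here to compare the recognized passcode against the displayed one:

    WER = (S + D + I) / N

where S, D, and I are the numbers of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, and N is the number of words in the reference. For example, if a four-word displayed passcode is decoded with one substituted word, WER = 1/4 = 25%.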