Convolutional Neural Networks for Predicting Words: A Lip-Reading System

Sindhura, PV; Preethi, S.; Niranjana, Krupa B

doi:10.1109/iceeccot43722.2018.9001505

Cited by 12 publications

(5 citation statements)

References 9 publications


“…Apart from improper datasets and traditional methodologies, a proper combination of DL models and datasets is also deemed essential. This can be shown in [157]-[160] where there is a significant difference in the Training and Test Accuracy of the model. Accurate recognition of spoken words enhances the system's user experience, reliability, and productivity, making it an essential consideration when designing and evaluating VSR systems.…”

Section: E. The Vital Combination of DL Architecture and Proper Datasets

mentioning

confidence: 94%

“…The network has a deep architecture with 48 layers, including stem layers, inception modules, and fully connected layers. In a study by P. Sindhura et al [157] in 2018, AlexNet was thoroughly evaluated alongside Inception V3 for lip reading. Both models were trained using the MIRACL-VC1 dataset and were analyzed from both speaker-dependent and speaker-independent perspectives.…”

Section: Sequence of Video Frames

mentioning

confidence: 99%


Deep Learning-Based Holistic Speaker Independent Visual Speech Recognition

Nemani, Krishna, Ramisetty et al. 2023

IEEE Trans. Artif. Intell.


Speaker-independent visual speech recognition (VSR) is a complex task that involves identifying spoken words or phrases from video recordings of a speaker's facial movements. Decoding the intricate visual dynamics of a speaker's mouth in a high-dimensional space is a significant challenge in this field. To address this challenge, researchers have employed advanced techniques that enable machines to recognize human speech through visual cues automatically. Over the years, there has been a considerable amount of research in the field of VSR involving different algorithms and datasets to evaluate system performance. These efforts have resulted in significant progress in developing effective VSR models, creating new opportunities for further research in this area. This survey provides a detailed examination of the progression of VSR over the past three decades, with a particular emphasis on the transition from speaker-dependent to speaker-independent systems. We also provide a comprehensive overview of the various datasets used in VSR research and the preprocessing techniques employed to achieve speaker independence. The survey covers the works published from 1990 to 2023, thoroughly analyzing each work and comparing them on various parameters. This survey provides an in-depth analysis of speaker-independent VSR systems evolution from 1990 to 2023. It outlines the development of VSR systems over time and highlights the need to develop end-to-end pipelines for speaker-independent VSR. The pictorial representation offers a clear and concise overview of the techniques used in speaker-independent VSR, thereby aiding in the comprehension and analysis of the various methodologies. The survey also highlights the strengths and limitations of each technique and provides insights into developing novel approaches for analyzing visual speech cues. Overall, this comprehensive review provides insights into the current state-of-the-art speaker-independent VSR and highlights potential areas for future research.
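
The citation statement above notes that the cited models were analyzed from both speaker-dependent and speaker-independent perspectives. As a minimal sketch of what that distinction usually means for a train/test split (not the evaluation protocol of Sindhura et al., which is not described here), the Python below builds both kinds of split over toy data. The record layout (`speaker`, `word`, `take`), the speaker IDs, the word labels, and the 20% test fraction are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def speaker_independent_split(samples, test_speakers):
    """Hold out whole speakers: no test speaker is ever seen in training."""
    held_out = set(test_speakers)
    train = [s for s in samples if s["speaker"] not in held_out]
    test = [s for s in samples if s["speaker"] in held_out]
    return train, test


def speaker_dependent_split(samples, test_fraction=0.2, seed=0):
    """Hold out some clips of every speaker: each speaker is on both sides."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for s in samples:
        by_speaker[s["speaker"]].append(s)

    train, test = [], []
    for clips in by_speaker.values():
        clips = clips[:]  # copy so the caller's list is not reordered
        rng.shuffle(clips)
        n_test = max(1, int(len(clips) * test_fraction))
        test.extend(clips[:n_test])
        train.extend(clips[n_test:])
    return train, test


# Toy data: 5 hypothetical speakers x 2 hypothetical words x 5 takes each.
samples = [
    {"speaker": f"S{sp:02d}", "word": word, "take": take}
    for sp in range(1, 6)
    for word in ("begin", "stop")
    for take in range(5)
]

si_train, si_test = speaker_independent_split(samples, test_speakers=["S05"])
sd_train, sd_test = speaker_dependent_split(samples, test_fraction=0.2)

si_train_speakers = {s["speaker"] for s in si_train}
si_test_speakers = {s["speaker"] for s in si_test}
sd_train_speakers = {s["speaker"] for s in sd_train}
sd_test_speakers = {s["speaker"] for s in sd_test}
```

In the speaker-independent split the two speaker sets are disjoint, so test accuracy reflects generalization to unseen faces; in the speaker-dependent split the sets coincide, which typically makes the task easier and may help explain gaps between reported accuracies.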


“…The widely used [2,3,5,58,59] GRID audio-visual corpus [60] is utilised for training speech reconstruction models, allowing comparisons with other related research. The GRID corpus includes audio and video recordings of 34 speakers, each with 1000 utterances, with each utterance consisting of a combination of six words categorised into six categories from a 51-word vocabulary.…”

Section: Dataset

mentioning

confidence: 99%

Lip2Speech: Lightweight Multi-Speaker Speech Reconstruction with Gabor Features

Dong, Xu, Abel et al. 2024

Applied Sciences


In environments characterised by noise or the absence of audio signals, visual cues, notably facial and lip movements, serve as valuable substitutes for missing or corrupted speech signals. In these scenarios, speech reconstruction can potentially generate speech from visual data. Recent advancements in this domain have predominantly relied on end-to-end deep learning models, like Convolutional Neural Networks (CNN) or Generative Adversarial Networks (GAN). However, these models are encumbered by their intricate and opaque architectures, coupled with their lack of speaker independence. Consequently, achieving multi-speaker speech reconstruction without supplementary information is challenging. This research introduces an innovative Gabor-based speech reconstruction system tailored for lightweight and efficient multi-speaker speech restoration. Using our Gabor feature extraction technique, we propose two novel models: GaborCNN2Speech and GaborFea2Speech. These models employ a rapid Gabor feature extraction method to derive low-dimensional mouth region features, encompassing filtered Gabor mouth images and low-dimensional Gabor features as visual inputs. An encoded spectrogram serves as the audio target, and a Long Short-Term Memory (LSTM)-based model is harnessed to generate coherent speech output. Through comprehensive experiments conducted on the GRID corpus, our proposed Gabor-based models have showcased superior performance in sentence and vocabulary reconstruction when compared to traditional end-to-end CNN models. These models stand out for their lightweight design and rapid processing capabilities. Notably, the GaborFea2Speech model presented in this study achieves robust multi-speaker speech reconstruction without necessitating supplementary information, thereby marking a significant milestone in the field of speech reconstruction.


“…The consistency of the results between accuracy and the F1 score is also not far away. For the original dataset, MediaPipe + LRCN with 3 CNN layers has superior results (87%) compared to Inception V3 (86.6%) [48], CNN (52.9%) [48], VGG-16+LSTM [47] (59%), 3D-CNN [51] (70.2), and 3D-CNN + LSTM [52] (85%).…”

mentioning

confidence: 99%

Indonesian Lip‐Reading Detection and Recognition Based on Lip Shape Using Face Mesh and Long‐Term Recurrent Convolutional Network

Aripin, Setiawan 2024

Applied Computational Intelligence and Soft Computing


Communication through speech can be hindered by environmental noise, prompting the need for alternative methods such as lip reading, which bypasses auditory challenges. However, the accurate interpretation of lip movements is impeded by the uniqueness of individual lip shapes, necessitating detailed analysis. In addition, the development of an Indonesian dataset addresses the lack of diversity in existing datasets, predominantly in English, fostering more inclusive research. This study proposes an enhanced lip-reading system trained using the long-term recurrent convolutional network (LRCN) considering eight different types of lip shapes. MediaPipe Face Mesh precisely detects lip landmarks, enabling the LRCN model to recognize Indonesian utterances. Experimental results demonstrate the effectiveness of the approach, with the LRCN model with three convolutional layers (LRCN-3Conv) achieving 95.42% accuracy for word test data and 95.63% for phrases, outperforming the convolutional long short-term memory (Conv-LSTM) method. The proposed approach outperforms Conv-LSTM in terms of accuracy. Furthermore, the evaluation of the original MIRACL-VC1 dataset also produced a best accuracy of 90.67% on LRCN-3Conv compared to previous studies in the word-labeled class. The success is attributed to MediaPipe Face Mesh detection, which facilitates the accurate detection of the lip region. Leveraging advanced deep learning techniques and precise landmark detection, these findings promise improved communication accessibility for individuals facing auditory challenges.
