2018 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT)
DOI: 10.1109/iceeccot43722.2018.9001505
Convolutional Neural Networks for Predicting Words: A Lip-Reading System

Cited by 12 publications (5 citation statements)
References 9 publications
“…Apart from improper datasets and traditional methodologies, a proper combination of DL models and datasets is also deemed essential. This is evident in [157]–[160], where there is a significant gap between the models' training and test accuracy. Accurate recognition of spoken words enhances a system's user experience, reliability, and productivity, making it an essential consideration when designing and evaluating VSR systems.…”
Section: E. The Vital Combination of DL Architecture and Proper Datasets
confidence: 94%
“…The network has a deep architecture with 48 layers, including stem layers, inception modules, and fully connected layers. In a 2018 study, P. Sindhura et al. [157] thoroughly evaluated AlexNet alongside Inception V3 for lip reading. Both models were trained on the MIRACL-VC1 dataset and were analysed from both speaker-dependent and speaker-independent perspectives.…”
Section: Sequence of Video Frames
confidence: 99%
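The speaker-dependent versus speaker-independent distinction mentioned above can be sketched as two data-splitting strategies. This is a hypothetical illustration, not the cited study's code: the sample dictionaries, field names, and split functions are assumptions made for clarity.

```python
# Hypothetical sketch of the two evaluation protocols: the sample records
# and function names are illustrative, not a MIRACL-VC1 loader.

def speaker_independent_split(samples, held_out_speaker):
    """Speaker-independent: the test speaker never appears in training."""
    train = [s for s in samples if s["speaker"] != held_out_speaker]
    test = [s for s in samples if s["speaker"] == held_out_speaker]
    return train, test

def speaker_dependent_split(samples, test_fraction=0.2):
    """Speaker-dependent: every speaker contributes to both partitions."""
    by_speaker = {}
    for s in samples:
        by_speaker.setdefault(s["speaker"], []).append(s)
    train, test = [], []
    for clips in by_speaker.values():
        cut = max(1, int(len(clips) * test_fraction))  # at least one test clip
        test.extend(clips[:cut])
        train.extend(clips[cut:])
    return train, test
```

The speaker-independent protocol is the harder test, since the model must generalise to lip shapes and articulation styles it has never seen.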
“…The widely used [2,3,5,58,59] GRID audio-visual corpus [60] is utilised for training speech reconstruction models, allowing comparisons with other related research. The GRID corpus includes audio and video recordings of 34 speakers, each with 1000 utterances; each utterance consists of six words, one drawn from each of six categories, over a 51-word vocabulary.…”
Section: Dataset
confidence: 99%
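The GRID sentence structure described above can be sketched concretely: each utterance draws one word from each of six fixed categories, and the category sizes sum to the 51-word vocabulary. A minimal sketch, assuming the standard GRID word lists (command, colour, preposition, letter, digit, adverb):

```python
import random

# GRID category word lists: 4 + 4 + 4 + 25 + 10 + 4 = 51 words total.
GRID_CATEGORIES = {
    "command": ["bin", "lay", "place", "set"],
    "color": ["blue", "green", "red", "white"],
    "preposition": ["at", "by", "in", "with"],
    "letter": list("abcdefghijklmnopqrstuvxyz"),  # 25 letters; 'w' is excluded
    "digit": [str(d) for d in range(10)],
    "adverb": ["again", "now", "please", "soon"],
}

def random_grid_sentence(rng=random):
    """Build one GRID-style utterance: one word per category, in order."""
    return " ".join(rng.choice(words) for words in GRID_CATEGORIES.values())
```

A generated utterance such as "place red at b 5 now" follows the fixed six-slot template, which is what makes GRID tractable for word-level visual speech models.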
“…The results are also consistent between accuracy and the F1 score. On the original dataset, MediaPipe + LRCN with 3 CNN layers achieves superior results (87%) compared to Inception V3 (86.6%) [48], CNN (52.9%) [48], VGG-16 + LSTM (59%) [47], 3D-CNN (70.2%) [51], and 3D-CNN + LSTM (85%) [52].…”
confidence: 99%
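The two metrics compared in the statement above (accuracy and F1) can be computed from scratch; a minimal sketch with toy labels, not the cited experiments:

```python
# Minimal from-scratch accuracy and macro-averaged F1; toy data only.

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Per-class F1, averaged with equal weight per class (macro average)."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

When class frequencies are roughly balanced, as in fixed-vocabulary lip-reading benchmarks, accuracy and macro F1 tend to agree closely, which is the consistency the statement observes.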