2021
DOI: 10.3390/s22010072

Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition

Abstract: In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visua…

Cited by 22 publications (9 citation statements)
References 41 publications
“…Additionally, following the "transition" structure shown in Figure S1A(b) (Table S1), a standard dropout layer was connected, and pixels were randomly dropped to prevent strong correlation in feature maps between successive frames [39]. In addition, the spatial dropout layer connected to the "transition" structure, shown in Figure S2C, was effectively used to extract fine movement features such as lips, teeth, and tongue with strong spatial correlation [31][32][33][34][35][36][37][38][39][40][41][42][43]. Therefore, the proposed dense spatial-temporal CNN network comprises one layer that represents a nonlinear transformation H_l, and the output of the layer can be expressed as x_l (3), where x_0, x_1, …, and x_(l-1) denote the volume of the 3D feature created in the previous layer and […] denotes a concatenation operation.…”
Section: Visual Module (mentioning)
confidence: 99%
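The excerpt above refers to equation (3) but its body did not survive in the quotation. Assuming the cited work uses the usual DenseNet-style dense connectivity, which the description of concatenating all earlier 3D feature volumes suggests, the relation would read roughly as follows. This is a reconstruction under that assumption, not a quotation from the citing paper.

```latex
x_l = H_l\bigl([\,x_0, x_1, \ldots, x_{l-1}\,]\bigr)
```

Here H_l is the nonlinear transformation of layer l, and the brackets denote concatenation of the feature volumes from layers 0 through l-1, matching the symbols named in the excerpt.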
“…This proposed architecture [12] has been very effective. It has achieved a 17.5% recognizable improvement in accuracy in the lip-reading datasets LRW [13] and LRW-1000 [14], and even now, many state-of-the-art lip-reading solutions still use it as its meta-architecture [15][16][17][18]. Following the level of recognized speech, in this section, we present related work in the current research area by distributing it into four sub-sections:…”
Section: Related Work (mentioning)
confidence: 99%
“…Therefore, the end-to-end deep learning architecture is the natural direction of scientific research as opposed to the previous manual feature extraction classification methodology for automatic lip-reading technology. Researchers use convolutional neural networks (CNN) to focus on interest regions in the last few years, and they have also been quite successful at classifying images and detecting targets [8], [9].…”
Section: Article History (mentioning)
confidence: 99%