2022
DOI: 10.3390/s22093597

End-to-End Sentence-Level Multi-View Lipreading Architecture with Spatial Attention Module Integrated Multiple CNNs and Cascaded Local Self-Attention-CTC

Abstract: Concomitant with the recent advances in deep learning, automatic speech recognition and visual speech recognition (VSR) have received considerable attention. However, although VSR systems must identify speech from both frontal and profile faces in real-world scenarios, most VSR studies have focused solely on frontal face pictures. To address this issue, we propose an end-to-end sentence-level multi-view VSR architecture for faces captured from four different perspectives (frontal, 30°, 45°, and 60°). The encod…

Cited by 8 publications (6 citation statements)
References 53 publications
“…The current state of the art is 98.31% [93]. The results on this dataset tend to be much higher than those collected in the wild due to the limitation of vocabulary (digit sequences and assigned phrases) as well as the constrained lab scenario the dataset was collected in.…”
Section: OuluVS2 (mentioning)
confidence: 97%
“…                 Year  Language  SOTA Accuracy  Speech Scenario
GRID [90]          2006  English   98.7% [91]     Structured sentences
GRID-Lombard [99]  2018  English   N/A            Structured sentences
OuluVS2 [92]       2015  English   98.31% [93]    Controlled sentences
MODALITY [94]      2017  English   54.00% [94]    Controlled sentences
LRS [45]           2017  English   49.8% [45]     TV interviews
MV-LRS [84]        2017  English   47.2% [84]     TV programs
LRS2-BBC [95]      2018  English   64.8% [96]     TV programs
LRS3-TED [97]      2018  English   63.7% [98]     Formal lectures
CMLR [101]         2019  Mandarin  67.52% [101]   TV programs
LSVSR [100]        2018  English   59.1% [100]    YouTube videos
YTDEV18 [102]      2019  English   N/A            YouTube videos
SynthVSR [103]     2023  English   N/A            Synthetic data…”
Section: Dataset (mentioning)
confidence: 99%
“…Additionally, following the "transition" structure shown in Figure S1A(b) (Table S1), a standard dropout layer was connected, and pixels were randomly dropped to prevent strong correlation in feature maps between successive frames [39]. In addition, the spatial dropout layer connected to the "transition" structure, shown in Figure S2C, was effectively used to extract fine movement features such as lips, teeth, and tongue with strong spatial correlation [31][32][33][34][35][36][37][38][39][40][41][42][43]. Therefore, the proposed dense spatial-temporal CNN network comprises one layer that represents a nonlinear transformation $H_l$, and the output of the layer can be expressed as $x_l$ (3), where $x_0, x_1, \cdots,$ and $x_{l-1}$ denote the volume of the 3D feature created in the previous layer and $[\cdots]$ denotes a concatenation operation.…”
Section: Visual Module (mentioning)
confidence: 99%
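
The dense connectivity described in the statement above can be illustrated with a short sketch. The excerpt does not reproduce Equation (3), so the form assumed here, x_l = H_l([x_0, x_1, ..., x_{l-1}]), is the standard dense-connectivity pattern and may differ from the cited paper's exact formulation. The names `concat`, `make_layer`, `dense_block`, and `growth_rate` are hypothetical, and the stand-in for H_l is a toy function, not the paper's 3D convolutional layer.

```python
from typing import Callable, List

Channel = List[float]        # one flattened feature channel
FeatureMap = List[Channel]   # a stack of channels


def concat(maps: List[FeatureMap]) -> FeatureMap:
    """[x_0, x_1, ..., x_{l-1}]: concatenate along the channel axis."""
    return [channel for fmap in maps for channel in fmap]


def make_layer(growth_rate: int) -> Callable[[FeatureMap], FeatureMap]:
    """Hypothetical stand-in for H_l that emits `growth_rate` new channels.

    A real H_l would be a learned nonlinear transformation; here each new
    channel is a ReLU of the per-position mean over all input channels,
    scaled by the output index (an assumption made only for illustration).
    """
    def h(x: FeatureMap) -> FeatureMap:
        n = len(x)
        mean = [sum(col) / n for col in zip(*x)]
        return [[max(0.0, v * (k + 1)) for v in mean] for k in range(growth_rate)]
    return h


def dense_block(x0: FeatureMap, num_layers: int, growth_rate: int) -> FeatureMap:
    """Each layer sees the concatenation of every earlier output."""
    features = [x0]
    for _ in range(num_layers):
        layer = make_layer(growth_rate)
        features.append(layer(concat(features)))  # x_l = H_l([x_0, ..., x_{l-1}])
    return concat(features)


x0 = [[1.0, -2.0, 3.0], [0.5, 0.5, 0.5]]  # 2 channels, 3 positions each
out = dense_block(x0, num_layers=3, growth_rate=4)
print(len(out))  # channel count: 2 + 3 * 4 = 14
```

Because every layer's output is kept and concatenated, the channel count grows linearly with depth (input channels plus `num_layers * growth_rate`), which is why dense architectures usually insert a "transition" stage to reduce it.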
“…Many studies initially centered on 2D fully convolutional networks [4,5]. But as hardware improved, 3D convolutions over 2D convolutions [6][7][8] or recurrent neural networks [9,10] quickly became an option for more effective use of temporal information. This concept has developed into a specific architecture consisting of two parts: the front end and the back end.…”
Section: Related Work (mentioning)
confidence: 99%