2022
DOI: 10.1109/tmm.2021.3102433

Adaptive Semantic-Spatio-Temporal Graph Convolutional Network for Lip Reading

Abstract: The goal of this work is to recognize words, phrases, and sentences being spoken by a talking face without being given the audio. Current deep learning approaches for lip reading focus on exploring the appearance and optical flow information of videos. However, these methods do not fully exploit the characteristics of lip motion. In addition to appearance and optical flow, the mouth contour deformation usually conveys significant information that is complementary to others. However, the modeling of dynamic mouth con…

Cited by 15 publications (8 citation statements)
References 56 publications
“…Recently, Fenghour et al [43] conducted a survey reviewing deep learning driven VSR methods, including audio-visual datasets, feature extraction, classification networks and classification schemes. However, some essential advances of VSR were omitted, such as self-supervised learning methods [47,48,49,50], cross-modal knowledge distillation methods [34,51,52], graph neural networks backbone architectures [53,54], etc. Chen et al [42] conducted a thoughtful analysis across several representative identity-independent VSG methods and designed a performance evaluation benchmark for VSG.…”
Section: Differences With Related Surveys (mentioning)
confidence: 99%
“…Among them, mouth-centered videos and dense optical flow are regular grid data, so CNNs are the most suitable and commonly used backbone architectures for them. On the other hand, as landmark points are irregular data, some existing works [53,54,118] adopted Graph Convolution Networks (GCNs) to extract visual features from landmark points. Next, we review these backbone architectures.…”
Section: Visual Frontend Network (mentioning)
confidence: 99%
“…Surface electromyography [16], [17], vision [18], [19], [20], [21], [22], [23], ultrasound imaging [24], [25], and radar [26], [27], [28], [29], [30], [31] are techniques for capturing nonacoustic speech-related biosignals without the need to place sensors inside the oral cavity. Although these techniques are more convenient than the aforementioned ones, they have some shortcomings.…”
Section: Introduction (mentioning)
confidence: 99%