Survey on automatic lip-reading in the era of deep learning

Fernandez-Lopez, Adriana; Sukno, Federico M.

doi:10.1016/j.imavis.2018.07.002

Cited by 104 publications

(77 citation statements)

References 115 publications

(203 reference statements)

“…In such environments, there is a need to build a speech recognizer that can use both audio and facial visual information. It would also be necessary to have a speech recognizer that would solely use visual information and does not depend on audio input (Fernandez-Lopez & Sukno, 2018). i.e., such speech recognizer should extract features from the facial parts, especially lip movement, and facial expressions.…”

Section: Literature Survey (mentioning)

confidence: 99%

“…i.e., such speech recognizer should extract features from the facial parts, especially lip movement, and facial expressions. Several attempts have been made to detect speech from the lip movement by (Almajai, Cox, Harvey, & Lan, 2016;Chung, Senior, Vinyals, & Zisserman, 2017;Dupont & Luettin, 2000;Petridis & Pantic, 2016;Sui, Bennamoun, & Togneri, 2015;Wand, Koutnik, & Schmidhuber, 2016;Yau, Kumar, & Weghorn, 2007;Zhou, Zhao, Hong, & Pietikäinen, 2014), but the results are much low as compared with audio speech recognizers (Fernandez-Lopez & Sukno, 2018).…”

Section: Literature Survey (mentioning)

confidence: 99%

“…But some of the authors argue that a sequence of words appearing in a sentence has a pattern and such sequence of words can resolve the mapping of lip movement patterns to a word (Afouras, Chung, & Zisserman, 2018;Assael, Shillingford, Whiteson, & de Freitas, 2016;. Therefore, the autonomous lip-reading system with high accuracy remains to be challenging one (Fernandez-Lopez & Sukno, 2018 poor video resolution and low frame rates of videos (Buchan, Paré, & Munhall, 2007;Hilder et al, 2009;Ortiz, 2008).…”

Section: Literature Survey (mentioning)

confidence: 99%

“…The recent works have used Recurrent neural network/ LSTM to do the same. When their results are compared, the Markova modules perform better than LSTM, and it is mainly due to the unavailability of large data repositories (Fernandez-Lopez & Sukno, 2018). All the earlier research works on the lip-reading system have three steps in common (Fernandez-Lopez & Sukno, 2018).…”

Section: Literature Survey (mentioning)

confidence: 99%

LSTM Based Lip Reading Approach for Devanagiri Script

Patil, Chickerur, Meti, et al. 2019. ADCAIJ

Speech Communication in a noisy environment is a difficult and challenging task. Many professionals work in noisy environments like aviation, constructions, or manufacturing, and find it difficult to communicate orally. Such noisy environments need an automated lipreading system that could be helpful in communicating some instructions and commands. This paper proposes a novel lipreading solution, which extracts the geometrical shape of lip movement from the video and predicts the words/ sentences spoken. An Indian specific language data set is developed which consists of lip movement information captured from 50 persons. This includes students in the age group of 18 to 20 years and faculty in the age group of 25 to 40 years. All have spoken a paragraph of 58 words within 10 sentences in Hindi (Devanagari, spoken in India) language which was recorded under various conditions. The implementation consists of facial parts detection, along with Long short term memory's. The proposed solution is able to predict the words spoken with 77% and 35% accuracy for data set of 3 and 10 words respectively. The sentences are predicted with 20% accuracy, which is encouraging.

“…The field of visual speech recognition (VSR), or lipreading, has witnessed dramatic breakthroughs recently, primarily due to the paradigm shift from hand-crafted features to deep learning based models [1][2][3][4][5][6][7][8], coupled with the public release of large suitable corpora in a variety of environments [9][10][11][12][13][14][15], as also reviewed in [16,17]. Such models however, while reducing recognition errors compared to previous approaches, are not as efficient to compute and store.…”

Section: Introduction (mentioning)

confidence: 99%

MobiLipNet: Resource-Efficient Deep Learning Based Lipreading

Koumparoulis, Potamianos. 2019. Interspeech 2019

Recent works in visual speech recognition utilize deep learning advances to improve accuracy. Focus however has been primarily on recognition performance, while ignoring the computational burden of deep architectures. In this paper we address these issues concurrently, aiming at both high computational efficiency and recognition accuracy in lipreading. For this purpose, we investigate the MobileNet convolutional neural network architectures, recently proposed for image classification. In addition, we extend the 2D convolutions of MobileNets to 3D ones, in order to better model the spatio-temporal nature of the lipreading problem. We investigate two architectures in this extension, introducing the temporal dimension as part of either the depthwise or the pointwise MobileNet convolutions. To further boost computational efficiency, we also consider using pointwise convolutions alone, as well as networks operating on half the mouth region. We evaluate the proposed architectures on speaker-independent visual-only continuous speech recognition on the popular TCD-TIMIT corpus. Our best system outperforms a baseline CNN by 4.27% absolute in word error rate and over 12 times in computational efficiency, whereas, compared to a state-of-the-art ResNet, it is 37 times more efficient at a minor 0.07% absolute error rate degradation.
