Lip-reading is the process of interpreting speech by visually analysing lip movements. Recent research in this area has shifted from simple word recognition to lip-reading sentences in the wild. This paper uses phonemes as a classification schema for lip-reading sentences, both to explore an alternative schema and to enhance system performance. Other classification schemas, including character-based and viseme-based schemas, have also been investigated. The visual front-end of the system consists of a spatial-temporal (3D) convolution followed by a 2D ResNet. The phoneme recognition model is a Transformer that uses multi-headed attention, and a Recurrent Neural Network serves as the language model. The performance of the proposed system has been evaluated on the BBC Lip Reading Sentences 2 (LRS2) benchmark dataset. Compared with state-of-the-art approaches to lip-reading sentences, the proposed system achieves a word error rate that is, on average, 10% lower under varying illumination ratios.

KEYWORDS: deep learning, deep neural networks, lip-reading, phoneme-based lip-reading, spatial-temporal convolution, transformers
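The abstract describes the visual front-end only at a high level; the following is a minimal PyTorch sketch of that stage under stated assumptions. The kernel sizes, strides, channel widths, and the names `VisualFrontEnd` and `resnet2d` are illustrative assumptions, not the paper's actual hyperparameters: a 3D convolutional stem captures short-range lip motion across frames, and a 2D ResNet (here, any per-frame 2D feature extractor) is then applied frame by frame.

```python
import torch
import torch.nn as nn


class VisualFrontEnd(nn.Module):
    """Hypothetical sketch of the front-end named in the abstract:
    a spatial-temporal (3D) convolution stem followed by a 2D network
    applied per frame. All layer shapes are illustrative assumptions."""

    def __init__(self, resnet2d: nn.Module, stem_channels: int = 64):
        super().__init__()
        # 3D convolution over (time, height, width); kernel/stride values assumed.
        self.stem = nn.Sequential(
            nn.Conv3d(1, stem_channels, kernel_size=(5, 7, 7),
                      stride=(1, 2, 2), padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(stem_channels),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Any 2D CNN mapping a (C, H, W) frame to a feature vector, e.g. a ResNet trunk.
        self.resnet2d = resnet2d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, H, W) grayscale mouth crops
        x = self.stem(x)                               # (B, C, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # fold time into the batch axis
        feats = self.resnet2d(x)                       # (B*T, D) per-frame features
        return feats.reshape(b, t, -1)                 # (B, T, D) sequence of features


# Usage sketch with a trivial stand-in for the 2D ResNet:
frontend = VisualFrontEnd(resnet2d=nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten()))
clip = torch.randn(2, 1, 29, 112, 112)  # 2 clips of 29 grayscale 112x112 frames
print(frontend(clip).shape)              # torch.Size([2, 29, 64])
```

The resulting (batch, time, feature) sequence is the kind of input a Transformer-based phoneme recogniser with multi-headed attention would consume downstream.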