Multimodal speech recognition: increasing accuracy using high speed video data

Ivanko, Denis; Karpov, Alexey; Fedotov, Dmitrii; Kipyatkova, Irina; Ryumin, Dmitry; Ivanko, Dmitriy; Minker, Wolfgang; Železný, Miloš

doi:10.1007/s12193-018-0267-1

Cited by 25 publications

(12 citation statements)

References 33 publications

“…They are created for different purposes and with different means. Works [28], [29] contain a comprehensive list and analysis of such databases from the audio-visual speech recognition point of view.…”

Section: Multimodal Corpora For Audio-visual Speech Recognition In

mentioning

confidence: 99%

Multimodal Corpus Design for Audio-Visual Speech Recognition in Vehicle Cabin

et al. 2021

Self Cite

This paper introduces a new methodology, aimed at driver comfort, for in-the-wild multimodal corpus creation for audiovisual speech recognition in driver monitoring systems. The presented methodology is universal and can be used for corpus recording in different languages. We present an analysis of speech recognition systems and voice interfaces for driver monitoring systems based on both audio and video data. Multimodal speech recognition allows using audio data when video data are useless (e.g., at nighttime), as well as applying video data in acoustically noisy conditions (e.g., on highways). Our methodology identifies the main steps and requirements for multimodal corpus design, including the development of a new framework for audiovisual corpus creation. We identify the main research questions related to the speech corpus creation task and discuss them in detail in this paper. We also consider the main use cases that require speech recognition in a vehicle cabin for interaction with a driver monitoring system, as well as other important use cases in which the system detects dangerous states of driver drowsiness and starts a question-answer game to prevent dangerous situations. Finally, based on the proposed methodology, we developed a mobile application that allows us to record a corpus for the Russian language. Using this application, we created the RUSAVIC corpus, which is at the moment a unique audiovisual corpus for the Russian language recorded in in-the-wild conditions.

Index terms: driver monitoring, automatic speech recognition, multimodal corpus, human-computer interaction.

“…As well as commonly used visual features extraction algorithms and visual speech modeling methods. In addition, we started with investigation of region-of-interest (ROI) detection approaches (Ivanko et al, 2018a). We found out that Active Appearance Models-based and Haar-like features-based methods most widely used for this purpose.…”

Section: Backgrounds and Related Research

mentioning

confidence: 99%

A Novel Task-Oriented Approach Toward Automated Lip-Reading System Implementation

Ivanko

Ryumin

2021

Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci.

Abstract. Visual information plays a key role in automatic speech recognition (ASR) when audio is corrupted by background noise, or even inaccessible. Speech recognition using visual information is called lip-reading. The initial idea of visual speech recognition comes from human experience: we are able to recognize spoken words by observing a speaker's face with limited or no access to the sound of the voice. Based on the conducted experimental evaluations as well as on an analysis of the research field, we propose a novel task-oriented approach towards practical lip-reading system implementation. Its main purpose is to serve as a roadmap for researchers who need to build a reliable visual speech recognition system for their task. As a rough approximation, the task of lip-reading can be divided into two parts, depending on the complexity of the problem: first, recognition of isolated words, numbers or short phrases (e.g., telephone numbers with a strict grammar, or keywords); second, recognition of continuous speech (phrases or sentences). All these stages are described in detail in this paper. Based on the proposed approach, we implemented from scratch automatic visual speech recognition systems of three different architectures: GMM-CHMM, DNN-HMM and purely end-to-end. The methodology, tools, step-by-step development and all necessary parameters are described in detail in the paper. It is worth noting that such systems were created for Russian speech recognition for the first time.

“…There are several state-of-the-art methods for model training. Initially, the most widespread methods were based on the use of hidden Markov models (HMM) for visual speech recognition and their coupled or multistream versions for audio-visual speech recognition [23]. However, at present, the approaches based on the use of neural networks of different architectures have become increasingly popular [24].…”

Section: Backgrounds

mentioning

confidence: 99%

Developing of a Software–Hardware Complex for Automatic Audio–Visual Speech Recognition in Human–Robot Interfaces

Ivanko

Ryumin

Karpov

2021

Electromechanics and Robotics

Self Cite

In recent years, speech has become increasingly popular and widely used in modern human-robot interfaces. Such a natural form of communication is highly appreciated by users. There is no doubt that in the near future, as the technology develops, we will see further development of such "native" human-robot interfaces. In this paper, we propose the architecture of, and develop, a software-hardware complex designed for automatic speech recognition with small- and medium-sized vocabularies and intended for use in robots. A distinctive feature of the developed software-hardware complex is the presence of an audiovisual speech synchronization module, which makes it possible (1) to detect a speech signal in audio data and (2) to take into account the natural asynchrony between acoustic and visual speech. Based on this, it is possible (3) to synchronize the speech sections of the audio and video streams in time. Another distinctive feature is the presence of a modality combining module, which makes it possible (1) to combine informative data from audio and video signals and (2) to adjust the weight of each modality depending on the SNR level, which allows achieving optimal recognition accuracy even in acoustically noisy conditions.
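
The SNR-dependent weighting of modalities mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear SNR-to-weight mapping, the 0 dB and 30 dB bounds, and the names audio_weight, fuse_scores and recognize are assumptions made only for illustration. The sketch combines per-class log-likelihoods from an audio and a video recognizer as a weighted sum, giving the audio stream more weight as the SNR rises.

```python
def audio_weight(snr_db, low_db=0.0, high_db=30.0):
    """Map an SNR estimate (dB) to an audio stream weight in [0, 1].

    Hypothetical linear mapping: at or below low_db only video is trusted,
    at or above high_db only audio is trusted.
    """
    if high_db <= low_db:
        raise ValueError("high_db must be greater than low_db")
    w = (snr_db - low_db) / (high_db - low_db)
    return min(1.0, max(0.0, w))


def fuse_scores(audio_loglik, video_loglik, snr_db):
    """Weighted sum of per-class log-likelihoods from the two modalities."""
    lam = audio_weight(snr_db)
    # Only classes scored by both recognizers can be fused.
    classes = audio_loglik.keys() & video_loglik.keys()
    return {
        c: lam * audio_loglik[c] + (1.0 - lam) * video_loglik[c]
        for c in classes
    }


def recognize(audio_loglik, video_loglik, snr_db):
    """Return the class with the highest fused score."""
    fused = fuse_scores(audio_loglik, video_loglik, snr_db)
    return max(fused, key=fused.get)


# Toy scores (invented for the example): audio prefers "no", video prefers "yes".
audio = {"yes": -5.0, "no": -2.0}
video = {"yes": -1.0, "no": -6.0}
clean_result = recognize(audio, video, snr_db=30.0)  # audio dominates
noisy_result = recognize(audio, video, snr_db=0.0)   # video dominates
```

In a real system the weights would typically be tuned on held-out data for each noise level rather than fixed by a linear rule.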
