2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100)
DOI: 10.1109/icassp.2000.859318

Audio-visual intent-to-speak detection for human-computer interaction

Cited by 23 publications (11 citation statements)
References 8 publications
“…Progress in addressing some or all of these questions can also benefit other areas where joint audio and visual speech processing is suitable [139], such as speaker identification and verification [49], [66], [109], [136], [140][141][142], visual text-to-speech [143][144][145], speech event detection [146], video indexing and retrieval [147], speech enhancement [102], [104], coding [148], signal separation [149], [150], and speaker localization [151][152][153]. Improvements in these areas will result in more robust and natural human-computer interaction.…”
Section: Summary and Discussion (mentioning)
confidence: 99%
“…In this case, visual information could be very useful since it is completely independent of the acoustic environment. For instance, in a previous study, de Cuetos et al. (2000) used a basic Visual Voice Activity Detector (V-VAD) for detecting a speaker's speech activity in front of a computer. For this, either specific lip parameters or the average luminance of the mouth picture can be used (Iyengar and Neti, 2001).…”
Section: Application to Automatic Voice Activity Detection (mentioning)
confidence: 99%
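The luminance-based visual voice activity detection mentioned in the excerpt above can be sketched roughly as follows. This is a minimal illustrative sketch, not the actual method of de Cuetos et al. (2000) or Iyengar and Neti (2001): the function name, the fixed mouth region, the window length, and the threshold are all assumptions, and a real system would obtain the mouth region from a face/lip tracker.

```python
import numpy as np

def mouth_luminance_vad(frames, mouth_box, window=15, threshold=2.0):
    """Flag speech activity from variation in average mouth-region luminance.

    frames    : iterable of grayscale images as 2-D numpy arrays (0-255)
    mouth_box : (top, bottom, left, right) pixel bounds of the mouth ROI,
                assumed to come from an upstream face/mouth tracker
    window    : number of frames over which luminance variation is measured
    threshold : luminance std-dev above which a frame is marked "speaking"
    """
    top, bottom, left, right = mouth_box
    # Average luminance of the mouth region, one value per frame.
    lum = np.array([f[top:bottom, left:right].mean() for f in frames])

    active = np.zeros(len(lum), dtype=bool)
    for i in range(len(lum)):
        seg = lum[max(0, i - window + 1): i + 1]
        # Lip motion while speaking modulates mouth-region brightness, so a
        # high short-term standard deviation is taken as speech activity.
        active[i] = seg.std() > threshold
    return active

# Usage example with synthetic frames (64x64, mouth ROI in the lower half).
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, size=(64, 64)).astype(float) for _ in range(100)]
speaking = mouth_luminance_vad(frames, mouth_box=(40, 60, 16, 48))
print(speaking[:10])
```

The alternative mentioned in the excerpt, using specific lip parameters (e.g., mouth opening height and width) instead of raw luminance, would follow the same thresholding structure but require explicit lip-contour tracking.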
“…Nevertheless, even with this complexity-effectiveness trade-off, numerous systems have attempted to fuse multi-modal information for a variety of applications. Examples include audio-visual speech recognition systems employing a single camera and a microphone, which achieve higher speech recognition accuracy and greater robustness to noise [5][6][7][8][9][10][11][12][13][14][15]. Other applications include audio-visual sound localization [2,3], where a speaker is localized visually using multiple cameras and acoustically using multiple microphones.…”
Section: Literature Review (mentioning)
confidence: 99%