Knowing who to listen to in speech recognition: visually guided beamforming

Bub, U.; Hunke, M.; Waibel, Alex

doi:10.1109/icassp.1995.479827

Cited by 16 publications

(11 citation statements)

References 6 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Progress in addressing some or all of these questions can also benefit other areas where joint audio and visual speech processing is suitable [139], such as speaker identification and verification [49], [66], [109], [136], [140][141][142], visual text-to-speech [143][144][145] speech event detection [146] video indexing and retrieval [147], speech enhancement [102], [104], coding [148], signal separation [149], [150], and speaker localization [151][152][153]. Improvements in these areas will result in more robust and natural humancomputer interaction.…”

Section: Summary and Discussionmentioning

confidence: 99%

Recent advances in the automatic recognition of audiovisual speech

et al. 2003

View full text Add to dashboard Cite

Abstract-Visual speech information from the speaker's mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability into the human computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: First, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and subsequently, audio-visual speech integration. On the later topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and incorporating modality reliability estimates to the bimodal recognition process. We also briefly touch upon the issue of audiovisual speaker adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small-to largevocabulary recognition tasks, recorded at both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, however less so for visually challenging environments and large vocabulary tasks.

show abstract

Section: Summary and Discussionmentioning

confidence: 99%

Recent advances in the automatic recognition of audiovisual speech

et al. 2003

View full text Add to dashboard Cite

show abstract

“…A reliable face and mouth tracker could provide the tracking necessary for such a "lipreading" system. It has been shown that a more accurate localization in space can be delivered visually than acoustically [2].…”

Section: Introductionmentioning

confidence: 99%

Tracking Human Faces in Real-Time,

Yang¹,

Waibel²

1995

View full text Add to dashboard Cite

“…16 The clear geometrical representation of the problem makes it a favorite feature to be used when approaching such a task by a machine setup. 9,11,12,[14][15][16][17][18] Another cue known to have notable importance in human dimensional hearing is the interaural level differences ͑ILDs͒. Surprisingly ILDs have seldom been used in actual system implementations because they are believed to have unfavorable frequency dependence and unreliability.…”

Section: A Sound Localization By Machinementioning

confidence: 98%

“…[9][10][11][12] In other studies a human model has been followed to some degree, resulting in constraints in applicability and limited accuracy. 13 A significant amount of work has been devoted to devices with a limited functionality ͑e.g., constrained to localization in a single half-plane while still using large sensor structures͒ [12][13][14] or the help of a nonacoustical modality has been used ͑e.g., vision͒. 14 In contrast to large, fixed sensor arrays for special situations and environments, this work concentrates on a compact, mobile sensor array that is suited for a mobile robot to localize 3D sound sources with moderate accuracy.…”

Section: A Sound Localization By Machinementioning

confidence: 99%

“…13 A significant amount of work has been devoted to devices with a limited functionality ͑e.g., constrained to localization in a single half-plane while still using large sensor structures͒ [12][13][14] or the help of a nonacoustical modality has been used ͑e.g., vision͒. 14 In contrast to large, fixed sensor arrays for special situations and environments, this work concentrates on a compact, mobile sensor array that is suited for a mobile robot to localize 3D sound sources with moderate accuracy. It can be positioned arbitrarily in space while being capable of identifying the relative position of an arbitrarily located sound source.…”

Section: A Sound Localization By Machinementioning

confidence: 99%

See 1 more Smart Citation

Three-dimensional sound localization from a compact non-coplanar array of microphones using tree-based learning

Weng

Guentchev

2001

The Journal of the Acoustical Society of America

View full text Add to dashboard Cite

One of the various human sensory capabilities is to identify the direction of perceived sounds. The goal of this work is to study sound source localization in three dimensions using some of the most important cues the human uses. In an attempt to satisfy the requirements of portability and miniaturization in robotics, this approach employs a compact sensor structure that can be placed on a mobile platform. The objective is to estimate the relative sound source position in three-dimensional space without imposing excessive restrictions on its spatio-temporal characteristics and the environment structure. Two types of features are considered, interaural time and level differences. Their relative effectiveness for localization is studied, as well as a practical way of using these complementary parameters. A two-stage procedure was used. In the training stage, sound samples are produced from points with known coordinates and then are stored. In the recognition stage, unknown sounds are processed by the trained system to estimate the 3D location of the sound source. Results from the experiments showed under +/-3 degrees in average angular error and less than +/-20% in average radial distance error.

show abstract

Knowing who to listen to in speech recognition: visually guided beamforming

Cited by 16 publications

References 6 publications

Recent advances in the automatic recognition of audiovisual speech

Recent advances in the automatic recognition of audiovisual speech

Tracking Human Faces in Real-Time,

Three-dimensional sound localization from a compact non-coplanar array of microphones using tree-based learning

Contact Info

Product

Resources

About