2004
DOI: 10.1155/s1110865704402303

Detection and Separation of Speech Event Using Audio and Video Information Fusion and Its Application to Robust Speech Interface

Abstract:

A method of detecting speech events in a multiple-sound-source condition using audio and video information is proposed. To detect speech events, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. The inference results of the Bayesian network give the time and location of speech events. The information on the detected speech events is then utilized in the robust speech interface. A maximum likelihood adaptiv…

Cited by 26 publications (13 citation statements)
References 14 publications
“…Signal processing technique uses MUSIC spectrum method and fusion of video by Bayesian network, and it reduces environment noise by using ML beamforming (details are described in [17,18]). …”
Section: Noise Robust Speech Recognition
Type: mentioning
confidence: 99%
“…Bayesian networks are a way of modeling a joint probability distribution of multiple random variables. In [10], a Bayesian network was used to detect the time and position of speech events by analyzing audio and video data. The gained information was then utilized to robustly recognize and separate speech signals in noisy and reverberant environments.…”
Section: Introduction
Type: mentioning
confidence: 99%
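
The statement above describes a Bayesian network that fuses audio and video evidence to decide when and where a speech event occurs. As a rough illustration only, the sketch below uses the smallest such network: one hidden variable (speech event at a given location and time frame) with two observations (a sound-localization hit and a person-tracking hit) that are assumed conditionally independent given the hidden variable. The function name, the network structure, and every probability value are hypothetical and are not taken from the paper.

```python
def speech_event_posterior(audio_hit, video_hit, prior=0.1,
                           p_audio=(0.2, 0.9), p_video=(0.3, 0.95)):
    """Return P(speech event | audio_hit, video_hit) for one location and frame.

    Hypothetical parameters (illustrative values, not from the paper):
      prior   = P(S=1), prior probability of a speech event
      p_audio = (P(A=1 | S=0), P(A=1 | S=1)), microphone-array detection rates
      p_video = (P(V=1 | S=0), P(V=1 | S=1)), person-tracking detection rates
    """
    def likelihood(hit, p_hit):
        # Probability of the observed value given the detection rate.
        return p_hit if hit else 1.0 - p_hit

    # Joint probability of the observations with S=1 and with S=0,
    # using the assumed conditional independence of audio and video.
    joint_speech = (prior
                    * likelihood(audio_hit, p_audio[1])
                    * likelihood(video_hit, p_video[1]))
    joint_no_speech = ((1.0 - prior)
                       * likelihood(audio_hit, p_audio[0])
                       * likelihood(video_hit, p_video[0]))

    # Bayes' rule: normalize over both states of the hidden variable.
    return joint_speech / (joint_speech + joint_no_speech)


if __name__ == "__main__":
    for audio in (True, False):
        for video in (True, False):
            p = speech_event_posterior(audio, video)
            print(f"audio={audio!s:5} video={video!s:5} -> P(speech)={p:.3f}")
```

With these made-up numbers, the posterior is high only when both modalities agree, which is the intuition behind combining a microphone array with visual tracking: either cue alone is too easily triggered by noise sources or silent people.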
“…The introduced tracking algorithm is solely based on color distributions to identify and track moving objects in a video sequence. It is a robust technique more flexible than the background subtraction method [10] and well-suited for abrupt changes in the camera position as well as for alterations in the environment [14].…”
Section: Introduction
Type: mentioning
confidence: 99%
“…Most existing methods for speaker detection are realized by combining techniques of sound localization via a microphone array and human tracking via background subtraction by using coupled Hidden Markov Models (HMMs) or Dynamic Bayesian Networks (DBNs) [11,2]. However, because of the spatial resolution of the microphone array, these methods can become ineffective in situations where speakers are physically close to each other.…”
Section: Introduction
Type: mentioning
confidence: 99%