2006
DOI: 10.1109/tsa.2005.860354

Mask estimation for missing data speech recognition based on statistics of binaural interaction

Abstract: This paper describes a perceptually motivated computational auditory scene analysis (CASA) system that combines sound separation according to spatial location with the "missing data" approach for robust speech recognition in noise. Missing data time-frequency masks are created using probability distributions based on estimates of interaural time and level differences (ITD and ILD) for mixed utterances in reverberated conditions; these masks indicate which regions of the spectrum constitute reliable ev…

Citations: cited by 52 publications (49 citation statements)
References: references 22 publications (28 reference statements)
“…Other systems localize the source in each band probabilistically and then combine probabilities across frequency by assuming statistical independence. Nonparametric modeling in this vein [9], [20], [21] employs histograms of interaural parameters collected over a large amount of training data, which can be compared to the observation and to one another when normalized properly. While [9], [21] collect histograms of per-band interaural time differences, [20] collects histograms of interaural phase difference, which avoids multimodality and facilitates the analysis of moments.…”
Section: A Background
mentioning
confidence: 99%
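The combination step described in this statement can be sketched as follows. This is a minimal illustration, not the method of any cited paper: the function name `band_log_likelihoods`, the array layout, and the toy histograms are all assumptions made for the example. Each candidate location has one trained histogram per frequency band; the observed bin in each band is looked up, and the per-band probabilities are multiplied (summed in the log domain) under the independence assumption.

```python
import numpy as np

def band_log_likelihoods(obs_bins, histograms, eps=1e-12):
    """Joint log-likelihood per candidate location, assuming independent bands.

    obs_bins:   (n_bands,) int array, histogram bin observed in each band
    histograms: (n_locations, n_bands, n_bins) array of training counts
    returns:    (n_locations,) array of summed log-probabilities
    """
    # Normalize counts so every (location, band) histogram sums to one.
    probs = histograms / histograms.sum(axis=-1, keepdims=True)
    bands = np.arange(histograms.shape[1])
    # Probability of the observed bin in every band, for every location.
    per_band = probs[:, bands, obs_bins]  # shape (n_locations, n_bands)
    # Independence across bands: product of probabilities = sum of logs.
    return np.log(per_band + eps).sum(axis=1)

# Hypothetical toy data: 2 candidate locations, 3 bands, 4 bins per histogram.
hists = np.array([
    [[8, 1, 1, 0], [7, 2, 1, 0], [9, 1, 0, 0]],  # location 0 favors bin 0
    [[0, 1, 1, 8], [0, 1, 2, 7], [0, 0, 1, 9]],  # location 1 favors bin 3
], dtype=float)
observed = np.array([0, 0, 1])
scores = band_log_likelihoods(observed, hists)
best = int(np.argmax(scores))  # index of the most likely location
```

The small `eps` keeps empty histogram bins from producing a log of zero; a real system would more likely smooth the histograms during training.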
“…We therefore generate a weight associated with that measures the change in signal energy over time. First, we define a recursive method to measure the average signal energy in both left and right channels as follows: (4) Here and , where denotes the time constant for integration and is the sampling frequency of the signals. We set ms and kHz.…”
Section: Cue Weighting
mentioning
confidence: 99%
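Equation (4) and its symbols did not survive in the excerpt above, so the following is only a sketch of one common way to average signal energy recursively (a first-order leaky integrator), not a reconstruction of the citing paper's formula. The smoothing coefficient `alpha = exp(-1 / (tau * fs))`, the function name, and the numeric values in the usage lines are assumptions chosen for illustration.

```python
import math

def recursive_energy(samples, tau_s, fs_hz):
    """Running average of signal energy with a first-order recursion.

    tau_s: integration time constant in seconds (hypothetical value below)
    fs_hz: sampling frequency in Hz (hypothetical value below)
    """
    # Assumed mapping from time constant to smoothing coefficient.
    alpha = math.exp(-1.0 / (tau_s * fs_hz))
    energy = 0.0
    out = []
    for x in samples:
        # New estimate = decayed old estimate + weighted instantaneous energy.
        energy = alpha * energy + (1.0 - alpha) * x * x
        out.append(energy)
    return out

# Placeholder parameters, not the values used in the citing paper.
trace = recursive_energy([1.0] * 200, tau_s=0.01, fs_hz=1000.0)
```

The same recursion would be run separately on the left and right channels; for a constant-amplitude input the estimate rises smoothly toward the true energy.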
“…Time-frequency masking techniques have been proposed to deal with segregation in reverberant environments [4], [5]. Recent approaches have relied on probabilistic frameworks that jointly perform source localization and time-frequency masking to segregate multiple sources [6]-[8].…”
mentioning
confidence: 99%
“…Numerous algorithms have been proposed for developing the values of M [n, k] based on the inputs (e.g. [6,7,8,9,11,12,13]) and other variations are possible in which M [n, k] is a continuous function of the inputs rather than binary. In the algorithms considered, the mask M [n, k] is typically based on the cell-by-cell comparisons of the left and right input signals; however, T-F masking is also widely applied to mono audio to improve signal quality for ASR [14,15,16] and for human intelligibility [17,18].…”
Section: Time-frequency Masking
mentioning
confidence: 99%
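A cell-by-cell binary mask M[n, k] of the kind described in this statement can be sketched as below. The decision rule is an assumption made for the example (keep a cell when the interaural level difference is within a threshold, i.e. the energy appears to come from roughly straight ahead); the cited algorithms use a variety of rules. The names `binary_mask` and `max_ild_db`, the threshold, and the toy magnitudes are hypothetical.

```python
import numpy as np

def binary_mask(left_stft, right_stft, max_ild_db=3.0, eps=1e-12):
    """Return M[n, k] = 1 where |ILD| <= max_ild_db, else 0.

    left_stft, right_stft: (n_frames, n_bins) arrays of STFT coefficients
    """
    # Interaural level difference in dB for every time-frequency cell.
    ild_db = 20.0 * np.log10((np.abs(left_stft) + eps) / (np.abs(right_stft) + eps))
    return (np.abs(ild_db) <= max_ild_db).astype(float)

# Hypothetical 2-frame, 2-bin example (real-valued magnitudes for brevity).
left = np.array([[1.0, 1.0], [2.0, 0.1]])
right = np.array([[1.0, 4.0], [2.0, 1.0]])
mask = binary_mask(left, right)
# Apply the mask to the average of the two channels.
masked = mask * 0.5 * (left + right)
```

Replacing the hard threshold with a smooth function of `ild_db` would give the continuous-valued variant the statement mentions.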
“…Results of previous studies using these techniques (e.g. [6,7,8,9,10,11,12]) suggest the following observations (among others): While T-F masking techniques are typically well motivated, there has been little formal mathematical analysis of them, with performance typically expressed in terms of secondary statistics such as the accuracy of automatic speech recognition (ASR) systems. While it is true that algorithms developed to improve ASR recognition accuracy must be evaluated in terms of ASR performance, we also believe that further mathematical analysis and comparison to linear beamforming is potentially beneficial, as speech recognition experiments tend to … [Footnote in the citing paper: This work has been supported by the National Science Foundation (Grant IIS-I0916918) and the Cisco Corporation (Grant 570877).]…”
Section: Introduction
mentioning
confidence: 99%