2010
DOI: 10.1109/tasl.2009.2023162
Enhanced Phone Posteriors for Improving Speech Recognition Systems

Cited by 34 publications (31 citation statements)
References 26 publications
“…If this second-stage net is trained on several neighboring frames, similar to the MFCC-based net, then it is able to correct some of the errors of the lower-stage net(s) with the help of the long-term context. Hence, applying such a second-stage network is already useful in itself, as was recently shown in [15] or [16]. We will compare our earlier results with two 2-stage configurations: the first is trained only on the MFCC-based posteriors, while the second combines the MFCC-based and the 2D-DCT-based probabilities.…”
Section: Noisy Speech Experiments
confidence: 80%
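The frame-stacking step this quote describes, feeding a second-stage net several neighboring posterior vectors from the first-stage net, can be sketched as follows. This is a hypothetical NumPy illustration; the function name `stack_context`, the 40-phone inventory, and the ±4-frame context are assumptions, not details from the cited work:

```python
import numpy as np

def stack_context(posteriors, context=4):
    """Stack each frame with `context` neighbors on both sides
    (edges padded by repetition), producing the input rows for a
    second-stage posterior-enhancing network."""
    T, P = posteriors.shape
    padded = np.pad(posteriors, ((context, context), (0, 0)), mode="edge")
    # Each row concatenates 2*context+1 consecutive posterior vectors.
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])

# Toy first-stage output: 100 frames of 40 phone posteriors.
first_stage = np.random.dirichlet(np.ones(40), size=100)
X = stack_context(first_stage, context=4)
print(X.shape)  # (100, 360): 9 frames x 40 posteriors per training example
```

Each row of `X` would then be a training example for the second-stage network, which can exploit the 9-frame context to correct first-stage errors.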
“…Both physiological and psychoacoustic experimental results indicate that the human brain extracts information from much longer time spans. Technically, the simplest solution for this is to work with larger windows along the time axis: in neural-net-based recognizers it is now standard practice to train the system on 9 or more neighboring MFCC vectors [7,15,16]. However, there is also evidence that the brain processes relatively narrow frequency bands quasi-separately [2,6].…”
Section: Localized Spectro-temporal Features
confidence: 99%
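The quasi-separate narrow-band processing mentioned above can be illustrated by cutting a spectrogram into localized spectro-temporal patches. A minimal sketch, where the function name `band_patches` and all patch sizes are illustrative assumptions rather than the cited paper's actual configuration:

```python
import numpy as np

def band_patches(log_mel, band_width=8, hop=4, frames=9):
    """Cut a log-mel spectrogram (T x F) into localized
    spectro-temporal patches: `frames` consecutive frames by
    `band_width` adjacent mel channels, with bands advancing by `hop`.
    Each patch covers a narrow frequency band over a longer time span."""
    T, F = log_mel.shape
    patches = []
    for t in range(T - frames + 1):
        for f in range(0, F - band_width + 1, hop):
            patches.append(log_mel[t:t + frames, f:f + band_width])
    return np.array(patches)

spec = np.random.randn(50, 24)  # toy 50-frame, 24-channel spectrogram
p = band_patches(spec)
print(p.shape)  # (210, 9, 8): 42 time positions x 5 bands
```

Each such patch could then be processed by its own classifier (or a 2D-DCT, as in the quote's feature set), keeping the frequency bands quasi-separate.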
“…This implies that either the transition models were not considerably hurt by exposure to the temporal randomness in the pseudo-utterances, or that the transition models have only a limited impact on the overall quality of an HMM. While the authors hypothesize that both factors are involved in some way, the latter hypothesis has been supported by studies that went as far as setting all HMM state transition probabilities to a constant while still attaining meaningful performance [e.g., Ketabdar and Bourlard (2010)]. Finally, it is noted that the pseudo-utterances still retain a good portion of the meaningful temporal information through the first- and second-order time derivatives included in the frame feature vectors, and this information is learned by the state models of the HMM system.…”
Section: Temporal Smoothing
confidence: 98%
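As a minimal illustration of the point about constant transition probabilities (my toy sketch, not the cited authors' setup): when every HMM transition gets the same probability, Viterbi decoding is driven entirely by the framewise emission scores, so the path reduces to the per-frame argmax:

```python
import numpy as np

def viterbi(log_emissions, log_trans):
    """Standard log-domain Viterbi decoding."""
    T, S = log_emissions.shape
    delta = log_emissions[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # (prev_state, next_state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emissions[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
em = np.log(rng.dirichlet(np.ones(3), size=6))  # toy 6-frame, 3-state emissions
flat = np.full((3, 3), np.log(1.0 / 3.0))       # all transitions equal
# With constant transitions, the best path is just the framewise argmax:
assert viterbi(em, flat) == list(em.argmax(axis=1))
```

This makes the quoted observation concrete: a flat transition model removes all temporal constraints, yet the emission (state) models alone can still produce a meaningful state sequence.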
“…The origins of this technology go back to the era of shallow networks, which, just like DNNs, were trained on a block of consecutive input vectors. Some authors observed that the posterior estimates obtained can be "enhanced" by training yet another network, this time on a sequence of output vectors coming from the first network [22]. Other authors refer to this approach as "hierarchical modeling" [23][24][25] or the "stacked modeling" method [26].…”
Section: Introduction
confidence: 99%