Recognition and understanding of meetings: the AMI and AMIDA projects

Renals, Steve; Hain, Thomas; Bourlard, Hervé

doi:10.1109/asru.2007.4430116

Cited by 111 publications

(75 citation statements)

References 49 publications

“…It is a suitable collection for our experiments since investigation of the global semantic impact of speech recognition error requires reliable reference transcripts for the complete spoken document collection. As is typical for conversational speech, the word error rate for the corpus ranges up to around 40% [8]. We use the speaker turn segmentation provided with the corpus to divide the data into documents.…”

Section: Data (mentioning)

confidence: 99%

“…We use the AMI Meeting Corpus (release 1.4) [8], which consists of 100 hours of multimodal data recorded from scenario-based meetings. Included in the corpus are automatic speech recognition transcripts and human-generated reference transcripts.…”

Section: Data (mentioning)

confidence: 99%

“…The redundancy of language, the tendency of a word to occur in a context containing similar words, compensates for speech recognition error. Research on spoken content retrieval has, however, moved into conversational domains such as interviews [2], lectures [3] and meetings [8]. These domains feature noisy background conditions and wide variation in speaking patterns [1,4].…”

Section: Introduction (mentioning)

confidence: 99%

Investigating the Global Semantic Impact of Speech Recognition Error on Spoken Content Collections

Larson

Tsagkias

et al. 2009

Lecture Notes in Computer Science

Abstract. Errors in speech recognition transcripts have a negative impact on effectiveness of content-based speech retrieval and present a particular challenge for collections containing conversational spoken content. We propose a Global Semantic Distortion (GSD) metric that measures the collection-wide impact of speech recognition error on spoken content retrieval in a query-independent manner. We deploy our metric to examine the effects of speech recognition substitution errors. First, we investigate frequent substitutions, cases in which the recognizer habitually mis-transcribes one word as another. Although habitual mistakes have a large global impact, the long tail of rare substitutions has a more damaging effect. Second, we investigate semantically similar substitutions, cases in which the word spoken and the word recognized do not diverge radically in meaning. Similar substitutions are shown to have slightly less global impact than semantically dissimilar substitutions.

“…Discriminative non-linear feature transformations can provide yet further gains in performance, because the transformation is optimized to reduce the error rate in the context of the decoder (e.g., [18]). Some of the popular non-linear transforms provide an approximately piece-wise linear transform by the inclusion of "region-based" features based on Gaussian posterior probabilities.…”

Section: Introduction (mentioning)

confidence: 99%

Effectiveness of discriminative training and feature transformation for reverberated and noisy speech

Tachioka

Watanabe

Hershey

2013

2013 IEEE International Conference on Acoustics, Speech and Signal Processing

Automatic speech recognition in the presence of non-stationary interference and reverberation remains a challenging problem. The 2nd 'CHiME' Speech Separation and Recognition Challenge introduces a new and difficult task with time-varying reverberation and non-stationary interference including natural background speech, home noises, or music. This paper establishes baselines using state-of-the-art ASR techniques such as discriminative training and various feature transformation on the middle-vocabulary sub-task of this challenge. In addition, we propose an augmented discriminative feature transformation that introduces arbitrary features to a discriminative feature transformation. We present experimental results showing that discriminative training of model parameters and feature transforms is highly effective for this task, and that the augmented feature transformation provides some preliminary benefits. The training code will be released as an advanced ASR baseline.

“…Speaker adaptation methods such as SAT and fMLLR were originally developed for decreasing the variation between speakers, but they are also known to improve the ASR accuracy in noisy environments by adapting to unknown and changing noise conditions in effect, performing noise adaptive training [12], [25], [39]. Discriminative non-linear feature transformations can provide yet further gains in performance, because the feature transformation is optimized to reduce directly the error rates of the decoder [33].…”

Section: Introduction (mentioning)

confidence: 99%

Prior-based Binary Masking and Discriminative Methods for Reverberant and Noisy Speech Recognition Using Distant Stereo Microphones

Tachioka

Watanabe

Roux

et al. 2017

Journal of Information Processing

Reverberant and noisy automatic speech recognition (ASR) using distant stereo microphones is a very challenging, but desirable scenario for home-environment speech applications. This scenario can often provide prior knowledge such as physical information about the sound sources and the environment in advance, which may then be used to reduce the influence of the interference. We propose a method to enhance the binary masking algorithm by using prior distributions of the time difference of arrival. This paper also validates state-of-the-art ASR techniques that include various discriminative training and feature transformation methods. Furthermore, we develop an efficient method to combine discriminative language modeling and minimum Bayes risk decoding in the ASR post-processing stage. We also investigate the effectiveness of this method when used for reverberated and noisy ASR of deep neural networks (DNNs) as well as when used in systems that combine multiple DNNs using different features. Experiments on the medium vocabulary sub-task of the second CHiME challenge show that the system submitted to the challenge achieved a 26.86% word error rate (WER); moreover, the DNN system with the discriminative training, speaker adaptation and system combination achieves a 20.40% WER.
