The 1998 HTK system for transcription of conversational telephone speech

Hain, Thomas; Woodland, Philip C.; Niesler, Thomas; Whittaker, Edward W. D.

doi:10.1109/icassp.1999.758061

Cited by 50 publications

(42 citation statements)

References 6 publications

“…The most popular VTLN technique performs speaker-specific piecewise linear frequency scaling of the Mel-Frequency Cepstral Coefficients (MFCCs) [3]. The overall improvement in the word error rate (WER) obtained with this technique is usually on the order of 0.6% as compared to results obtained without VTLN.…”

Section: Introduction (mentioning)

confidence: 99%

Acoustic-phonetic speech parameters for speaker-independent speech recognition

Deshmukh

Espy‐Wilson

Juneja

2002

IEEE International Conference on Acoustics Speech and Signal Processing

Coping with inter-speaker variability (i.e., differences in the vocal tract characteristics of speakers) is still a major challenge for Automatic Speech Recognizers. In this paper, we discuss a method that compensates for differences in speaker characteristics. In particular, we demonstrate that when a continuous density hidden Markov model based system is used as the back-end, a Knowledge-Based Front End (KBFE) can outperform the traditional Mel-Frequency Cepstral Coefficients (MFCCs), particularly when there is a mismatch in the gender and ages of the subjects used to train and test the recognizer.

“…Acoustic models are phonetic decision tree state clustered triphone models with standard left-to-right 3-state topology. They were obtained using standard HTK maximum likelihood training procedures (see for example [11]). The system uses approximately 7000 states where each state is represented as a mixture of 16 Gaussians.…”

Section: Acoustic Modelling (mentioning)

confidence: 99%

“…Speaker adaptive training is performed in the form of vocal tract length normalisation (VTLN) both in training and test. Warp factors are estimated using a parabolic search procedure, a piecewise linear warping function and a maximum likelihood criterion [11]. Speaker adaptation is performed using maximum likelihood linear regression (MLLR) of the means and variances [8].…”

Section: Acoustic Modelling (mentioning)

confidence: 99%
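
The citation statement above mentions a piecewise linear warping function for VTLN without spelling it out. The following is a minimal sketch of one common form of such a function, not the cited system's implementation: frequencies below a cut-off are scaled by the warp factor, and the remaining band is mapped linearly so that the upper band edge stays fixed. The function name, the 4000 Hz band edge (assuming 8 kHz telephone speech) and the 0.8 cut-off fraction are illustrative assumptions.

```python
def piecewise_linear_warp(freq_hz, alpha, f_max_hz=4000.0, cut_fraction=0.8):
    """Warp a frequency with a two-segment piecewise linear function.

    alpha        -- speaker-specific warp factor (1.0 means no warping)
    f_max_hz     -- upper band edge; assumed 4000 Hz (hypothetical default)
    cut_fraction -- fraction of f_max_hz where the second segment starts;
                    0.8 is an assumed value, not taken from the source
    """
    if not 0.0 <= freq_hz <= f_max_hz:
        raise ValueError("freq_hz must lie in [0, f_max_hz]")
    f_cut = cut_fraction * f_max_hz
    if alpha <= 0.0 or alpha * f_cut >= f_max_hz:
        raise ValueError("alpha is out of range for this cut-off")

    # Segment 1: plain linear scaling by the warp factor.
    if freq_hz <= f_cut:
        return alpha * freq_hz

    # Segment 2: straight line from (f_cut, alpha * f_cut) to
    # (f_max_hz, f_max_hz), so the band edge maps onto itself.
    slope = (f_max_hz - alpha * f_cut) / (f_max_hz - f_cut)
    return alpha * f_cut + slope * (freq_hz - f_cut)


if __name__ == "__main__":
    # Print the mapping for a few frequencies and warp factors.
    for alpha in (0.9, 1.0, 1.1):
        warped = [piecewise_linear_warp(f, alpha) for f in (500.0, 2000.0, 3600.0)]
        print(alpha, warped)
```

In a full system, a function like this would be applied to the filterbank centre frequencies, and the warp factor would be chosen per speaker by maximising the likelihood of that speaker's data; the search procedure itself is not shown here.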

The Development of the AMI System for the Transcription of Speech in Meetings

Hain

Burget

Dines

et al.

2006

Machine Learning for Multimodal Interaction

The automatic processing of speech collected in conference style meetings has attracted considerable interest with several large scale projects devoted to this area. This paper describes the development of a baseline automatic speech transcription system for meetings in the context of the AMI (Augmented Multiparty Interaction) project. We present several techniques important to processing of this data and show the performance in terms of word error rates (WERs). An important aspect of transcription of this data is the necessary flexibility in terms of audio pre-processing. Real world systems have to deal with flexible input, for example by using microphone arrays or randomly placed microphones in a room. Automatic segmentation and microphone array processing techniques are described and the effect on WERs is discussed. The system and its components presented in this paper yield competitive performance and form a baseline for future research in this domain.

“…Within this category we find techniques such as RASTA-PLP (Hermansky and Morgan (1994)), CMN (Cepstral Mean Normalisation) (Furui (1981)), SCMN (Segmental Cepstral Mean Normalisation) (Viikki and Laurila (1998)), VTLN (Vocal Tract Length Normalisation) (Hain et al (1999)) or histogram equalization (de la Torre et al (2005)). …”

Section: Introduction (mentioning)

confidence: 99%

The synergy between bounded-distance HMM and spectral subtraction for robust speech recognition

Vicente-Peña

Díaz-de-María

Kleijn

2010

Speech Communication

Additive noise generates important losses in automatic speech recognition systems. In this paper, we show that one of the causes contributing to these losses is the fact that conventional recognisers take into consideration feature values that are outliers. The method that we call bounded-distance HMM is a suitable method to avoid that outliers contribute to the recogniser decision. However, this method just deals with outliers, leaving the remaining features unaltered. In contrast, spectral subtraction is able to correct all the features at the expense of introducing some artifacts that, as shown in the paper, cause a larger number of outliers. As a result, we find that bounded-distance HMM and spectral subtraction complement each other well. A comprehensive experimental evaluation was conducted, considering several well-known ASR tasks (of different complexities) and numerous noise types and SNRs. The achieved results show that the suggested combination generally outperforms both the bounded-distance HMM and spectral subtraction individually. Furthermore, the obtained improvements, especially for low and medium SNRs, are larger than the sum of the improvements individually obtained by bounded-distance HMM and spectral subtraction.
