This paper presents speech signal modeling techniques that are well suited to high-performance, robust isolated word recognition. Speech is encoded by a discrete cosine transform of its spectra after several preprocessing steps. Temporal information is then explicitly encoded into the feature set. We present a new technique for incorporating this temporal information as a function of temporal position within each word. We tested features computed with this method on an alphabet recognition task based on the ISOLET database. The HTK toolkit was used to implement the isolated word recognizer with whole-word HMMs. The best result, obtained with 50 features for speaker-independent alphabet recognition, was 98.0%. Gaussian noise was added to the original speech to simulate a noisy environment. We achieved a recognition accuracy of 95.8% at an SNR of 15 dB. We also tested our recognizer with simulated telephone-quality speech by adding noise to and band-limiting the original speech. For this "telephone" speech, our recognizer achieved 89.6% recognition accuracy. The recognizer was also tested in a speaker-dependent mode, resulting in 97.4% accuracy on test data.
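The noisy-speech condition mentioned above (Gaussian noise at an SNR of 15 dB) can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: the SNR is defined from the average power over the whole utterance, the noise is white, and the "speech" is a synthetic 440 Hz tone sampled at 16 kHz. The function names `add_gaussian_noise` and `measure_snr_db` are hypothetical.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=None):
    """Return signal plus white Gaussian noise scaled to a target SNR in dB.

    Assumption: SNR is the ratio of average signal power to average noise
    power over the whole utterance; the paper may use another convention.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measure_snr_db(clean, noisy):
    """Measure the SNR actually obtained, given the clean and noisy signals."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical stand-in for a speech waveform: a 1 s, 440 Hz tone at 16 kHz.
fs = 16000
clean = [0.5 * math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(fs)]
noisy = add_gaussian_noise(clean, snr_db=15.0, seed=0)

print(f"measured SNR: {measure_snr_db(clean, noisy):.2f} dB")
```

Because the noise is random, the measured SNR only approximates the target; over a one-second signal the deviation should be a small fraction of a decibel.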
INTRODUCTION
Continuous speech recognition systems have been developed for many real-world applications, often using commercial low-cost speech recognition software. However, high-performance, robust isolated word recognition, particularly for the letters of the alphabet and for digits, is still useful for many applications such as recognizing telephone numbers, spelled names and addresses, and ZIP codes.
Because of these potential applications, many isolated word recognizers are optimized for the digits, the alphabet, or both (alphadigit). The alphabet recognition task is particularly difficult because the alphabet contains many highly confusable letters: for example, the letters of the E-set (b, c, d, e, g, p, t, v, z) are acoustically very similar to one another, as are the letters of the (m, n) pair. Also, since language models cannot generally be used, the alphabet recognition task is a small, challenging, and potentially useful problem for evaluating acoustic signal modeling and word recognition methods.
Several techniques have been proposed to improve isolated word recognition systems. For example, the best result in speaker-independent alphabet recognition was obtained using a multi-tier phoneme-based Hidden Markov Model (HMM) recognizer [5]. The disadvantages of phoneme-based HMM recognizers are their system complexity and the requirement that the phonetic transcription of the training words be known.
The main contribution of this paper is a method for isolated word recognition that is easier to implement than the state-of-the-art systems introduced to date and that gives better performance than any of these previously introduced systems.
The ISOLET database [1] was used for all experiments reported in this paper. This LDC-distributed database was intended for evaluation of isolated word recognizers and it has therefore ...