There are several methods to create visualizations of speech data, but none of them can remove microphone-dependent distortions. In this work we examine Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and the COmprehensive Space Map of Objective Signal (COSMOS) method. To address the lack of microphone independence in PCA, LDA, and COSMOS, we present two methods that reduce the influence of the recording conditions on the visualization. The first is a rigid registration of maps created from identical speakers recorded under different conditions, i.e., with different microphones and at different distances. The second is an extension of the COSMOS method that performs a non-rigid registration during the mapping procedure. As measures for the quality of the visualization, we compute the mapping error introduced by the dimension reduction and the grouping error, defined as the average distance between the representations of the same speaker recorded by different microphones. The best linear method in a leave-one-speaker-out evaluation is PCA combined with rigid registration, with a mapping error of 47% and a grouping error of 18%. The proposed method surpasses this with a mapping error of 24% and a grouping error close to zero.
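As a rough illustration of the linear baseline described above, the following sketch projects speaker features to 2-D with PCA and then rigidly aligns (rotation plus translation, via the Kabsch/Procrustes solution) the map of one recording condition to the map of another. The feature dimensions, speaker count, and the synthetic "microphone effect" are illustrative assumptions, not the paper's data or code.

```python
# Minimal sketch, assuming synthetic speaker features: PCA to 2-D followed by
# a rigid (rotation + translation) registration of two maps of the same
# speakers recorded under different conditions.
import numpy as np

def pca_2d(features):
    """Project speaker feature vectors (n_speakers x n_dims) onto the
    first two principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def rigid_register(source, target):
    """Find the rotation R and translation t that best align the source map
    to the target map; both are n_speakers x 2 arrays with corresponding
    rows (same speakers)."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    h = (source - mu_s).T @ (target - mu_t)
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))        # avoid reflections
    r = vt.T @ np.diag([1.0, d]) @ u.T
    t = mu_t - r @ mu_s
    return (r @ source.T).T + t

# Illustrative data: 20 speakers, 39-dim features, two recording conditions.
rng = np.random.default_rng(0)
clean = rng.normal(size=(20, 39))
distorted = clean + rng.normal(scale=0.3, size=(20, 39))   # "microphone" effect

map_clean, map_distorted = pca_2d(clean), pca_2d(distorted)
aligned = rigid_register(map_distorted, map_clean)

# Grouping error: average distance between the two representations of the
# same speaker, before and after the rigid registration.
before = np.linalg.norm(map_distorted - map_clean, axis=1).mean()
after = np.linalg.norm(aligned - map_clean, axis=1).mean()
print(f"grouping error before: {before:.3f}, after: {after:.3f}")
```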
SUMMARY: In this paper, the development, long-term operation, and portability of a practical ASR application in a real environment are investigated. The target application is a speech-oriented guidance system installed at a local community center. The system has been exposed to ordinary users since November 2002, and more than 300 hours, or more than 700,000 inputs, have been collected over four years. The outcome is a rare example of a large-scale real-environment speech database. A simulation experiment is carried out with this database to investigate how the system's performance improves during the first two years of operation. The purpose is to determine empirically how much real-environment data has to be prepared to build a system with reasonable speech recognition performance and response accuracy. Furthermore, the relative importance of developing the main system components, i.e., the speech recognizer and the response generation module, is assessed. Although the exact figures depend on the system's modeling capacity and domain complexity, the experimental results show that overall performance stagnates after employing about 10-15k utterances for training the acoustic model, 40-50k utterances for training the language model, and 40-50k utterances for compiling the question-and-answer (Q&A) database. The Q&A database was the most important component for improving the system's response accuracy. Finally, the portability of the well-trained first system prototype to a different environment, a local subway station, is investigated. Since collecting and preparing large amounts of real data is impractical in general, only one month of data from the new environment is employed for system adaptation. While the speech recognition component of the first prototype is highly portable, the response accuracy is lower than in the first environment. The main reason is the domain difference between the two systems, since they are installed in different environments. This implies that the behavior of users under real conditions must be taken into account to build a system with high user satisfaction.
Speech recognition and speech-based dialogue are means of realizing communication between humans and robots. In a conventional system setup, a headset or a directional microphone is used to collect speech with a high signal-to-noise ratio (SNR). However, the user must wear a microphone or approach the system closely for interaction. It is therefore preferable to develop a hands-free speech recognition system that enables the user to speak to the system from a distance. To collect speech from distant speakers, a microphone array is usually employed. However, the SNR degrades in a real environment because various kinds of background noise are present besides the user's utterance. This usually decreases speech recognition performance, making reliable speech dialogue impossible. Voice Activity Detection (VAD) is a method to detect the portion of the input signal that contains the user's utterance. If VAD fails, all subsequent processing steps, including speech recognition and dialogue, will not work. Conventional VAD based on amplitude level and zero-crossing count is difficult to apply to hands-free speech recognition, because speech detection often fails at low SNR. This paper proposes a VAD method for hands-free speech recognition based on an acoustic model (AM) of the background noise and the speech recognition algorithm itself. There are always non-speech segments at the beginning and end of each user utterance. The proposed VAD approach compares the likelihoods of phoneme and silence segments in the top recognition hypotheses during decoding. We implemented the proposed method in the open-source speech recognition engine Julius. Experimental results for various SNR conditions show that the proposed method attains higher VAD accuracy and a higher recognition rate than conventional VAD.
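The sketch below illustrates the underlying idea of likelihood-based VAD in a much simplified form: each feature frame is scored against a speech model and a background-noise model, and frames where the speech likelihood wins by a margin are marked as speech. It is not the Julius implementation, and the single-Gaussian models, threshold, and hangover length are illustrative assumptions.

```python
# Minimal sketch of likelihood-comparison VAD (not the paper's decoder-integrated
# method): compare frame likelihoods under a speech model and a noise model.
import numpy as np

def log_gauss(frames, mean, var):
    """Frame-wise log-likelihood under a diagonal-covariance Gaussian."""
    diff = frames - mean
    return -0.5 * np.sum(np.log(2 * np.pi * var) + diff**2 / var, axis=1)

def likelihood_vad(frames, speech_model, noise_model, threshold=0.0, hangover=5):
    """Return a boolean speech/non-speech decision per feature frame.

    frames       : (n_frames, n_dims) acoustic features (e.g. MFCCs)
    speech_model : (mean, var) of a single-Gaussian speech model (assumption)
    noise_model  : (mean, var) of a single-Gaussian noise model (assumption)
    threshold    : decision margin on the log-likelihood ratio
    hangover     : frames kept as speech after the ratio drops, to avoid
                   clipping word endings
    """
    llr = log_gauss(frames, *speech_model) - log_gauss(frames, *noise_model)
    raw = llr > threshold
    decision = raw.copy()
    run = 0
    for i, is_speech in enumerate(raw):
        run = hangover if is_speech else max(run - 1, 0)
        decision[i] = is_speech or run > 0
    return decision

# Illustrative usage with synthetic 13-dimensional features.
rng = np.random.default_rng(1)
noise = rng.normal(0.0, 1.0, size=(50, 13))
speech = rng.normal(1.5, 1.0, size=(80, 13))
frames = np.vstack([noise, speech, noise])

noise_model = (np.zeros(13), np.ones(13))
speech_model = (np.full(13, 1.5), np.ones(13))
flags = likelihood_vad(frames, speech_model, noise_model)
print("detected speech frames:", int(flags.sum()), "of", len(frames))
```

In the paper's setting the comparison is made between phoneme and silence segments of the top hypotheses during decoding rather than against standalone Gaussians, but the decision principle, speech model versus noise/silence model likelihoods, is the same.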