In this paper we show how recognition performance in automated speech perception can be significantly improved by additional lipreading, so-called "speechreading". We demonstrate this on an extension of an existing state-of-the-art speech recognition system, a modular MS-TDNN. The acoustic and visual speech data are preclassified in two separate front-end phoneme TDNNs and combined into acoustic-visual hypotheses for the Dynamic Time Warping algorithm. This is shown on a connected-word recognition problem, the notoriously difficult letter-spelling task. With speechreading we could reduce the error rate to as little as half of the error rate of the pure acoustic recognition.
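For concreteness, one plausible way to realize the combination step is a weighted log-linear fusion of the per-frame phoneme scores from the two front-end nets, whose output can then serve as the local match cost for Dynamic Time Warping. The sketch below is illustrative only; the array shapes and the fusion weight `lam` are assumptions, not details taken from the paper.

```python
import numpy as np

def fuse_phoneme_scores(acoustic_post, visual_post, lam=0.7):
    """Fuse per-frame phoneme posteriors from the acoustic and the
    visual front-end TDNN into one acoustic-visual score matrix.

    acoustic_post, visual_post: (T, P) arrays of per-frame phoneme
    posteriors (T frames, P phonemes), assumed time-aligned.
    lam: hypothetical fusion weight favoring the acoustic stream.
    """
    eps = 1e-10  # guard against log(0)
    # Weighted log-linear combination; the fused log-scores can serve
    # as the local match cost inside Dynamic Time Warping.
    return lam * np.log(acoustic_post + eps) + (1.0 - lam) * np.log(visual_post + eps)
```

A fixed weight is the simplest choice; making `lam` depend on per-frame reliability (e.g., acoustic noise level) is a natural refinement of the same scheme.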
Verbmobil, a German research project, aims at machine translation of spontaneous speech input. The ultimate goal is the development of a portable machine translator that will allow people to negotiate in their native language. Within this project the University of Karlsruhe has developed a speech recognition engine that has been evaluated on a yearly basis during the project and shows very promising word accuracy results on large-vocabulary spontaneous speech. In this paper we introduce the Janus Speech Recognition Toolkit underlying the speech recognizer. The main new contributions to the acoustic modeling part of our 1996 evaluation system (speaker normalization, channel normalization, and polyphonic clustering) are discussed and evaluated. Besides the acoustic models, we delineate the different language models used in our evaluation system: word trigram models interpolated with class-based models, and a separate spelling language model. As a result of using the toolkit and integrating all these parts into the recognition engine, the word error rate on the German Spontaneous Scheduling Task (GSST) could be decreased from a 30% word error rate in 1995 to 13.8% in 1996.
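As background for the interpolated language model, the standard construction combines the two estimates linearly, and a class-based model factors a word's probability through its class: P(w | h) = P(w | c(w)) * P(c(w) | c(h)). The sketch below is a minimal illustration under assumed interfaces (`p_trigram`, `p_class_ngram`, `p_word_given_class`, and `word2class` are hypothetical callables and tables, not the toolkit's actual API), and the weight `lam` would in practice be tuned on held-out data.

```python
def class_lm_prob(word, history, p_word_given_class, p_class_ngram, word2class):
    """Class-based estimate: P(w | h) = P(w | c(w)) * P(c(w) | c(h))."""
    c = word2class[word]
    class_history = tuple(word2class[w] for w in history)
    return p_word_given_class(word, c) * p_class_ngram(c, class_history)

def interpolated_prob(word, history, p_trigram, p_class, lam=0.8):
    """Linear interpolation of a word trigram model with a class-based
    model: P(w | h) = lam * P_tri(w | h) + (1 - lam) * P_class(w | h).
    lam is a hypothetical weight, typically estimated on held-out data
    (e.g., via EM)."""
    return lam * p_trigram(word, history) + (1.0 - lam) * p_class(word, history)
```

The class-based term smooths trigrams that were rarely or never observed, which is why the interpolation helps on sparse spontaneous-speech training data.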
The performance of the error backpropagation (BP) and ID3 learning algorithms was compared on the task of mapping English text to phonemes and stresses. Under the distributed output code developed by Sejnowski and Rosenberg, it is shown that BP consistently outperforms ID3 on this task by several percentage points. Three hypotheses explaining this difference were explored: (a) ID3 is overfitting the training data, (b) BP is able to share hidden units across several output units and hence can learn the output units better, and (c) BP captures statistical information that ID3 does not. We conclude that only hypothesis (c) is correct. By augmenting ID3 with a simple statistical learning procedure, the performance of BP can be approached but not matched. More complex statistical procedures can improve the performance of both BP and ID3 substantially. A study of the residual errors suggests that there is still substantial room for improvement in learning methods for text-to-speech mapping.
The performance of the error backpropagation (BP) and ID3 learning algorithms was compared on the task of mapping English text to phonemes and stresses. Under the distributed output code developed by Sejnowski and Rosenberg, it is shown that BP consistently outperforms ID3 on this task by several percentage points. Three hypotheses explaining this difference were explored: (a) ID3 is overfitting the training data, (b) BP is able to share hidden units across several output units and hence can learn the output units better, and (c) BP captures statistical information that ID3 does not. We conclude that only hypothesis (c) is correct. By augmenting ID3 with a simple statistical learning procedure, the performance of BP can be closely matched. More complex statistical procedures can improve the performance of both BP and ID3 substantially in this domain.
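To make the distributed output code concrete: each phoneme/stress label is represented by a binary feature vector, and a learner's real-valued outputs are mapped back to a label by choosing the nearest legal codeword. The sketch below assumes the codebook is simply given as a dict and uses Euclidean distance; it illustrates the decoding idea rather than the exact procedure of the paper.

```python
import numpy as np

def decode_output_code(activations, codebook):
    """Map a learner's real-valued output vector back to a label.

    activations: (B,) array of per-bit outputs in [0, 1].
    codebook: dict mapping each phoneme/stress label to its (B,)
    binary feature vector (the distributed code; assumed given).
    Returns the label whose codeword lies closest in Euclidean
    distance to the activation vector.
    """
    labels = list(codebook)
    dists = [np.linalg.norm(activations - np.asarray(codebook[lbl]))
             for lbl in labels]
    return labels[int(np.argmin(dists))]
```

Decoding to the nearest legal codeword is what lets a code with shared bits correct individual bit errors, which is one reason distributed output codes can outperform one-bit-per-class encodings.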