End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
This paper describes the technical and system building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that do multichannel processing jointly with acoustic modeling, and Grid-LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home specific data. We present results on a variety of multichannel sets. The combination of technical and system advances result in a reduction of WER of 8-28% relative compared to the current production system.
Sequence-to-sequence models provide a simple and elegant solution for building speech recognition systems by folding separate components of a typical system, namely acoustic (AM), pronunciation (PM) and language (LM) models into a single neural network. In this work, we look at one such sequence-to-sequence model, namely listen, attend and spell (LAS) [1], and explore the possibility of training a single model to serve different English dialects, which simplifies the process of training multi-dialect systems without the need for separate AM, PM and LMs for each dialect. We show that simply pooling the data from all dialects into one LAS model falls behind the performance of a model fine-tuned on each dialect. We then look at incorporating dialect-specific information into the model, both by modifying the training targets by inserting the dialect symbol at the end of the original grapheme sequence and also feeding a 1-hot representation of the dialect information into all layers of the model. Experimental results on seven English dialects show that our proposed system is effective in modeling dialect variations within a single LAS model, outperforming a LAS model trained individually on each of the seven dialects by 3.1~16.5% relative.
Current state-of-the-art automatic speech recognition systems are trained to work in specific 'domains', defined based on factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance significantly drops. This work explores the idea of building a single domain-invariant model for varied use-cases by combining large scale training data from multiple application domains. Our final system is trained using 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and sampling rates. Our results show that, even at such a scale, a model thus trained works almost as well as those fine-tuned to specific subsets: A single model can be robust to multiple application domains, and variations like codecs and noise. More importantly, such models generalize better to unseen conditions and allow for rapid adaptation -we show that by using as little as 10 hours of data from a new domain, an adapted domain-invariant model can match performance of a domain-specific model trained from scratch using 70 times as much data. We also highlight some of the limitations of such models and areas that need addressing in future work.
This paper presents a simple and robust consensus decoding approach for combining multiple Machine Translation (MT) system outputs. A consensus network is constructed from an N -best list by aligning the hypotheses against an alignment reference, where the alignment is based on minimising the translation edit rate (TER). The Minimum Bayes Risk (MBR) decoding technique is investigated for the selection of an appropriate alignment reference. Several alternative decoding strategies proposed to retain coherent phrases in the original translations. Experimental results are presented primarily based on three-way combination of Chinese-English translation outputs, and also presents results for six-way system combination. It is shown that worthwhile improvements in translation performance can be obtained using the methods discussed.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.