End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an openended dictation task, running with a median speed that is seven times faster than real-time.
We explain dialogue management techniques for collaborative activities with humans, involving multiple concurrent tasks. Conversational context for multiple concurrent activities is represented using a "Dialogue Move Tree" and an "Activity Tree" which support multiple interleaved threads of dialogue about different activities and their execution status. We also describe the incremental message selection, aggregation, and generation method employed in the system.
In this paper we describe the development of an accurate, smallfootprint, large vocabulary speech recognizer for mobile devices. To achieve the best recognition accuracy, state-of-the-art deep neural networks (DNNs) are adopted as acoustic models. A variety of speedup techniques for DNN score computation are used to enable real-time operation on mobile devices. To reduce the memory and disk usage, on-the-fly language model (LM) rescoring is performed with a compressed n-gram LM. We were able to build an accurate and compact system that runs well below real-time on a Nexus 4 Android phone.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.