Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains [1] to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more than 400 times smaller in model size.
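The second-pass step described above can be sketched as a simple score interpolation: the first-pass RNN-T model emits an N-best list, the LAS decoder rescores each hypothesis, and the combined log-probability picks the winner. This is a minimal illustration, not the paper's implementation; the interpolation weight and the toy scores below are invented for the example.

```python
# Minimal sketch of second-pass N-best rescoring. The weight `lam` is a
# tunable hyperparameter (the value here is illustrative, not from the paper).

def rescore_nbest(nbest, lam=0.5):
    """nbest: list of (hypothesis, rnnt_logprob, las_logprob) tuples.
    Returns the hypothesis maximizing the interpolated log-probability."""
    def combined(entry):
        _, rnnt_lp, las_lp = entry
        return (1 - lam) * rnnt_lp + lam * las_lp
    return max(nbest, key=combined)[0]

# Toy example with made-up scores: the LAS pass prefers the second hypothesis.
nbest = [
    ("play the song", -3.2, -4.8),
    ("play this song", -3.4, -3.1),
]
print(rescore_nbest(nbest))  # "play this song"
```

Because the LAS rescorer only scores a short list of complete hypotheses rather than decoding from scratch, the second pass adds little latency, which is the tradeoff the abstract exploits.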
Experimental evidence has consistently confirmed the ability of uninformed traders, even novices, to infer information from the trading process. After contrasting brain activation in subjects watching markets with and without insiders, we hypothesize that Theory of Mind (ToM) helps explain this pattern, where ToM refers to the human capacity to discern malicious or benevolent intent. We find that skill in predicting price changes in markets with insiders correlates with scores on two ToM tests. We document GARCH-like persistence in transaction price changes that may help investors read markets when there are insiders.

This paper reports results from experiments meant to explore how uninformed traders read information from transaction prices and order flow in financial markets with insiders. Since the seminal experiments of Charles Plott and Shyam Sunder in the early 1980s (Plott and Sunder (1988)), it has been repeatedly confirmed (as we will do here too) that uninformed traders are quite capable of quickly inferring the signals that informed traders (insiders) have about future dividends, despite anonymity of the trading process, despite a lack of structural knowledge of the situation, and despite the absence of long histories from which they can learn the market's statistical regularities. It is striking that so little is understood about the ability of the uninformed to infer the signals of others. This ability constitutes the basis of the efficient markets hypothesis (Fama (1991)), EMH, which states that prices fully reflect all available information. Underlying EMH are the ideas that the uninformed will trade on the signals they manage to infer, and that, through the orders of the uninformed, these signals are effectively amplified in the price formation process. In the extreme, prices will fully reflect all available information.
However, without a better understanding of how the uninformed read

* Bruguier, Quartz, and Bossaerts are at the California Institute of Technology, and Bossaerts is also at the Ecole Polytechnique Fédérale Lausanne.
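The "GARCH-like persistence" the abstract documents refers to volatility clustering: today's conditional variance depends on yesterday's squared shock and yesterday's variance, so large price changes tend to be followed by large price changes. The sketch below simulates a standard GARCH(1,1) process to illustrate the mechanism; the parameter values are illustrative, not estimates from the paper's experimental data.

```python
import random

# GARCH(1,1) sketch: var_t = omega + alpha * shock_{t-1}^2 + beta * var_{t-1}.
# alpha + beta < 1 keeps the process stationary; their sum measures persistence.

def simulate_garch(n, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    rng = random.Random(seed)
    var = omega / (1 - alpha - beta)  # start at the unconditional variance
    returns = []
    for _ in range(n):
        shock = rng.gauss(0.0, 1.0) * var ** 0.5
        returns.append(shock)
        var = omega + alpha * shock ** 2 + beta * var  # persistence term
    return returns

series = simulate_garch(5)
print(len(series))  # 5
```

An uninformed trader who detects this kind of persistence in transaction prices has a statistical handle on when insiders are likely active, which is one channel through which the abstract suggests markets can be "read".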
We study the problem of compressing recurrent neural networks (RNNs). In particular, we focus on compressing RNN acoustic models, motivated by the goal of building compact, accurate speech recognition systems that can run efficiently on mobile devices. In this work, we present a technique for general recurrent model compression that jointly compresses both recurrent and non-recurrent inter-layer weight matrices. We find that the proposed technique allows us to reduce the size of our Long Short-Term Memory (LSTM) acoustic model to a third of its original size with negligible loss in accuracy.
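One way to realize joint compression of the recurrent and inter-layer matrices is to give them a shared low-rank factor: since both matrices consume the same hidden state, they can share one projection. The sketch below obtains such a shared factor post hoc via a truncated SVD of the two matrices stacked along their output dimension; this is an illustrative stand-in, since the technique in the paper learns the factors jointly during training rather than factorizing trained weights.

```python
import numpy as np

def jointly_compress(w_rec, w_inter, rank):
    """Factorize W_rec (h, h) and W_inter (h, h) with one shared projection.

    Returns (z_rec, z_inter, proj) such that W_rec ~ z_rec @ proj and
    W_inter ~ z_inter @ proj. Storage drops from 2*h*h to (2*h + h) * rank.
    """
    stacked = np.vstack([w_rec, w_inter])         # (2h, h)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    proj = vt[:rank]                              # shared projection (rank, h)
    z = u[:, :rank] * s[:rank]                    # per-matrix factors (2h, rank)
    h = w_rec.shape[0]
    return z[:h], z[h:], proj

rng = np.random.default_rng(0)
w_rec, w_inter = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
z_rec, z_inter, proj = jointly_compress(w_rec, w_inter, rank=8)
# At full rank the factorization is exact up to numerical error.
print(np.allclose(z_rec @ proj, w_rec))  # True
```

Choosing `rank` well below the hidden size is what yields the 3x size reduction the abstract reports, at the cost of some approximation error that joint training then recovers.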
In conventional speech recognition, phoneme-based models outperform grapheme-based models for non-phonetic languages such as English. The performance gap between the two typically narrows as the amount of training data is increased. In this work, we examine the impact of the choice of modeling unit for attention-based encoder-decoder models. We conduct experiments on the LibriSpeech 100hr, 460hr, and 960hr tasks, using various target units (phoneme, grapheme, and word-piece); across all tasks, we find that grapheme or word-piece models consistently outperform phoneme-based models, even though they are evaluated without a lexicon or an external language model. We also investigate model complementarity: we find that we can improve WERs by up to 9% relative by rescoring N-best lists generated from a strong word-piece-based baseline with either the phoneme or the grapheme model. Rescoring an N-best list generated by the phonemic system, however, provides limited improvements. Further analysis shows that the word-piece-based models produce more diverse N-best hypotheses, and thus lower oracle WERs, than phonemic models.
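The oracle WER mentioned above is the error rate an ideal selector would achieve if it could always pick the best hypothesis in the N-best list, so a more diverse list tends to score lower. A minimal sketch, using a standard word-level edit distance (the example sentences are invented):

```python
# Oracle WER: for each utterance, score every N-best hypothesis against the
# reference and keep the minimum, then normalize by the reference length.

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via a single rolling DP row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i          # 'prev' tracks the diagonal cell
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (r != h)) # substitution / match
    return d[-1]

def oracle_wer(reference, nbest):
    ref = reference.split()
    best = min(edit_distance(ref, hyp.split()) for hyp in nbest)
    return best / len(ref)

nbest = ["play the song", "play this song", "played the song"]
print(oracle_wer("play the song", nbest))  # 0.0
```

A list whose hypotheses all make the same mistakes has a high oracle WER no matter how many entries it contains, which is why diversity, not list length alone, drives the rescoring gains the abstract reports.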