Although artificial neural networks have existed for over half a century, it was not until 2010 that they made a significant impact on speech recognition, in the form of deep networks. This invited paper, based on my keynote talk given at the Interspeech conference in Singapore in September 2014, will
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA. Corresponding author: Li Deng. Email: deng@microsoft.com

I. INTRODUCTION

The main theme of this paper is to reflect on the recent history of how deep learning has profoundly revolutionized the field of automatic speech recognition (ASR), and to elaborate on the lessons we can learn, not only to further advance ASR technology but also to impact the related, arguably more important, applications in language and multimodal processing. Language processing concerns "downstream" analysis and distillation of information from the ASR systems' outputs. Semantic analysis of language and multimodal processing involving speech, text, and image, both of which have advanced rapidly on the basis of deep learning over the past few years, hold the potential to solve some of the difficult remaining ASR problems and to present new challenges for deep learning technology.

A message to be conveyed in this paper is the importance of broadening deep learning from deep neural networks (DNNs) to include deep generative models as well. In fact, a brief historical review conducted in Section II will touch on how the development of deep (and dynamic) generative models of speech played a role in the inroads of DNNs into modern ASR. Since 2011, the DNN has replaced the previously dominant (shallow) generative model of speech, the Gaussian Mixture Model (GMM), as the output distribution in the Hidden Markov Model (HMM). This purely discriminative DNN, well known to the ASR community, can be considered a shallow network unfolded in space. When the unfolding occurs in time, we have the recurrent neural network (RNN). On the other hand, deep generative models have distinct advantages over discriminative DNNs, including model interpretability, the ability to embed domain knowledge and causal relationships, and the ability to model uncertainty.
Deep generative and discriminative models represent two apparently opposing approaches, yet they have highly complementary strengths and weaknesses. The further success of deep learning is likely to lie in seamlessly integrating the two approaches in a practically effective and theoretically appealing fashion, so as to achieve the best of both worlds.

The remainder of this paper is organized as follows. In Section II, a brief history is provided of how deep learning made inroads into speech recognition, and a number of enabling factors are discussed. Outstanding achievements of deep learning both in the academic world and in industry to