The acoustic models in state-of-the-art speech recognition systems are based on phones in context that are represented by hidden Markov models. This modeling approach may be limited in that it is hard to incorporate long-span acoustic context. Exemplar-based approaches are an attractive alternative, in particular if massive data and computational power are available. Yet, most of the data at Google are unsupervised and noisy. This paper investigates an exemplar-based approach under this not yet well understood data regime. A log-linear rescoring framework is used to combine the exemplar-based features on the word level with the first-pass model. This approach guarantees at least baseline performance and focuses the refined modeling on words with sufficient data. Experimental results for the Voice Search and YouTube tasks are presented.

Index Terms: Exemplar-based speech recognition, conditional random fields, speech recognition
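The word-level combination described in the abstract can be sketched as a standard log-linear model. The following formulation is a generic illustration, not the paper's exact parameterization; the feature names and weights are assumptions:

```latex
% Hedged sketch: generic log-linear rescoring over hypotheses W
% given acoustics X. Here f_0 denotes the first-pass model score,
% f_1, ..., f_M denote exemplar-based word-level features, and the
% lambda_i are trained combination weights (all names illustrative).
p(W \mid X) =
  \frac{\exp\bigl(\sum_{i=0}^{M} \lambda_i f_i(W, X)\bigr)}
       {\sum_{W'} \exp\bigl(\sum_{i=0}^{M} \lambda_i f_i(W', X)\bigr)}
```

Note that setting $\lambda_0 = 1$ and all other weights to zero recovers the first-pass model exactly, which is consistent with the abstract's claim that at least baseline performance is guaranteed.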
INTRODUCTION

State-of-the-art speech recognition systems are based on hidden Markov models (HMMs) to represent phones in context. These models are convenient due to their simplicity and compactness. However, it is hard to incorporate long-span acoustic context into this type of model without pooling observations from different examples at the frame level.

Non-parametric, exemplar-based approaches such as k-nearest neighbors (kNN) appear to be an attractive alternative for overcoming this limitation of conventional HMMs and may be more effective at capturing the large variability of speech. In this paper, we investigate an exemplar-based (also known as template-based) rescoring approach to speech recognition, which can be considered a variant of kNN on (pre-)segmented acoustic units such as words.

As with most non-parametric approaches, the main concerns about exemplar-based speech recognition are that it requires large amounts of data and, thus, massive computational power. The origin of the complexity is twofold. First, there is no compact representation as in the case of conventional HMMs, and all data need to be memorized and processed. Second, the Dynamic Time Warping (DTW) distance [1, 2] is used to measure the similarity between two templates. Using