Combined optimisation of baseforms and model parameters in speech recognition based on acoustic subword units

Holter, Trym; Svendsen, Torbjørn

doi:10.1109/asru.1997.659006

Cited by 14 publications

(12 citation statements)

References 8 publications

“…In (Holter and Svendsen, 1997), this was done through an iterative process of acoustic model estimation and pronunciation generation. In (Bacchiani and Ostendorf, 1999, 1998), a segmentation and clustering approach was exploited for derivation of subword units, with two main differences compared to the approaches explained in Section 2.3.1: (1) in the segmentation step, pronunciation related constraints is applied such that a given word has the same number of segments across the acoustic training data, and (2) a maximum-likelihood criteria that is consistent for both segmentation and clustering is utilized.…”

Section: Joint Approaches For ASWU Derivation and Pronunciation Gener…

mentioning

confidence: 99%

“…In the literature, interest in acoustic subword unit (ASWU) based lexicon development emerged from the pronunciation variation modeling perspective, specifically with the idea of overcoming limitation of linguistically motivated subword units, i.e., phones (Lee et al, 1988; Svendsen et al, 1989; Paliwal, 1990; Lee et al, 1988; Bacchiani and Ostendorf, 1998; Holter and Svendsen, 1997). However, recently, there has been a renewed interest from the perspective of handling lexical resource constraints (Singh et al, 2000; Hartmann et al, 2013).…”

mentioning

confidence: 99%

Towards weakly supervised acoustic subword unit discovery and lexicon development using hidden Markov models

Razavi

Rasipuram

Magimai.-Doss

2018

Speech Communication

State-of-the-art automatic speech recognition and text-to-speech systems are based on subword units, typically phonemes. This necessitates a lexicon that maps each word to a sequence of subword units. Development of a phonetic lexicon for a language requires linguistic knowledge as well as human effort, which may not be always readily available, particularly for under-resourced languages. In such scenarios, an alternative approach is to use a lexicon based on units such as, graphemes or subword units automatically derived from the acoustic data. This article focuses on automatic subword unit based lexicon development using methods that are employed for development of grapheme-based systems. Specifically, we present a novel hidden Markov model (HMM) based formalism for automatic derivation of subword units and pronunciation generation using only transcribed speech data. In this approach, the subword units are derived from the clustered context-dependent units in a grapheme based system using the maximum-likelihood criterion. The subword unit based pronunciations are then generated by learning either a deterministic or a probabilistic relationship between the graphemes and the acoustic subword units (ASWUs). In this article, we first establish the proposed framework on a well resourced language by comparing it against related approaches in the literature and investigating the transferability of the derived subword units to other domains. We then show the scalability of the proposed approach on real under-resourced scenarios by conducting studies on Scottish Gaelic, a genuinely under-resourced language, and comparing the approach against state-of-the-art grapheme-based ASR approaches.
Our experimental studies on English show that the derived subword units can not only lead to better ASR systems compared to graphemes, but can also be transferred across domains. The experimental studies on Scottish Gaelic show that the proposed ASWU-based lexicon development approach scales without any language specific considerations and leads to better ASR systems compared to a grapheme-based lexicon, including the case where ASR system performance is boosted through the use of acoustic models built with multilingual resources from resource-rich languages.

“…This paper aims to find a subword unit suitable for spontaneous speech recognition. Similar to our approach, some studies [13][14][15][16][17] have attempted to overcome the limitations of the phoneme unit. These studies focused on automatically deriving subword units from speech signals and constructing a lexicon based on them; this was done to build the subword unit using a data-driven, rather than hand-crafted approach.…”

Section: Related Work

mentioning

confidence: 99%

Acoustic Data-Driven Subword Units Obtained through Segment Embedding and Clustering for Spontaneous Speech Recognition

2020

We propose a method to extend a phoneme set by using a large amount of broadcast data to improve the performance of Korean spontaneous speech recognition. In the proposed method, we first extract variable-length phoneme-level segments from broadcast data and then convert them into fixed-length embedding vectors based on a long short-term memory architecture. We use decision tree-based clustering to find acoustically similar embedding vectors and then build new acoustic subword units by gathering the clustered vectors. To update the lexicon of a speech recognizer, we build a lookup table between the tri-phone units and the units derived from the decision tree. Finally, the proposed lexicon is obtained by updating the original phoneme-based lexicon by referencing the lookup table. To verify the performance of the proposed unit, we compare the proposed unit with the previous units obtained by using the segment-based k-means clustering method or the frame-based decision-tree clustering method. As a result, the proposed unit is shown to produce better performance than the previous units in both spontaneous, and read Korean speech recognition tasks.

In spontaneous speech recognition, the phoneme unit has a problem of acoustically low discrimination. In more detail, the phoneme unit in spontaneous speech has a smaller inter-unit distance and a larger variance than the phoneme unit in read speech, which is one of the major factors contributing to the decrease in recognition accuracy [7,8]. In general, using a decision tree in the implicit method shows improved speech recognition accuracy when segmented from acoustically discriminative units. This is also confirmed by the fact that a speech recognizer for read speech has shown better performance when segmented based on the phoneme unit instead of the grapheme unit. Thus, if we build an acoustically discriminative unit by clustering common spectral patterns from spontaneous speech, we can expect an improvement in the performance of spontaneous speech recognition.

We propose a method to improve the performance of spontaneous speech recognition by extending the phoneme set with a large amount of Korean broadcast data. The proposed unit is extracted in three steps. We first extract variable-length phoneme-level segments and then convert them into fixed-length latent vectors based on a long short-term memory (LSTM) architecture [9]. Finally, we use the decision tree-based clustering algorithm [4,10] to cluster acoustically similar latent vectors and then build a new acoustic subword unit by gathering the clustered vectors. In the unit derivation experiments, we compare the proposed and previous approaches [9,11] in terms of the fixed-length vector extraction and the clustering algorithm. The proposed unit is shown to produce better performance than the acoustic subword units obtained by previous methods, in both spontaneous and read speech recognition tasks.

This paper is an extension of our previous conference paper [9] that improves the clustering method from a k-means clusteri...
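
The general shape of the pipeline in this abstract (variable-length segments, fixed-length vectors, clustering into units) can be illustrated with a rough sketch. It is not the authors' method: mean pooling stands in for the LSTM embedding, plain k-means (closer to the segment-based k-means baseline the abstract mentions) stands in for decision tree-based clustering, and the data is synthetic. The names `embed_segment` and `kmeans` and all numeric settings are hypothetical.

```python
import numpy as np


def embed_segment(frames: np.ndarray) -> np.ndarray:
    """Map a (num_frames, feat_dim) segment to a fixed-length vector.

    Assumption: mean pooling replaces the LSTM embedding from the abstract.
    """
    return frames.mean(axis=0)


def kmeans(vectors: np.ndarray, k: int, iters: int = 20):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    # Start from the first vector, then repeatedly add the vector that is
    # farthest from all centroids chosen so far.
    centroids = [vectors[0]]
    for _ in range(1, k):
        dists = np.min(
            [np.linalg.norm(vectors - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(vectors[dists.argmax()])
    centroids = np.stack(centroids)

    for _ in range(iters):
        # Assign every vector to its nearest centroid.
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for j in range(k):
            members = vectors[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Synthetic stand-in for phoneme-level segments: two acoustically distinct
# groups, each segment having between 5 and 14 frames of 13-dim features.
rng = np.random.default_rng(0)
segments = [rng.normal(loc=0.0, size=(rng.integers(5, 15), 13)) for _ in range(20)]
segments += [rng.normal(loc=5.0, size=(rng.integers(5, 15), 13)) for _ in range(20)]

embeddings = np.stack([embed_segment(s) for s in segments])
labels, centroids = kmeans(embeddings, k=2)
print(embeddings.shape, labels)
```

Each resulting cluster label plays the role of one derived acoustic unit; in the paper these units are then linked to tri-phone units through a lookup table, which the sketch does not attempt.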

“…In [13,14], approaches based on maximum likelihood criterion are proposed. In [15], the authors provide a hierarchical Bayesian model to jointly learn the subword units and pronunciations.…”

Section: Introduction

mentioning

confidence: 99%

An HMM-based formalism for automatic subword unit derivation and pronunciation generation

Razavi

Magimai.-Doss

2015

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We propose a novel hidden Markov model (HMM) formalism for automatic derivation of subword units and pronunciation generation using only transcribed speech data. In this approach, the subword units are derived from the clustered context-dependent units in a grapheme based system using maximum-likelihood criterion. The subword unit based pronunciations are then learned in the framework of Kullback-Leibler divergence based HMM. The automatic speech recognition (ASR) experiments on WSJ0 English corpus show that the approach leads to 12.7% relative reduction in word error rate compared to grapheme-based system. Our approach can be beneficial in reducing the need for expert knowledge in development of ASR as well as text-to-speech systems.
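
For readers unfamiliar with the metric, a relative reduction in word error rate is the drop in WER divided by the baseline WER. The sketch below shows the arithmetic only. The abstract does not report absolute WERs, so the two values used here are hypothetical numbers chosen to reproduce a 12.7% relative reduction, and the function name is an assumption.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent); NOT taken from the paper.
grapheme_wer = 10.0
aswu_wer = 8.73

print(f"{relative_wer_reduction(grapheme_wer, aswu_wer):.1%}")
```

Because the reduction is relative, the same 12.7% figure corresponds to a smaller absolute gain when the baseline WER is lower.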
