2012
DOI: 10.1109/msp.2012.2210952

Subword Modeling for Automatic Speech Recognition: Past, Present, and Emerging Approaches

Abstract: Modern automatic speech recognition systems handle large vocabularies of words, making it infeasible to collect enough repetitions of each word to train individual word models. Instead, large-vocabulary recognizers represent each word in terms of sub-word units. Typically the sub-word unit is the phone, a basic speech sound such as a single consonant or vowel. Each word is then represented as a sequence, or several alternative sequences, of phones specified in a pronunciation dictionary. Other choices…

Cited by 32 publications (22 citation statements) | References 54 publications
“…The idea of using lexicons based on ASWUs instead of linguistically motivated units has been appealing to the ASR community for three main reasons: (1) ASWUs tend to be data-dependent rather than dependent on linguistic knowledge, as they are typically obtained by optimizing an objective function on training speech data (Lee et al., 1988; Bacchiani and Ostendorf, 1998); (2) they could possibly help in handling pronunciation variations (Livescu et al., 2012); and (3) they can avoid the need for explicit phonetic knowledge.…”
Section: Literature Survey on ASWU Derivation and Pronunciation Generation
confidence: 99%
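The statement above notes that ASWUs (automatically derived subword units) are typically obtained by optimizing an objective function over training speech data. A minimal toy sketch of that idea, not any of the cited methods, is to cluster acoustic feature frames so that each cluster index acts as a data-driven unit; here k-means distortion plays the role of the objective, and the MFCC-like features are synthetic stand-ins:

```python
# Toy sketch only (assumed setup, not Lee et al. 1988 or Bacchiani &
# Ostendorf 1998): derive data-driven "units" by clustering feature frames.
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 13))  # stand-in for MFCC frames

def kmeans(X, k=8, iters=20):
    # initialize cluster centers from random frames
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest center (squared distance)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # move each center to the mean of its assigned frames
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers, labels

centers, labels = kmeans(frames)
# each cluster index now plays the role of an automatically derived unit
units = [f"u{j}" for j in labels[:10]]
```

Real ASWU derivation works on sequences (segmenting speech and jointly learning unit inventories and pronunciations), but the data-driven character the quotation describes is already visible in this reduced form.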
“…Constructing speech processing systems for a language usually implies developing an ASR system. This requires a significant amount of time-coded and transcribed speech data, along with the corresponding text and a pronunciation dictionary (Livescu et al., 2012; Rabiner et al., 1993). Unless the training corpus is read speech based on a script, researchers have usually had to provide transcriptions by hand.…”
Section: A Forced Alignment
confidence: 99%
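The pronunciation dictionary mentioned above maps each word to one or more phone sequences, which a forced aligner uses to line up the transcript with the audio. A small sketch of a CMUdict-style lexicon and a parser for it (the entries and the variant-numbering convention shown are illustrative assumptions, not quoted from any cited work):

```python
# Hypothetical CMUdict-style lexicon: one entry per line, word followed by
# its phone sequence; "(2)" marks an alternative pronunciation of a word.
RAW = """\
SPEECH  S P IY CH
READ    R IY D
READ(2) R EH D
"""

def parse_dict(text):
    lexicon = {}
    for line in text.splitlines():
        word, *phones = line.split()
        word = word.split("(")[0]          # merge pronunciation variants
        lexicon.setdefault(word, []).append(phones)
    return lexicon

lex = parse_dict(RAW)
# lex["READ"] holds both alternative phone sequences for the word
```

Storing several alternative sequences per word is exactly the "sequence, or several alternative sequences, of phones" representation described in the abstract.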
“…This is especially useful for under-resourced languages lacking a substantial set of training data for a complete ASR system (Badenhorst et al., 2011; Livescu et al., 2012; Li, 2008a, 2008b). The accuracy of this strategy depends closely on the fit between the phone set used to build the aligner and the phone set of the target language.…”
Section: A Forced Alignment
confidence: 99%
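The "fit between phone sets" issue above arises because an aligner trained on one language may lack phones of the target language, so each missing phone must be mapped to a near substitute. A toy illustration of such a mapping (the phone inventory and substitution table are assumptions for illustration, not from any cited work):

```python
# Hypothetical phone inventory of an existing aligner (ARPABET-like labels).
ALIGNER_PHONES = {"P", "B", "T", "D", "S", "SH", "IY", "EH", "AA"}

# Hand-specified nearest substitutes for target-language phones the
# aligner's set lacks; quality of alignment degrades with each substitution.
SUBSTITUTES = {"TS": "T", "ZH": "SH"}

def map_phone(p):
    """Return an aligner phone for target phone p, or None if unmappable."""
    if p in ALIGNER_PHONES:
        return p
    return SUBSTITUTES.get(p)

mapped = [map_phone(p) for p in ["T", "TS", "ZH", "AA"]]
```

The more entries such a table needs, the worse the fit the quoted statement warns about.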
“…However, the subword units can be derived automatically from the speech signal, which can possibly help in better handling of pronunciation variations [1]. Moreover, with growing interest in the development of ASR systems for under-resourced languages, attempts have been made to automatically derive subword units as well as pronunciations based on such units.…”
Section: Introduction
confidence: 99%