Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition

Kanthak, Stephan; Ney, Hermann

doi:10.1109/icassp.2002.5743871

Cited by 26 publications

(34 citation statements)

References 12 publications


“…The first system, or "Grapheme" in the figure, with 26 letter units, starts 3% absolute lower than the "Phoneme Baseline w/ Autogen Prons" system for the smallest training set, but outperforms it as the amount of training data increases (76.9% vs 76.5% for the largest training set). This is consistent with Kanthak's and Killer's observations [7,8] (Kanthak's English training set contained less than 100 hours of speech). It is also consistent with our intuition that training data can somewhat compensate for the acoustic diversity of English letters by implicitly modeling the various sounds corresponding to each letter symbol.…”

Section: Results (supporting)

confidence: 89%

“…Kanthak et al [7] and Killer et al [8] observed experimentally that for some languages, grapheme systems performed roughly as well as phoneme systems, but that for others, such as English, there was a high error-rate cost to moving to graphemes. This was attributed by the authors to the poor spelling to pronunciation correspondence of the English language, which is another way of observing that, in English, letter units lack acoustic consistency, and that consistency matters, much like Cravero et al had suggested.…”

Section: Introduction (mentioning)

confidence: 99%


Revisiting graphemes with increasing amounts of data

Sung, Hughes, Beaufays, et al. 2009

2009 IEEE International Conference on Acoustics, Speech and Signal Processing


Letter units, or graphemes, have been reported in the literature as a surprisingly effective substitute to the more traditional phoneme units, at least in languages that enjoy a strong correspondence between pronunciation and orthography. For English however, where letter symbols have less acoustic consistency, previously reported results fell short of systems using highly-tuned pronunciation lexicons. Grapheme units simplify system design, but since graphemes map to a wider set of acoustic realizations than phonemes, we should expect grapheme-based acoustic models to require more training data to capture these variations. In this paper, we compare the rate of improvement of grapheme and phoneme systems trained with datasets ranging from 450 to 1200 hours of speech. We consider various grapheme unit configurations, including using letter-specific, onset, and coda units. We show that the grapheme systems improve faster and, depending on the lexicon, reach or surpass the phoneme baselines with the largest training set.


“…This is often unavailable or may be inconsistent if derived from multiple sources. Alternatively a grapheme-based speech recognition system [1,2] could be built. The recogniser then only needs an orthographic lexicon to specify the vocabulary rather than a pronunciation lexicon.…”

Section: Introduction (mentioning)

confidence: 99%

Unicode-based graphemic systems for limited resource languages

Gales, Knill, Ragni, 2015

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Large vocabulary continuous speech recognition systems require a mapping from words, or tokens, into sub-word units to enable robust estimation of acoustic model parameters, and to model words not seen in the training data. The standard approach to achieve this is to manually generate a lexicon where words are mapped into phones, often with attributes associated with each of these phones. Context-dependent acoustic models are then constructed using decision trees where questions are asked based on the phones and phone attributes. For low-resource languages, it may not be practical to manually generate a lexicon. An alternative approach is to use a graphemic lexicon, where the "pronunciation" for a word is defined by the letters forming that word. This paper proposes a simple approach for building graphemic systems for any language written in Unicode. The attributes for graphemes are automatically derived using features from the Unicode character descriptions. These attributes are then used in decision tree construction. This approach is examined on the IARPA Babel Option Period 2 languages, and a Levantine Arabic CTS task. The described approach achieves comparable, and complementary, performance to phonetic lexicon-based approaches.
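To make the idea of a graphemic lexicon concrete, here is a minimal Python sketch. It is not the method from the paper above: the function names `graphemic_lexicon` and `grapheme_attributes` are hypothetical, and the choices to lowercase words, keep only letter characters, and split the Unicode character name into tokens are assumptions made for illustration. It uses only the standard-library `unicodedata` module.

```python
import unicodedata


def graphemic_lexicon(words):
    """Map each word to its letter sequence (its graphemic "pronunciation").

    Assumption for this sketch: words are lowercased and only characters
    in a Unicode letter category (L*) are kept as units.
    """
    lexicon = {}
    for word in words:
        units = [
            ch for ch in word.lower()
            if unicodedata.category(ch).startswith("L")
        ]
        lexicon[word] = units
    return lexicon


def grapheme_attributes(ch):
    """Return the tokens of a character's Unicode name as candidate attributes.

    Assumption for this sketch: each token of the name (e.g. "LATIN",
    "ACUTE") is treated as one attribute; unnamed characters get none.
    """
    return unicodedata.name(ch, "").split()


if __name__ == "__main__":
    lex = graphemic_lexicon(["café", "don't"])
    for word, units in lex.items():
        print(word, units)
    print(grapheme_attributes("é"))
```

Attribute tokens of this kind could, in principle, serve as the questions asked during decision-tree construction, in the way phone attributes are used in a phonetic system.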


“…There is a wide variety of solutions that addresses these problems in different ways, ranging from the detection of large vocabularies [17], through the detection of spoken numbers for telephone applications [18], to the detection of segments, spoken or not spoken [19]. When working in complex environments with limited amount of data, multilingual contexts, nonlinearities, or uncontrollable noise, some possibilities are based on: enriching poor resources of a language with resources from another powerful language beside it, approaches oriented to the lack of resources, cross-lingual approaches [20], training of acoustic models for a new language using results from other languages [21], data optimization methods, collaborative systems, or open configuration systems. However, the development of a robust ASR system is very tough when there are under-resourced languages involved, even if there are powerful languages beside them, and the classic techniques perform poorly with regard to correct rates [22,23].…”

Section: Description of the Environment and Requirements of the System (mentioning)

confidence: 99%

Multilingual audio information management system based on semantic knowledge in complex environments

Ipiña, Barroso, Calvo, et al. 2020

Neural Comput & Applic


This paper proposes a multilingual audio information management system based on semantic knowledge in complex environments. The complex environment is defined by the limited resources (financial, material, human, and audio resources); the poor quality of the audio signal taken from an internet radio channel; the multilingual context (Spanish, French, and Basque, which is under-resourced in some areas); and the regular appearance of cross-lingual elements between the three languages. In addition to this, the system is also constrained by the requirements of the local multilingual industrial sector. We present the first evolutionary system based on a scalable architecture that is able to fulfill these specifications with automatic adaptation based on automatic semantic speech recognition, folksonomies, automatic configuration selection, machine learning, neural computing methodologies, and collaborative networks. As a result, it can be said that the initial goals have been accomplished and the usability of the final application has been tested successfully, even with non-experienced users.

Keywords: Evolutionary computing · Artificial neural networks · Internet information management · Management of complex systems

Abbreviations: AdiUP, Audio information management system; ANN, Artificial neural networks; APD, Acoustic phonetic decoding; ASR, Automatic speech recognition; Cl, Correct rates for classes; Co, Correct rates for concepts; FFT, Fast Fourier transform; FIP, Filler insertion penalty

