Improving the feedback quality of a computer-based system for pronunciation training requires detailed and precise knowledge about the location and nature of the actual mispronunciations in a student’s utterance. To provide this kind of information, components for the automatic localisation and correction of pronunciation errors have been developed. This work was part of a project aimed at integrating state-of-the-art speech recognition technology into a pronunciation training environment for adult, intermediate-level learners. Although the technologies described here are in principle valid for any language pair, the current system focuses on Italian and German learners of English.
Large phonetic corpora including both standard and variant transcriptions are available for many languages. However, applications that use dynamic vocabularies make it necessary to transcribe words not present in the dictionary. In addition, alternative pronunciations added to the standard forms have been shown to improve recognition accuracy. New techniques for automatically generating pronunciation variants have therefore been investigated and proven to be very effective. Rule-based systems nevertheless remain useful for generating standard transcriptions not previously available, or for building new corpora oriented chiefly to synthesis applications. The present paper describes a letter-to-phone conversion system for Spanish designed to supply transcriptions to the flexible vocabulary speech recogniser and to the synthesiser, both developed at CSELT (Centro Studi e Laboratori Telecomunicazioni), Turin, Italy. Different sets of rules are designed for the two applications. The symbol inventories also differ, although the IPA alphabet is the reference system for both. The rules are written in ANSI C, implemented on DOS and Windows 95, and can be selectively applied. Two speech corpora have been transcribed by means of these grapheme-to-phoneme conversion rules: a) the SpeechDat Spanish corpus, which includes 4444 words extracted from the phonetically balanced sentences of the database; b) a corpus designed to train an automatic aligner to segment units for synthesis, composed of 303 sentences (3240 words) and 338 isolated words; the rule-based transcriptions of this corpus were manually corrected. The phonetic forms obtained by the rules matched the reference transcriptions satisfactorily: most mistakes on the first corpus were caused by the presence of secondary stresses in the SpeechDat transcriptions, which were not assigned by the rules, whereas errors on the synthesis corpus appeared mostly on hiatuses and on words of foreign origin.
Further developments oriented to recognition could add rules to account for Latin American pronunciations (especially Mexican, Argentinian and Paraguayan); for synthesis, on the other hand, rules representing coarticulatory phenomena at word boundaries could be implemented in order to transcribe whole sentences.
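As an illustration of what ordered, context-sensitive letter-to-phone rules can look like, the following is a minimal Python sketch. It is not the CSELT rule set (which is written in ANSI C and is not reproduced in the abstract): the rule table, the phone symbols and the function name `transcribe` are assumptions chosen for the example, and stress assignment, allophones and word-initial trills are deliberately ignored.

```python
import re

# Hypothetical, heavily simplified rule table for Castilian Spanish.
# Each entry is (grapheme pattern, phone); order matters, so digraphs and
# context-dependent rules come before the single-letter defaults.
RULES = [
    (r"ch", "tʃ"),
    (r"ll", "ʎ"),
    (r"rr", "r"),            # trill
    (r"qu(?=[eiéí])", "k"),  # silent u after q
    (r"gu(?=[eiéí])", "g"),  # silent u after g
    (r"c(?=[eiéí])", "θ"),
    (r"g(?=[eiéí])", "x"),
    (r"c", "k"),
    (r"z", "θ"),
    (r"j", "x"),
    (r"ñ", "ɲ"),
    (r"v", "b"),
    (r"r", "ɾ"),             # tap (word-initial trill not modelled)
    (r"h", ""),              # silent letter
]
COMPILED = [(re.compile(pattern), phone) for pattern, phone in RULES]


def transcribe(word):
    """Return a space-separated phone string for one orthographic word."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        for pattern, phone in COMPILED:
            match = pattern.match(word, i)
            if match:
                if phone:
                    phones.append(phone)
                i = match.end()
                break
        else:
            # No rule applies: the letter maps to itself.
            phones.append(word[i])
            i += 1
    return " ".join(phones)


print(transcribe("queso"))  # k e s o
print(transcribe("chico"))  # tʃ i k o
```

Keeping the rules in a data table rather than in control flow would make it straightforward to maintain separate rule sets and symbol inventories for recognition and for synthesis, as the abstract describes.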
The goal of the present study is to model, by curve fitting, the “iceberg” portions of the demisyllables previously extracted from the microbeam articulatory data (Bonaventura, 2003). The polynomial analysis was designed to include an appropriate weighting window centred on the threshold crossing point, and aimed to estimate how the curve deviates from a straight line in the vicinity of the crossing point: this deviation would be represented by the higher-order coefficients of the polynomial. The model was obtained preliminarily on the basis of 100 curves for the lower lip movement for /f/ and /v/ (in the initial and final demisyllables of “five”) and 100 curves for the tongue tip displacement (for /n/ in “nine”). To fit the data to the model, a robust least squares method (Least Absolute Residuals) was used in order to minimize the influence of the outliers, which are present in the read speech data and cannot be accounted for by “phrase final lengthening effects.” The fitted cubic polynomials approximated the “iceberg” curves satisfactorily. The 95% confidence bounds on the fitted coefficients indicated that they were acceptably accurate.
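The fitting procedure described above can be sketched roughly as follows. This is an illustrative Python version under stated assumptions, not the study's actual implementation: the Gaussian shape of the weighting window, its `width` parameter, the function name `fit_iceberg`, and the use of iteratively reweighted least squares to approximate a least-absolute-residuals fit are all choices made for the example.

```python
import numpy as np


def fit_iceberg(t, y, t_cross, width, degree=3, n_iter=50, eps=1e-6):
    """Windowed polynomial fit that approximately minimises absolute residuals.

    t, y     : sample times and articulator positions for one curve
    t_cross  : time of the threshold crossing (centre of the window)
    width    : standard deviation of the assumed Gaussian weighting window
    Returns coefficients, highest power first, in the variable (t - t_cross).
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    x = t - t_cross
    window = np.exp(-0.5 * (x / width) ** 2)   # emphasise the crossing region
    design = np.vander(x, degree + 1)          # columns: x^degree, ..., x, 1
    robust = np.ones_like(y)
    for _ in range(n_iter):
        # Weighted least squares: scale rows by the square root of the weights.
        w = np.sqrt(window * robust)
        coeffs, *_ = np.linalg.lstsq(design * w[:, None], y * w, rcond=None)
        residuals = y - design @ coeffs
        # Reweighting by 1/|residual| turns squared error into absolute error.
        robust = 1.0 / np.maximum(np.abs(residuals), eps)
    return coeffs


# Synthetic cubic with one gross outlier, standing in for a measured curve.
t = np.linspace(-1.0, 1.0, 41)
y = np.polyval([0.5, -0.2, 2.0, 0.1], t)
y[5] += 5.0
print(fit_iceberg(t, y, t_cross=0.0, width=1.0))
```

Because the polynomial is expressed in `t - t_cross`, the linear coefficient is the slope at the crossing point, and the quadratic and cubic coefficients describe the deviation from a straight line that the abstract refers to.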
The goal of the present study was to test whether models of portions of curves ('iceberg' portions of demisyllables), representing movements of the crucial articulator for production of place in syllables containing labiodental and alveolar gestures for production of obstruents, that had previously been found to be stable across different prosodic conditions (Bonaventura, 2003; 2005; 2006; Bonaventura and Fujimura, 2007), a) remained stable across different subjects' pronunciations for each consonantal class, and b) were significantly different for the two consonantal movements. Curves were previously extracted from microbeam articulatory data, from 3 subjects for Lower Lip (LL) movement for the word 'five' and 3 subjects for Tongue Tip (TT) movement for the word 'nine'. Curve fitting models were obtained, using a best-fit fourth-order polynomial, from a total of 1193 curves representing lower lip vertical displacement for production of [f] and [v] in 'five' and from a total of 610 curves representing tongue tip vertical displacement for production of [n] in 'nine'. Coefficients were statistically compared to verify the presence of a) non-significant differences between models across pronunciations by the 3 subjects, and b) a significant difference between the two generalized curves for [f] vs. [t] across subjects. Positive results from (a) would support the hypothesis that an articulatory pattern remains stable across different prosodic conditions and inter-subject variability, possibly indicating properties of an identifiable articulatory unit. Positive results from (b) would possibly indicate a consistent difference between crucial articulator movements for production of labiodental vs. dental consonantal gestures. Results did not show the expected similarity between movement curves across subjects' pronunciations, except for some stability in the TT coefficients in coda.
However, the comparison between the coefficients of the generalized models for TT and LL showed significant differences between the two movements in the final demisyllable. These results partially confirm the expected difference between models, indicating that at least the fitted curves for TT and for LL in the final demisyllable, if more stable across subjects' realizations, could be considered as a reference pattern representing normal speech for comparison with abnormal production.
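The abstract does not name the statistical test used to compare coefficients, so the following Python sketch is only one possible way to set up such a comparison. It fits a fourth-order polynomial to every curve with `numpy.polyfit` and then applies a two-sided permutation test to one coefficient at a time; the function names, the choice of a permutation test and the number of permutations are assumptions made for illustration.

```python
import numpy as np


def fit_coefficients(curves, t, degree=4):
    """Fit one polynomial per curve; returns an array of shape (n_curves, degree + 1)."""
    return np.array([np.polyfit(t, y, degree) for y in curves])


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means of two samples.

    a, b : 1-D arrays holding the same coefficient for two groups of curves
           (for example two subjects, or LL curves versus TT curves).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])   # a copy, so the inputs are not modified
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)           # random relabelling of the curves
        diff = abs(pooled[:a.size].mean() - pooled[a.size:].mean())
        count += diff >= observed
    # Add-one correction keeps the estimate strictly above zero.
    return (count + 1) / (n_perm + 1)
```

In this setup, hypothesis (a) would correspond to large p-values when the two groups are different subjects producing the same consonant, and hypothesis (b) to small p-values when the groups are LL and TT curves. Testing several coefficients in turn would call for a multiple-comparison correction, which is omitted here.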