This paper describes two experiments aimed at exploring the relationship between objective properties of speech and perceived fluency in read and spontaneous speech. The aim is to determine whether such quantitative measures can be used to develop objective fluency tests. Fragments of read speech (Experiment 1) of 60 non-native speakers of Dutch and of spontaneous speech (Experiment 2) of another group of 57 non-native speakers of Dutch were scored for fluency by human raters and were analyzed by means of a continuous speech recognizer to calculate a number of objective measures of speech quality known to be related to perceived fluency. The results show that the objective measures investigated in this study can be employed to predict fluency ratings, but their predictive power is stronger for read speech than for spontaneous speech. Moreover, which variables are adequate appears to depend on the specific type of speech material investigated and the specific task performed by the speaker.
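As a rough illustration of the kind of prediction reported here, the sketch below (Python; not the authors' code) correlates hypothetical per-speaker objective measures with mean fluency ratings. The measure names, the simulated data, and the use of Pearson correlation as the sole statistic are illustrative assumptions.

```python
# Minimal sketch: correlate automatically derived temporal measures with mean
# human fluency ratings, one value per speaker. All data below are simulated
# placeholders, not the study's measurements.
import numpy as np
from scipy import stats

def fluency_correlations(measures, ratings):
    """Pearson r between each objective measure and the mean fluency rating."""
    return {name: stats.pearsonr(values, ratings)[0]
            for name, values in measures.items()}

rng = np.random.default_rng(0)
ratings = rng.normal(6.0, 1.5, size=60)  # hypothetical mean expert scores
measures = {
    "speech_rate": 0.9 * ratings + rng.normal(0, 0.5, 60),
    "articulation_rate": 0.7 * ratings + rng.normal(0, 0.8, 60),
    "num_pauses": -0.8 * ratings + rng.normal(0, 0.9, 60),
}
print(fluency_correlations(measures, ratings))
```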
To determine whether expert fluency ratings of read speech can be predicted on the basis of automatically calculated temporal measures of speech quality, an experiment was conducted with read speech of 20 native and 60 non-native speakers of Dutch. The speech material was scored for fluency by nine experts and was then analyzed by means of an automatic speech recognizer in terms of quantitative measures such as speech rate, articulation rate, number and length of pauses, number of dysfluencies, mean length of runs, and phonation/time ratio. The results show that expert ratings of fluency in read speech are reliable (Cronbach's α varies between 0.90 and 0.96) and that these ratings can be predicted on the basis of quantitative measures: for six automatic measures the magnitude of the correlations with the fluency scores varies between 0.81 and 0.93. Rate of speech appears to be the best predictor: correlations vary between 0.90 and 0.93. Two other important determinants of reading fluency are the rate at which speakers articulate the sounds and the number of pauses they make. Apparently, rate of speech is such a good predictor of perceived fluency because it incorporates these two aspects.
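The temporal measures listed above can in principle be derived from a phone-level segmentation of the kind an automatic recognizer produces. The sketch below is an illustration under stated assumptions, not the authors' implementation: the "sil" pause label, the 0.2 s minimum pause duration, and phone-based rates are all placeholder choices.

```python
# Minimal sketch of the temporal fluency measures, computed from a hypothetical
# phone-level segmentation given as (label, start_s, end_s) items.
from dataclasses import dataclass

@dataclass
class Segment:
    label: str
    start: float
    end: float

def temporal_measures(segments, pause_label="sil", min_pause=0.2):
    total_time = segments[-1].end - segments[0].start
    speech_segs = [s for s in segments if s.label != pause_label]
    pauses = [s for s in segments
              if s.label == pause_label and (s.end - s.start) >= min_pause]
    phonation_time = sum(s.end - s.start for s in speech_segs)
    n_phones = len(speech_segs)

    # Mean length of runs: phones per stretch of speech between pauses.
    runs, current = [], 0
    for s in segments:
        if s.label == pause_label and (s.end - s.start) >= min_pause:
            if current:
                runs.append(current)
            current = 0
        elif s.label != pause_label:
            current += 1
    if current:
        runs.append(current)

    return {
        "speech_rate": n_phones / total_time,            # phones/s, pauses included
        "articulation_rate": n_phones / phonation_time,  # phones/s, pauses excluded
        "num_pauses": len(pauses),
        "mean_pause_length": sum(p.end - p.start for p in pauses) / len(pauses) if pauses else 0.0,
        "phonation_time_ratio": phonation_time / total_time,
        "mean_length_of_runs": sum(runs) / len(runs) if runs else 0.0,
    }
```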
In this paper, we examine the relationship between pedagogy and technology in Computer Assisted Pronunciation Training (CAPT) courseware. First, we will analyse the available literature on second language pronunciation teaching and learning in order to derive some general guidelines for effective training. Second, we will present an appraisal of various CAPT systems with a view to establishing whether they meet these pedagogical requirements. In this respect, we will show that many commercial systems tend to favour technological novelty over pedagogical criteria that could benefit the learner more. While examining the limitations of today's technology, we will consider possible ways to deal with these shortcomings. Finally, we will combine the information thus gathered to suggest some recommendations for future CAPT.
An effective way to increase the noise robustness of automatic speech recognition is to label noisy speech features as either reliable or unreliable (missing), and to replace (impute) the missing ones by clean speech estimates. Conventional imputation techniques employ parametric models and impute the missing features on a frame-by-frame basis. At low SNRs these techniques fail, because too many time frames may contain few, if any, reliable features. In this paper we introduce a novel non-parametric, exemplar-based method for reconstructing clean speech from noisy observations, based on techniques from the field of Compressive Sensing. The method, dubbed sparse imputation, can impute missing features using larger time windows such as entire words. Using an overcomplete dictionary of clean speech exemplars, the method finds the sparsest combination of exemplars that jointly approximate the reliable features of a noisy utterance. That linear combination of clean speech exemplars is used to replace the missing features. Recognition experiments on noisy isolated digits show that sparse imputation outperforms conventional imputation techniques at SNR = -5 dB when using an ideal 'oracle' mask. With error-prone estimated masks sparse imputation performs slightly worse than the best conventional technique.
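The core reconstruction step can be sketched as sparse recovery restricted to the reliable rows of an exemplar dictionary. The code below is a minimal illustration under assumptions of our own (an l1-penalised Lasso solver, its regularisation weight, and the array layout); it is not the paper's implementation.

```python
# Minimal sketch of sparse imputation: approximate the reliable entries of a
# noisy feature window with a sparse combination of clean exemplars, then use
# that combination to fill in the unreliable (missing) entries.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_impute(noisy_vec, reliable_mask, dictionary, alpha=0.01):
    """noisy_vec: (d,) vectorised feature window; reliable_mask: (d,) bool;
    dictionary: (d, n_exemplars) clean speech exemplars as columns."""
    D_rel = dictionary[reliable_mask]                    # reliable rows only
    y_rel = noisy_vec[reliable_mask]
    coder = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    coder.fit(D_rel, y_rel)                              # sparse exemplar weights
    reconstruction = dictionary @ coder.coef_            # full clean-speech estimate
    imputed = noisy_vec.copy()
    imputed[~reliable_mask] = reconstruction[~reliable_mask]
    return imputed
```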
The study of phonetic contrasts and related phenomena, e.g. inter- and intra-speaker variability, often requires analysing data in the form of measured time series, like f0 contours and formant trajectories. As a consequence, the investigator has to find suitable ways to reduce the raw and abundant numerical information contained in a bundle of time series into a small but sufficient set of numerical descriptors of their shape. This approach requires one to decide in advance which dynamic traits to include in the analysis and which not. For example, a rising pitch gesture may be represented by its duration and slope, hence reducing it to a straight segment, or by a richer coding specifying also whether (and how much) the rising contour is concave or convex, the latter being irrelevant in some contexts but crucial in others. Decisions become even more complex when a phenomenon is described by a multidimensional time series, e.g. by the first two formants. In this paper we introduce a methodology based on Functional Data Analysis (FDA) that allows the investigator to delegate most of the decisions involved in the quantitative description of multidimensional time series to the data themselves. FDA produces a data-driven parametrisation of the main shape traits present in the data that is visually interpretable, in the same way as slopes or peak heights are. These output parameters are numbers that are amenable to ordinary statistical analysis, e.g. linear (mixed effects) models. FDA is also able to capture correlations among different dimensions of a time series, e.g. between formants F1 and F2. We present FDA by means of an extended case study on the diphthong-hiatus distinction in Spanish, a contrast that involves duration, formant trajectories and pitch contours.
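In its simplest form, the data-driven parametrisation described above can be approximated by a functional principal component analysis of time-normalised contours. The sketch below is an illustration under strong simplifying assumptions (no smoothing, no curve registration), not the authors' FDA pipeline.

```python
# Minimal sketch of functional PCA: each contour, sampled on a common time grid,
# is summarised by a few scores on data-driven principal component "shapes".
import numpy as np

def functional_pca(curves, n_components=3):
    """curves: (n_curves, n_timepoints) array, e.g. time-normalised f0 tracks."""
    mean_curve = curves.mean(axis=0)
    centred = curves - mean_curve
    # Rows of Vt are the principal component curves (main shape traits).
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    components = Vt[:n_components]
    scores = centred @ components.T          # one number per curve per trait
    return mean_curve, components, scores
```

The resulting scores are ordinary numbers and can enter, for example, a linear mixed effects model; a two-dimensional trajectory such as (F1, F2) can be handled by concatenating the two time-normalised tracks per token, which is one simple way to let the decomposition capture cross-dimension correlations.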