Work on voice sciences over recent decades has led to a proliferation of acoustic parameters that are used quite selectively and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards become an essential safeguard to ensure compliance with state-of-the-art methods allowing appropriate comparison of results across studies and potential integration and combination of extraction and recognition systems. In this paper we propose a basic standard acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis. In contrast to a large brute-force parameter set, we present a minimalistic set of voice parameters here. These were selected based on a) their potential to index affective physiological changes in voice production, b) their proven value in former studies as well as their automatic extractability, and c) their theoretical significance. The set is intended to provide a common baseline for evaluation of future research and eliminate differences caused by varying parameter sets or even different implementations of the same parameters. Our implementation is publicly available with the openSMILE toolkit. Comparative evaluations of the proposed feature set and large baseline feature sets of INTERSPEECH challenges show a high performance of the proposed set in relation to its size.
Dr. Waterson is a student of J. R. Firth, whose prosodic theory of linguistic analysis was largely ignored in the United States in the heyday of Structuralism and its successor, Generative Grammar. Today, with rare reference to Firthian prosodic theory as a historical precedent, nonlinear phonology has come to occupy center stage, a recent development that Waterson cites as one motive for presenting a collection of her own work. The volume is intended to suggest the theoretical value of the prosodic approach to an understanding of both children's phonological development and adult perceptual processing. Unfortunately, the work did not receive sufficiently critical editorial attention. The redundant presentation, within, as well as across, individual papers, does a disservice to Waterson's many original observations. Furthermore, the theoretical claims are overly strong in some cases and overly modest in others, leaving it to the reader to work out the real strengths of Waterson's maverick theoretical position. Briefly stated, in partially overlapping segments of five of the seven papers (numbers 2-6) Waterson provides a detailed and insightful analysis of her son's phonological development, from his first words to the beginnings of word combination. She casually notes--in contradiction to some of the prevailing theoretical positions of the 1970s--the child's progress from no system (each word treated idiosyncratically) to the beginnings of system and, concomitantly, from relative accuracy to increased distance from the adult target and increased homonymy, and she nicely illustrates the complex interrelationship of syntagmatic advances on the phonological and syntactic levels (5). Far less compelling is her insistence that she has conclusively traced the child's restricted production patterns to incomplete (skeletal) perception and, hence, reliance on "the most salient perceptual patterns."
Although she may well be right, the data are insufficient at any point to decide between perceptual and production constraints, and the argument, repeated in virtually every paper, will have a disturbingly circular ring to even the most sympathetic of readers. The extension of the same claim to adult perceptual processing (8) remains similarly inconclusive. The contribution of the volume to its apparent goal is thus bound to be less than hoped for, and its impact is unlikely to be great. Yet, this is unfortunate, because the data and their careful treatment deserve wider attention, while the claims regarding incomplete perceptual processing at the early stages cry out for experimental testing, both at the psycholinguistic and acoustic levels.
The "singing formant" is a high spectrum envelope peak near 2.8 kHz characteristic of vowel sounds produced in male Western opera and concert singing. An acoustical model of the vocal tract is capable of generating such a peak provided that three conditions are met: (1) The cross-sectional area in the pharynx must be at least six times wider than that of the larynx tube opening. If so, the larynx tube is acoustically mismatched with the rest of the vocal tract, and an extra formant is added to the vocal tract transfer function. (2) The sinus Morgagni must be wide in relation to the rest of the larynx tube. This may tune the frequency of the extra formant to a value between the frequencies of the third and fourth formants in normal speech. (3) The sinus piriformes must be wide. This reduces the frequency of the fifth formant to about 3 kHz. X-ray studies of a raised and lowered larynx showed that these three conditions may be fulfilled when the larynx is lowered. Thus, the larynx lowering, typical of male professional singing, seems to explain the "singing formant" and other formant frequency differences between normal speech and male professional singing.
In acoustic communication timing seems to be an exceedingly important aspect. The just noticeable difference (jnd) for small perturbations of an isochronous sequence of sounds is particularly important in music, in which such sequences frequently occur. This article reviews the literature in the area and presents an experiment designed to resolve some conflicting results in the literature regarding the tempo dependence for quick tempi and relevance of music experience. The jnd for a perturbation of the timing of a tone appearing in an isochronous sequence was examined by the method of adjustment. Thirty listeners of varied musical background were asked to adjust the position of the fourth tone in a sequence of six, such that they heard the sequence as perfectly isochronous. The tones were presented at a constant interonset time that was varied between 100 and 1000 ms. The absolute jnd was found to be approximately constant at 6 ms for tone interonset intervals shorter than about 240 ms and the relative jnd constant at 2.5% of the tone interonsets above 240 ms. Subjects' musical training did not affect these values. Comparison with previous work showed that a constant absolute jnd below 250 ms and constant relative jnd above 250 ms tend to appear regardless of the perturbation type, at least if the sequence is relatively short.
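The two-regime result reported in this abstract can be written as a small piecewise function. The sketch below is only an illustration of the stated figures (about 6 ms below roughly 240 ms, 2.5% of the interonset interval above it); the function name `jnd_ms` and its parameters are hypothetical, and the sharp breakpoint is a simplifying assumption, since the abstract describes both values as approximate.

```python
def jnd_ms(ioi_ms: float,
           breakpoint_ms: float = 240.0,   # approximate breakpoint from the abstract
           abs_jnd_ms: float = 6.0,        # approximate absolute jnd below the breakpoint
           rel_jnd: float = 0.025) -> float:  # relative jnd (2.5%) above the breakpoint
    """Approximate timing jnd (in ms) for a tone interonset interval (in ms).

    Hypothetical helper: a piecewise reading of the reported result,
    not the authors' own model or analysis code.
    """
    if ioi_ms <= 0:
        raise ValueError("interonset interval must be positive")
    if ioi_ms < breakpoint_ms:
        # Short intervals: the jnd is roughly constant in absolute terms.
        return abs_jnd_ms
    # Longer intervals: the jnd scales with the interval.
    return rel_jnd * ioi_ms


if __name__ == "__main__":
    # The experiment varied the interonset time between 100 and 1000 ms.
    for ioi in (100, 200, 240, 500, 1000):
        print(f"IOI {ioi:4d} ms -> jnd ~ {jnd_ms(ioi):.1f} ms")
```

The two regimes meet at the breakpoint, since 2.5% of 240 ms is 6 ms, which is consistent with the figures given in the abstract.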