Significance: In speech, social evaluations of a speaker's dominance or trustworthiness are conveyed by distinctive but little-understood pitch variations. This work describes how to combine state-of-the-art vocal pitch transformations with the psychophysical technique of reverse correlation, and uses this methodology to uncover the prosodic prototypes that govern such social judgments in speech. This is significant because the exact shape of these prototypes, and how they vary with sex, age, and culture, is virtually unknown, and because prototypes derived with the method can be reapplied to arbitrary spoken utterances, providing a principled way to modulate personality impressions in speech.
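To make the reverse-correlation logic concrete, here is a minimal sketch of a first-order analysis of the kind described above, assuming a two-interval task in which each trial presents two randomly pitch-manipulated versions of an utterance and the listener picks the one that sounds, say, more dominant. All names, array shapes, and the simulated "listener" are illustrative, not the authors' actual pipeline or data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_segments = 500, 6

# Random pitch manipulations: Gaussian shifts (in cents) per ~100-ms segment,
# two intervals per trial (hypothetical design).
profiles = rng.normal(0.0, 100.0, size=(n_trials, 2, n_segments))

# Simulated choices, driven here by a hidden "rising-end" template for demonstration only.
template = np.linspace(-1.0, 1.0, n_segments)
scores = profiles @ template                      # (n_trials, 2) similarity to the template
choices = scores.argmax(axis=1)                   # index of the chosen interval on each trial

chosen = profiles[np.arange(n_trials), choices]
rejected = profiles[np.arange(n_trials), 1 - choices]

# First-order kernel ("classification image"): mean chosen minus mean rejected profile.
kernel = chosen.mean(axis=0) - rejected.mean(axis=0)
print(np.round(kernel, 1))                        # recovers the rising-end shape
```

The resulting kernel is a pitch contour with one value per segment; it approximates the listener's internal prototype and is the kind of object that could then be scaled and reapplied to new utterances, as described above.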
Abstract: We present a computational model of musical instrument sounds that focuses on capturing the dynamic behavior of the spectral envelope. A set of spectro-temporal envelopes belonging to different notes of each instrument is extracted by means of sinusoidal modeling and subsequent frequency interpolation, before being subjected to principal component analysis. The prototypical evolution of the envelopes in the obtained reduced-dimensional space is modeled as a nonstationary Gaussian process. This results in a compact representation in the form of a set of prototype curves in feature space, or equivalently of prototype spectro-temporal envelopes in the time-frequency domain. Finally, the obtained models are successfully evaluated in the context of two music content analysis tasks: classification of instrument samples and detection of instruments in monaural polyphonic mixtures.
Index Terms: Gaussian processes, music information retrieval (MIR), sinusoidal modeling, spectral envelope, timbre model.
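As a rough illustration of the dimensionality-reduction step described in this abstract (not the authors' exact pipeline), the sketch below takes already-extracted spectro-temporal envelopes, projects all frames with PCA, and summarizes the per-note trajectories with a frame-wise mean and standard deviation as a simple stand-in for the full nonstationary Gaussian-process model. The envelope matrices are random placeholders; real input would come from sinusoidal modeling and frequency interpolation.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical input: spectro-temporal envelopes for 10 notes of one instrument,
# each a (n_frames, n_freq_bins) matrix of interpolated partial amplitudes (dB).
rng = np.random.default_rng(1)
n_notes, n_frames, n_bins = 10, 120, 40
envelopes = [rng.random((n_frames, n_bins)) for _ in range(n_notes)]   # placeholder data

# Stack all frames and reduce each frame to a low-dimensional feature vector.
frames = np.vstack(envelopes)                     # (n_notes * n_frames, n_bins)
pca = PCA(n_components=5)
coeffs = pca.fit_transform(frames)                # every frame becomes a 5-D point

# Per-note trajectories in the reduced space (placeholder notes share one length;
# the actual model handles variable-length notes with a Gaussian process).
trajectories = coeffs.reshape(n_notes, n_frames, -1)
prototype_mean = trajectories.mean(axis=0)        # (n_frames, 5) prototype curve
prototype_std = trajectories.std(axis=0)          # frame-wise spread around it
print(prototype_mean.shape, prototype_std.shape)
```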
Over the past few years, the field of visual social cognition and face processing has been dramatically impacted by a series of data-driven studies employing computer-graphics tools to synthesize arbitrary meaningful facial expressions. In the auditory modality, reverse correlation is traditionally used to characterize sensory processing at the level of spectral or spectro-temporal stimulus properties, but not higher-level cognitive processing of, e.g., words, sentences, or music, for lack of tools able to manipulate the stimulus dimensions that are relevant for these processes. Here, we present an open-source audio-transformation toolbox, called CLEESE, able to systematically randomize the prosody/melody of existing speech and music recordings. CLEESE works by cutting recordings into small successive time segments (e.g., every 100 milliseconds in a spoken utterance) and applying a random parametric transformation to each segment's pitch, duration, or amplitude, using a new Python-language implementation of the phase-vocoder digital audio technique. We present two applications of the tool to generate stimuli for studying intonation processing of interrogative vs. declarative speech, and rhythm processing of sung melodies.
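To illustrate the kind of manipulation CLEESE performs (this is not its actual API, only the underlying idea under assumed parameters), the sketch below draws one random pitch value per 100-ms segment and applies it with a phase-vocoder-based pitch shifter, here librosa's, to a hypothetical input file.

```python
import numpy as np
import librosa

# Illustrative sketch (not the CLEESE API): one random pitch shift per 100-ms segment,
# applied with a phase-vocoder-based pitch shifter.
y, sr = librosa.load("utterance.wav", sr=None)     # hypothetical input file
seg_len = int(0.1 * sr)                            # 100-ms segments
rng = np.random.default_rng(42)

out = []
for start in range(0, len(y), seg_len):
    seg = y[start:start + seg_len]
    shift_cents = rng.normal(0.0, 100.0)           # e.g. SD of 100 cents per segment
    shifted = librosa.effects.pitch_shift(seg, sr=sr, n_steps=shift_cents / 100.0)
    out.append(shifted)

y_random = np.concatenate(out)
# Note: CLEESE interpolates the per-segment values into a smooth breakpoint function
# before transforming the whole recording, which avoids the segment-boundary
# artifacts this naive segment-by-segment concatenation would produce.
```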
[...] facial metrics such as width-to-height ratio, acoustical features such as mean pitch) are posited by the experimenter before being controlled or tested experimentally, which may create a variety of confirmation biases or experimental demands. For instance, stimuli constructed to display western facial expressions of happiness or sadness may well be recognized as such by non-western observers [1], but may not be the way these emotions are spontaneously produced, or internally represented, in such cultures [2]. Similarly, in auditory cognition, musical stimuli recorded by experts pressed to express emotions in music may do so by mimicking expressive cues used in speech, but these cues may not exhaust the many other ways in which arbitrary music can express emotions [3]. For all these reasons, in recent years, a series of powerful data-driven [...] systematically-varied stimuli [5].

The reverse correlation technique was first introduced in neurophysiology to characterize neuronal receptive fields of biological systems with so-called "white noise analysis" [6-9]. In psychophysics, the technique was then adapted to characterize human sensory processes, taking behavioral choices (e.g., yes/no responses) instead of neuronal spikes as the systems' output variables, to study, e.g. in the auditory domain, the detection of tones in noise [10] or loudness weighting in tones and noise ([11]; see [4] for a review of similar applications in vision). In the visual domain, these techniques have been extended in recent years to address not only low-level sensory processes, but higher-level cognitive mechanisms in humans: facial recognition [12], emotional expressions [2, 13], social traits [14], as well as [...]
Sparsity-based source separation algorithms often rely on a transformation into a sparse domain to improve mixture disjointness and thereby facilitate separation. To this end, the most commonly used time-frequency representation has been the Short-Time Fourier Transform (STFT). The purpose of this paper is to study the use of auditory-based representations instead of the STFT. We first evaluate the STFT's disjointness properties for speech and music signals, and show that auditory representations based on the Equivalent Rectangular Bandwidth (ERB) and Bark frequency scales can improve the disjointness of the transformed mixtures.
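One simple way to quantify disjointness (not necessarily the exact metric used in this paper) is to ask how much of each source's energy survives an ideal binary mask built from the sources' time-frequency representations. The sketch below does this on the STFT; swapping in an ERB- or Bark-scale filterbank in place of `stft` would give the comparison the paper is interested in. Signals and parameters are placeholders.

```python
import numpy as np
from scipy.signal import stft

def disjointness(x1, x2, fs, nperseg=1024):
    """Fraction of each source's energy kept by the ideal binary mask that
    assigns each time-frequency cell to whichever source dominates it."""
    _, _, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)
    mask1 = np.abs(X1) > np.abs(X2)               # cells where source 1 dominates
    e1, e2 = np.abs(X1) ** 2, np.abs(X2) ** 2
    psr1 = e1[mask1].sum() / e1.sum()             # preserved-energy ratio, source 1
    psr2 = e2[~mask1].sum() / e2.sum()            # preserved-energy ratio, source 2
    return psr1, psr2

# Placeholder usage; in practice x1 and x2 would be speech or music excerpts.
fs = 16000
t = np.arange(fs) / fs
x1 = np.sin(2 * np.pi * 220 * t)                  # tonal "source"
x2 = np.random.default_rng(0).standard_normal(fs) # noise-like "source"
print(disjointness(x1, x2, fs))                   # values near 1.0 indicate high disjointness
```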