The area of sound field synthesis has advanced significantly in the past decade, facilitated by the development of high-quality sound-field capturing and re-synthesis systems. Spherical microphone arrays are among the most recently developed systems for sound field capturing, enabling processing and analysis of three-dimensional sound fields in the spherical harmonics domain. In spite of these developments, a clear relation between sound fields recorded by spherical microphone arrays and their perception with a re-synthesis system has not yet been established, although some relation to scalar measures of spatial perception was recently presented. This paper presents an experimental study of spatial sound perception with the use of a spherical microphone array for sound recording and headphone-based binaural sound synthesis. Sound field analysis and processing are performed in the spherical harmonics domain with the use of head-related transfer functions and simulated enclosed sound fields. The effects of several factors, such as spherical harmonics order, frequency bandwidth, and spatial sampling, are investigated by applying the repertory grid technique to the results of the experiment, forming a clearer relation between sound-field capture with a spherical microphone array and its perception using binaural synthesis with regard to space, frequency, and additional artifacts. The experimental study clearly shows that a source is perceived as spatially sharper and more externalized when represented by a binaural stimulus reconstructed with a higher spherical harmonics order; this effect is already apparent at low spherical harmonics orders. Spatial aliasing, a result of capturing the sound field with a finite number of microphones, introduces unpleasant artifacts that increase with the degree of aliasing error.
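The interplay between spherical harmonics order, array size, and spatial aliasing described above can be illustrated with a minimal sketch. The following Python snippet is an illustration only, not the paper's processing chain: the coefficient names (anm, hnm_left, hnm_right), the flat coefficient indexing, and the 4.2 cm example radius are assumptions. It reconstructs order-limited ear signals from SH-domain sound-field and HRTF coefficients and estimates the frequency above which spatial aliasing becomes significant, using the common rule of thumb kr ≈ N.

```python
import numpy as np

def binaural_from_sh(anm, hnm_left, hnm_right, order):
    """Order-limited binaural reconstruction from SH-domain coefficients.

    anm: sound-field coefficients; hnm_*: SH-domain HRTF coefficients,
    all assumed to be flat arrays of length >= (order + 1)**2.
    Depending on the SH convention, a complex conjugate of the HRTF
    coefficients may be required; it is omitted here for brevity.
    """
    n_coeffs = (order + 1) ** 2              # number of SH coefficients up to `order`
    a = anm[:n_coeffs]
    p_left = np.sum(a * hnm_left[:n_coeffs])
    p_right = np.sum(a * hnm_right[:n_coeffs])
    return p_left, p_right

def aliasing_frequency(order, radius, c=343.0):
    """Frequency (Hz) above which spatial aliasing becomes significant,
    from kr ≈ N for a rigid-sphere array of radius `radius` (metres)."""
    return order * c / (2 * np.pi * radius)

# e.g. an order-4 array of radius 4.2 cm aliases above roughly 5.2 kHz
print(aliasing_frequency(order=4, radius=0.042))
```

Increasing the order raises this aliasing limit, but each additional order requires more microphones, since an order-N representation needs at least (N+1)² sampling points.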
This paper is concerned with machine localisation of multiple active speech sources in reverberant environments using two (binaural) microphones. Such conditions typically present a problem for 'classical' binaural models. Inspired by the human ability to utilise head movements, the present study investigated the influence of different head-movement strategies on binaural sound localisation. A machine-hearing system that exploits a multi-step head rotation strategy for sound localisation was found to produce the best performance in a simulated reverberant acoustic space. This paper also reports the public release of a free database of binaural room impulse responses (BRIRs) that allows simulation of the head rotations used in this study.
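As a rough illustration of why head rotation helps, the sketch below shows how a single head turn can resolve a front-back confusion. It is a simplified assumption-laden example, not the paper's system: measure_azimuth and rotate_head are hypothetical stand-ins for an ITD/ILD-based localiser that only returns a front/back-ambiguous head-relative azimuth and for a motor command, respectively.

```python
def ang_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def resolve_front_back(measure_azimuth, rotate_head, turn=30.0):
    """Disambiguate front/back by comparing estimates before and after a head turn.

    measure_azimuth() is assumed to return a head-relative azimuth (degrees)
    that is ambiguous between theta and 180 - theta.
    """
    theta0 = measure_azimuth()                                   # head facing 0 deg (world frame)
    hyps = [theta0 % 360.0, (180.0 - theta0) % 360.0]            # two world-frame hypotheses

    rotate_head(turn)                                            # turn the head by `turn` degrees
    theta1 = measure_azimuth()
    new_hyps = [(turn + theta1) % 360.0, (turn + 180.0 - theta1) % 360.0]

    # the true source direction is the hypothesis that stays consistent across the turn
    return min(hyps, key=lambda h: min(ang_diff(h, n) for n in new_hyps))
```

A multi-step strategy extends this idea by repeating the measure-rotate-measure cycle, accumulating evidence across several head orientations before committing to an estimate.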
Recent advances in transformer-based architectures that are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have shown limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the arousal, dominance, and valence dimensions of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without the use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Furthermore, our investigations reveal that transformer-based architectures are more robust to small perturbations compared with a CNN-based baseline and fair with respect to biological sex groups, but not towards individual speakers. Finally, we are the first to show that their extraordinary success on valence is based on implicit linguistic information learnt during fine-tuning of the transformer layers, which explains why they perform on par with recent multimodal approaches that explicitly utilise textual information. Our findings collectively paint the following picture: transformer-based architectures constitute the new state of the art in SER, but further advances are needed to mitigate remaining issues with robustness and individual speakers. To make our findings reproducible, we release the best-performing model to the community.
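A minimal sketch of the kind of setup described, assuming a wav2vec 2.0 encoder with mean pooling, a three-dimensional regression head for arousal, dominance and valence, and a CCC-based loss. This is not the released model or its exact configuration; the backbone checkpoint, pooling scheme, output ordering, and head are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class EmotionRegressor(nn.Module):
    """wav2vec 2.0 backbone with a 3-dim regression head (illustrative only)."""

    def __init__(self, backbone="facebook/wav2vec2-large-robust"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(backbone)
        self.head = nn.Linear(self.encoder.config.hidden_size, 3)

    def forward(self, waveform):
        hidden = self.encoder(waveform).last_hidden_state   # (batch, frames, dim)
        pooled = hidden.mean(dim=1)                          # average pooling over time
        return self.head(pooled)                             # assumed order: arousal, dominance, valence

def ccc_loss(pred, target, eps=1e-8):
    """1 - CCC, averaged over the three emotion dimensions."""
    pred_m, tgt_m = pred.mean(0), target.mean(0)
    pred_v, tgt_v = pred.var(0, unbiased=False), target.var(0, unbiased=False)
    cov = ((pred - pred_m) * (target - tgt_m)).mean(0)
    ccc = 2 * cov / (pred_v + tgt_v + (pred_m - tgt_m) ** 2 + eps)
    return 1 - ccc.mean()
```

The CCC loss mirrors the evaluation metric reported in the abstract: it rewards predictions that match the targets in both correlation and scale, which plain MSE does not.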