Robustness against temporal variations is important for emotion recognition from speech audio, since emotion is expressed through complex spectral patterns that can exhibit significant local dilation and compression along the time axis depending on speaker and context. To address this and potentially other tasks, we introduce the multi-time-scale (MTS) method, which adds flexibility with respect to temporal variations when analyzing time-frequency representations of audio data. MTS extends convolutional neural networks with convolution kernels that are scaled and re-sampled along the time axis, increasing temporal flexibility without increasing the number of trainable parameters compared to standard convolutional layers. We evaluate MTS and standard convolutional layers in different architectures for emotion recognition from speech audio, using four datasets of different sizes. The results show that MTS layers consistently improve the generalization of networks of varying capacity and depth compared to standard convolution, especially on smaller datasets.
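The abstract describes the mechanism but includes no implementation. As an illustration only, below is a minimal PyTorch sketch of the stated idea: a single trainable kernel is re-sampled along the time axis at several scales, and the per-scale responses are merged, so the layer gains temporal flexibility with the same parameter count as a standard convolution. The class name `MTSConv2d`, the scale set, and the element-wise maximum across scales are assumptions for illustration, not the authors' published code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTSConv2d(nn.Module):
    """Sketch of a multi-time-scale convolution (assumed design): one
    trainable kernel, re-sampled along the time axis at several scales,
    so the parameter count matches a standard conv layer."""

    def __init__(self, in_ch, out_ch, kernel_size=(3, 3), scales=(0.5, 1.0, 2.0)):
        super().__init__()
        self.scales = scales
        # Single shared kernel: no extra parameters per scale.
        self.weight = nn.Parameter(0.01 * torch.randn(out_ch, in_ch, *kernel_size))
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x):  # x: (batch, channels, freq, time)
        k_f, k_t = self.weight.shape[2:]
        outs = []
        for s in self.scales:
            # Re-sample the kernel along time only; keep the length odd
            # so "same" padding preserves output size across scales.
            new_t = max(1, int(round(k_t * s)))
            new_t += 1 - new_t % 2
            w = F.interpolate(self.weight, size=(k_f, new_t),
                              mode="bilinear", align_corners=True)
            outs.append(F.conv2d(x, w, self.bias,
                                 padding=(k_f // 2, new_t // 2)))
        # Combine scale responses; an element-wise maximum is one
        # plausible choice (an assumption, not confirmed by the abstract).
        return torch.stack(outs, 0).max(0).values

# Usage on a batch of hypothetical mel spectrograms:
layer = MTSConv2d(1, 16)
spec = torch.randn(8, 1, 64, 128)  # (batch, channel, freq, time)
out = layer(spec)                  # -> (8, 16, 64, 128)
```

Because the scaled kernels are produced by differentiable interpolation of the one shared kernel, gradients flow back to a single weight tensor, which is what keeps the parameter count equal to that of a standard convolutional layer.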
Visual-to-auditory sensory substitution devices (SSDs) provide improved access to the visual environment for the visually impaired by converting images into auditory information. Research is lacking on the mechanisms involved in processing data that is perceived through one sensory modality but directly associated with a source in a different sensory modality. This is important because SSDs that use auditory displays could involve binaural presentation, requiring both ear canals, or monaural presentation, requiring only one; but which ear would be ideal? SSDs may be similar to reading, as an image (a printed word) is converted into sound (when read aloud). Reading, and language more generally, are typically lateralised to the left cerebral hemisphere. Yet, unlike symbolic written language, SSDs convert images to sound based on visuospatial properties, with the right cerebral hemisphere potentially having a role in processing such visuospatial data. Here we investigated whether there is a hemispheric bias in the processing of visual-to-auditory sensory substitution information, and whether that bias varies as a function of experience and visual ability. We assessed the lateralisation of auditory processing with two tests: a standard dichotic listening test and a novel dichotic listening test created using the auditory information produced by an SSD, The vOICe. Participants were tested either in the lab or online with the same stimuli. We did not find a hemispheric bias in the processing of visual-to-auditory information in visually impaired, experienced vOICe users. Further, we did not find any difference between visually impaired, experienced vOICe users and sighted novices in the hemispheric lateralisation of visual-to-auditory information processing. Although standard dichotic listening is lateralised to the left hemisphere, the auditory processing of images in SSDs is bilateral, possibly due to the increased influence of right-hemisphere processing. Auditory SSDs might therefore be equally effective with presentation to either ear if monaural, rather than binaural, presentation were necessary.