Although deep learning models are the state of the art in audio classification, they fall short when applied in developmental robotic settings and human–robot interaction (HRI). The major drawback is that deep learning relies on supervised training with large amounts of data and annotations. In contrast, developmental learning strategies in HRI often deal with small-scale data acquired from HRI experiments and require the incremental addition of novel classes. Alternatively, shallow learning architectures, such as simple distance-metric-based classifiers and neural architectures implementing the reservoir computing paradigm, enable fast yet robust learning. Similarly, continual learning algorithms have received increasing attention in recent years, as they can integrate stable perceptual feature extraction using pre-trained deep learning models with open-set classification. As our research centers on reenacting the incremental learning of audio cues, we conducted a study on environmental sound classification using the iCaRL and GDumb continual learning algorithms, comparing them with a classifier popular in this domain, the kNN classifier, and with an Echo State Network. We contrast our results with those obtained from a VGGish network, which serves here as the performance upper bound and allows us to quantify the performance differences and discuss current issues with continual learning in the audio domain. As little is known about using shallow models or continual learning in the audio domain, we forgo additional techniques such as data augmentation and create a simple experimental pipeline that is easy to reproduce. Although the selected algorithms partly fall short of the upper bound, our evaluation on three environmental sound datasets shows promising performance using continual learning on a subset of the DCASE2019 challenge dataset and on the ESC10 dataset. As we do not address benchmarking in this paper, our study provides a good foundation for further research and computational improvements on shallow and continual learning models for robotic applications in the audio domain.
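To make the reservoir computing baseline named above concrete, the following is a minimal sketch of an Echo State Network classifier with a ridge-regression readout, applied to sequences of audio feature frames. This is an illustrative example rather than the implementation evaluated in the study; the hyperparameters, the 40-dimensional feature frames, and the toy data are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class EchoStateNetwork:
    """Minimal leaky-integrator ESN with a ridge-regression readout (illustrative sketch)."""

    def __init__(self, n_in, n_res=200, spectral_radius=0.9, leak=0.3, ridge=1e-3):
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))      # fixed random input weights
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))              # fixed random recurrent weights
        self.W = W * (spectral_radius / np.max(np.abs(np.linalg.eigvals(W))))
        self.leak, self.ridge, self.W_out = leak, ridge, None

    def _final_state(self, frames):
        # frames: (T, n_in) sequence of feature frames (e.g. log-mel vectors per audio frame)
        x = np.zeros(self.W.shape[0])
        for u in frames:
            x = (1 - self.leak) * x + self.leak * np.tanh(self.W_in @ u + self.W @ x)
        return x

    def fit(self, sequences, labels):
        S = np.stack([self._final_state(s) for s in sequences])   # (N, n_res) reservoir states
        Y = np.eye(int(np.max(labels)) + 1)[labels]                # one-hot class targets
        A = S.T @ S + self.ridge * np.eye(S.shape[1])
        self.W_out = np.linalg.solve(A, S.T @ Y)                   # closed-form ridge readout
        return self

    def predict(self, sequences):
        S = np.stack([self._final_state(s) for s in sequences])
        return np.argmax(S @ self.W_out, axis=1)

# Toy usage: random "feature sequences" stand in for real environmental sound features.
train_X = [rng.normal(size=(100, 40)) for _ in range(20)]
train_y = np.array([i % 5 for i in range(20)])
esn = EchoStateNetwork(n_in=40).fit(train_X, train_y)
print(esn.predict(train_X[:5]))
```

Only the readout is trained, which is what makes this family of models attractive for the fast, small-data learning setting described in the abstract.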
One of the fundamental prerequisites for effective collaboration between interactive partners is the mutual sharing of attentional focus on the same perceptual events, referred to as joint attention. Its defining elements have been widely pinpointed in the psychological, cognitive, and social sciences. The field of human-robot interaction has also extensively exploited joint attention, which has been identified as a fundamental prerequisite for proficient human-robot collaboration. However, joint attention between robots and human partners is often encoded in predefined robot behaviours that do not fully address the dynamics of interactive scenarios. We provide autonomous attentional behaviour for robots based on multi-sensory perception that robustly relocates the focus of attention onto the same targets the human partner attends to. Further, we investigated how such joint attention between a human and a robot partner improved with a new biologically inspired, memory-based attention component. We assessed the model with the humanoid robot iCub performing a joint task with a human partner in a real-world unstructured scenario. The model showed robust performance in capturing the stimulus, making a localisation decision within the right time frame, and then executing the right action. We then compared the attention performance of the robot against human performance when stimulated by the same source across different modalities (audio-visual and audio-only). The comparison showed that the model behaves with temporal dynamics compatible with those of humans, providing an effective solution for memory-based joint attention in real-world unstructured environments. Further, we analysed the localisation performance (reaction time and accuracy); the results showed that the robot performed better in the audio-visual condition than in the audio-only condition. The performance of the robot in the audio-visual condition was broadly comparable to the behaviour of the human participants, whereas it was less efficient in audio-only localisation. After a detailed analysis of the internal components of the architecture, we conclude that the differences in performance are due to ego-noise, which significantly degrades audio-only localisation performance.
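To illustrate the general idea of memory-based multi-sensory attention described above, the sketch below accumulates audio and visual evidence in an egocentric saliency map over azimuth and lets it decay over time, triggering an orienting action only when the remembered evidence is strong enough. This is not the authors' architecture; the bin resolution, decay, per-modality weights, threshold, and function names are all hypothetical.

```python
import numpy as np

# Hypothetical parameters (not from the paper): 1-degree azimuth bins,
# exponential memory decay, fixed per-modality weights, and a saliency threshold.
AZIMUTHS = np.arange(-90, 91)          # egocentric azimuth bins (degrees)
DECAY = 0.9                            # memory decay applied at every time step
WEIGHTS = {"audio": 0.6, "vision": 1.0}
THRESHOLD = 1.2                        # saliency required to trigger a gaze shift

saliency = np.zeros_like(AZIMUTHS, dtype=float)   # memory-based saliency map

def add_observation(modality, azimuth_deg, sigma_deg=10.0):
    """Accumulate a Gaussian blob of evidence around an estimated source direction."""
    global saliency
    blob = np.exp(-0.5 * ((AZIMUTHS - azimuth_deg) / sigma_deg) ** 2)
    saliency += WEIGHTS[modality] * blob

def step():
    """Decay the memory and return a target azimuth if the peak saliency is strong enough."""
    global saliency
    saliency *= DECAY
    peak = int(np.argmax(saliency))
    if saliency[peak] >= THRESHOLD:
        return int(AZIMUTHS[peak])     # azimuth to orient head/gaze towards
    return None

# Toy usage: audio and visual cues roughly agreeing on a target near +30 degrees.
add_observation("audio", 28.0)
add_observation("vision", 31.0)
print(step())   # -> 30 (combined evidence exceeds the threshold)
```

With fused audio-visual evidence the threshold is crossed sooner and more precisely than with a single weaker audio cue, which mirrors the qualitative finding reported in the abstract that audio-visual localisation outperforms audio-only localisation.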