Computer audition (i.e., intelligent audio) has made great strides in recent years; however, it is still far from achieving holistic hearing abilities, which more appropriately mimic human-like understanding. Within an audio scene, a human listener is quickly able to interpret layers of sound at a single time-point, with each layer varying in characteristics such as location, state, and trait. Currently, integrated machine listening approaches, on the other hand, will mainly recognise only single events. In this context, this contribution aims to provide key insights and approaches, which can be applied in computer audition to achieve the goal of a more holistic intelligent understanding system, as well as identifying challenges in reaching this goal. We firstly summarise the state-of-the-art in traditional signal-processing-based audio pre-processing and feature representation, as well as automated learning such as by deep neural networks. This concerns, in particular, audio interpretation, decomposition, understanding, as well as ontologisation. We then present an agent-based approach for integrating these concepts as a holistic audio understanding system. Based on this, concluding, avenues are given towards reaching the ambitious goal of ‘holistic human-parity’ machine listening abilities.
Biometric identification techniques such as photo-identification require an array of unique natural markings to identify individuals. From 1975 to present, Bigg’s killer whales have been photo-identified along the west coast of North America, resulting in one of the largest and longest-running cetacean photo-identification datasets. However, data maintenance and analysis are extremely time and resource consuming. This study transfers the procedure of killer whale image identification into a fully automated, multi-stage, deep learning framework, entitled FIN-PRINT. It is composed of multiple sequentially ordered sub-components. FIN-PRINT is trained and evaluated on a dataset collected over an 8-year period (2011–2018) in the coastal waters off western North America, including 121,000 human-annotated identification images of Bigg’s killer whales. At first, object detection is performed to identify unique killer whale markings, resulting in 94.4% recall, 94.1% precision, and 93.4% mean-average-precision (mAP). Second, all previously identified natural killer whale markings are extracted. The third step introduces a data enhancement mechanism by filtering between valid and invalid markings from previous processing levels, achieving 92.8% recall, 97.5%, precision, and 95.2% accuracy. The fourth and final step involves multi-class individual recognition. When evaluated on the network test set, it achieved an accuracy of 92.5% with 97.2% top-3 unweighted accuracy (TUA) for the 100 most commonly photo-identified killer whales. Additionally, the method achieved an accuracy of 84.5% and a TUA of 92.9% when applied to the entire 2018 image collection of the 100 most common killer whales. The source code of FIN-PRINT can be adapted to other species and will be publicly available.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.