Recently, deep learning classifiers have proven even more robust than texture analysis techniques for pattern recognition and classification. With the broad availability of relatively inexpensive Graphics Processing Units (GPUs), many researchers have begun applying deep learning techniques to visual representations of acoustic traces. Preselected or handcrafted descriptors, such as LBP, are not necessary for deep learners since they learn salient features during the training phase. Deep learners, moreover, are well suited to handling visual representations of audio because many of the best-known deep classifiers, such as Convolutional Neural Networks (CNNs), take matrices as their input. Humphrey and Bello [17, 18] were among the first to apply CNNs to audio images for music classification and, in doing so, redefined the state of the art in automatic chord detection and recognition. In the same year, Nakashika et al. [19] reported converting spectrograms to GLCM maps to train CNNs to perform music genre classification on the GTZAN dataset [20]. Later, Costa et al. [21] fused a CNN with the traditional pattern recognition framework of training SVMs on LBP features to classify the LMD dataset. These works exceeded traditional classification results on these genre datasets. Up to this point, most work in audio classification has applied the latest advances in machine learning to the problem of sound classification and recognition without modifying the classification process to make it singularly suitable for sound recognition. An early exception to this generic approach is found in the work of Sigtia and Dixon [22], who adjusted CNN parameters and structures so as to reduce the time it took to train a set of audio images. Time reduction was accomplished by replacing
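To make the spectrogram-to-GLCM idea attributed to [19] concrete, the following is a minimal sketch (not the cited authors' implementation) of how a log-mel spectrogram might be quantised and summarised with grey-level co-occurrence statistics; the file name and all parameter values are illustrative assumptions.

```python
# Hypothetical sketch: GLCM texture features computed from a spectrogram image.
# The audio file name, mel settings, offsets and angles are illustrative only.
import numpy as np
import librosa
from skimage.feature import graycomatrix, graycoprops  # 'greycomatrix' in skimage < 0.19

y, sr = librosa.load("bird_call.wav", sr=22050)

# Log-mel spectrogram treated as a 2-D grey-level "image" of the signal.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)

# Quantise to 8-bit grey levels so a co-occurrence matrix can be built.
img = np.uint8(255 * (S_db - S_db.min()) / (S_db.max() - S_db.min() + 1e-9))

# Co-occurrence matrix over a few offsets and orientations.
glcm = graycomatrix(img, distances=[1, 2], angles=[0, np.pi / 2],
                    levels=256, symmetric=True, normed=True)

# Scalar texture statistics that could feed a downstream classifier.
features = np.hstack([graycoprops(glcm, p).ravel()
                      for p in ("contrast", "homogeneity", "energy", "correlation")])
print(features.shape)
```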
Image-based identification of animals typically relies on their visual appearance, but images can also be used to identify animals by representing the sounds they make. In this study, the authors present a novel and effective approach for the automated identification of birds and whales using some of the best texture descriptors in the computer vision literature. The visual features of sounds are built from the audio file and are taken from images constructed from different spectrograms and from harmonic and percussive images. These images are divided into sub-windows from which sets of texture descriptors are extracted. The experiments reported in this study, using a dataset of bird vocalisations targeted for species recognition and a dataset of right whale calls targeted for whale detection (as well as three well-known benchmarks for music genre classification), demonstrate that fusing different texture features enhances performance. The experiments also demonstrate that fusing texture features with audio features is not only comparable with existing audio signal approaches but also statistically improves on some of the stand-alone audio features. The code for the experiments will be publicly available at https://www.dropbox.com/s/bguw035yrqz0pwp/ElencoCode.docx?dl=0.
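The pipeline described above (spectrogram plus harmonic and percussive images, sub-windows, texture descriptors, fusion) can be sketched roughly as follows. This is a minimal illustration under assumed settings, not the authors' exact method: the file name, number of sub-windows, LBP parameters, and the use of plain concatenation as the fusion step are all hypothetical choices.

```python
# Minimal sketch of the described pipeline: build a spectrogram image plus
# harmonic/percussive images, split each into sub-windows, extract LBP
# histograms, and fuse them by concatenation. All settings are illustrative.
import numpy as np
import librosa
from skimage.feature import local_binary_pattern

def to_image(y, sr):
    """Log-mel spectrogram rescaled to an 8-bit grey-level image."""
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128),
                            ref=np.max)
    return np.uint8(255 * (S - S.min()) / (S.max() - S.min() + 1e-9))

def lbp_histogram(window, P=8, R=1):
    """Uniform LBP code histogram of one sub-window."""
    codes = local_binary_pattern(window, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def texture_descriptor(img, n_splits=3):
    """Concatenate LBP histograms from sub-windows taken along the time axis."""
    return np.hstack([lbp_histogram(w) for w in np.array_split(img, n_splits, axis=1)])

y, sr = librosa.load("right_whale_call.wav", sr=22050)   # hypothetical input file
y_h, y_p = librosa.effects.hpss(y)                       # harmonic / percussive parts

# "Fusion" here is plain concatenation of the three per-image descriptors.
feature_vector = np.hstack([texture_descriptor(to_image(s, sr)) for s in (y, y_h, y_p)])
print(feature_vector.shape)
```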