The rise in demand for smart cities has drawn the interest of researchers to environmental sound classification. Most researchers aim to approach the Bayesian optimal error in the field of audio classification. However, it is difficult to interpret meaning directly from a raw one-dimensional audio signal, and this is where different types of spectrograms become effective. Using benchmark spectral features such as mel frequency cepstral coefficients (MFCCs), chromagram, log-mel spectrogram (LM), and so on, audio can be converted into meaningful 2D spectrograms. In this paper, we propose a convolutional neural network (CNN) model trained with additive angular margin loss (AAML), large margin cosine loss (LMCL), and A-softmax loss. These loss functions, originally proposed for face recognition, retain their value in other fields of study when implemented systematically. They outperform the conventional softmax loss in classification tasks because of their ability to increase intra-class compactness and inter-class discrepancy. Thus, with the MCAAM-Net, MCAS-Net, and MCLCM-Net models, classification accuracies of 99.60%, 99.43%, and 99.37%, respectively, are achieved on the UrbanSound8K dataset without any augmentation. This paper also demonstrates the benefit of stacking features together: the above-mentioned validation accuracies are achieved after stacking MFCCs and chromagram along the x-axis. We also visualize the clusters formed by the embedded vectors of the test data after passing it through the different proposed models, to further corroborate our results. Finally, we show that the MCAAM-Net model achieves an accuracy of 99.60% on the UrbanSound8K dataset, outperforming benchmark models introduced in recent years such as TSCNN-DS, ADCNN-5, and ESResNet-Attention.

Keywords: additive angular margin, angular softmax, chromagram, large margin cosine, mel frequency cepstral coefficient, smart city
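To make the margin-based losses concrete, the following is a minimal NumPy sketch of how an additive angular margin (AAML/ArcFace-style) modifies the softmax logits: embeddings and class weights are L2-normalized so logits become cosines, a margin m is added to the target class's angle, and the result is rescaled by s. The function name and the values s = 30 and m = 0.5 are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def additive_angular_margin_logits(embeddings, weights, labels, s=30.0, m=0.5):
    """Illustrative ArcFace-style logits; s (scale) and m (angular margin)
    are example values, not the hyperparameters used in the paper."""
    # Normalize embeddings (rows) and class weight vectors (columns)
    # so that the plain logits are cosine similarities.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = e @ w                                   # shape (batch, classes)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))    # angles in [0, pi]
    # Add the margin m only to each sample's ground-truth class angle,
    # which shrinks the target logit and forces tighter class clusters.
    margin = np.zeros_like(cos)
    margin[np.arange(len(labels)), labels] = m
    return s * np.cos(theta + margin)

# Tiny example: one embedding perfectly aligned with class 0.
emb = np.array([[1.0, 0.0]])
W = np.eye(2)                                     # columns = class weights
logits = additive_angular_margin_logits(emb, W, np.array([0]))
```

Because the margin penalizes the target angle, the target-class logit here is s·cos(0 + m) ≈ 26.33 rather than the unpenalized s·cos(0) = 30, which is exactly the extra separation pressure that softmax alone does not provide. The resulting logits would then be fed to an ordinary cross-entropy loss during training.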
| INTRODUCTION

Over the years, computers have been given increasingly tortuous tasks in the field of deep learning (DL), and surprisingly they have not disappointed, at times even surpassing human capabilities. Along with advancements in the field of computer vision (CV), audio recognition has also been thriving (Dang et al., 2020). However, speech recognition and music processing have been the focal points in the province of audio analysis (Piczak, 2015b). In the last few years, environmental sound classification (ESC) has been in the spotlight with the growth of smart city applications and