“…Convolutional neural network (CNNs) [6,7,8,5,4,9,10,11] are most commonly used deep learning techniques, thanks to their feature learning capability. Leveraging the nature of audio signals, sequential modelling with recurrent neural networks (RNNs) [2,12,13] and temporal transformer networks [10] have also demonstrated results on par with the convolutional counterparts. The deep learning models were trained either on time-frequency representations, such as log Mel-scale spectrograms [4,11], or highlevel features like label tree embedding features [6,2].…”