“…In recent years, we have witnessed great progress in the acoustic scene classification (ASC) task, as demonstrated by the high participation in the IEEE Detection and Classification of Acoustic Scenes and Events (DCASE) challenges [1,2,3]. Top ASC systems use deep neural networks (DNNs), and the main ingredient of their success is the application of deep convolutional neural networks (CNNs) [4,5,6,7,8,9]. A further boost in ASC performance has been obtained by introducing advanced deep learning techniques, such as attention mechanisms [10,11,12], mix-up [13,14], data augmentation based on generative adversarial networks (GANs) and variational autoencoders (VAEs) [15,16], and deep feature learning [17,18,19,20].…”