Following their sweeping success in vision and language tasks, pure attention-based neural architectures (e.g., DeiT) [1] are rising to the top of audio tagging (AT) leaderboards [2], seemingly rendering traditional convolutional neural networks (CNNs), feed-forward networks, and recurrent networks obsolete. A closer look, however, reveals great variability in published research: the performance of models initialized with pretrained weights differs drastically from that of models trained from scratch [2], training time for a model ranges from hours to weeks, and the essential factors are often hidden in seemingly trivial details. This urgently calls for a comprehensive study, since our first comparison [3] is now half a decade old. In this work, we perform extensive experiments on AudioSet [4], the largest weakly labeled sound event dataset available, and also analyze data quality and efficiency. We compare several state-of-the-art baselines on the AT task and study the performance and efficiency of two major categories of neural architectures: CNN variants and attention-based variants. We also closely examine their optimization procedures. Our open-sourced experimental results provide insights into the trade-offs among performance, efficiency, and the optimization process, for both practitioners and researchers.