2022 · Preprint
DOI: 10.48550/arxiv.2211.04772

Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

Abstract: Audio Spectrogram Transformer models rule the field of Audio Tagging, outrunning previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CNNs. We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers. The proposed training…
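The abstract's core idea — distilling a large transformer teacher into a compact CNN student using teacher predictions precomputed offline — can be illustrated with a minimal loss sketch. This is not the authors' released code; the function name, the lambda weighting, and the use of binary cross-entropy against sigmoid teacher probabilities are assumptions chosen to match common multi-label KD practice.

```python
import torch
import torch.nn.functional as F

def kd_tagging_loss(student_logits, teacher_logits, labels, lam=0.1):
    """Illustrative offline KD objective for multi-label audio tagging.

    student_logits: (batch, n_classes) raw outputs of the CNN student
    teacher_logits: (batch, n_classes) transformer logits, precomputed offline
    labels:         (batch, n_classes) multi-hot ground-truth labels
    lam:            weight of the hard-label term (hypothetical value)
    """
    # Hard-label term: binary cross-entropy against the ground truth.
    label_loss = F.binary_cross_entropy_with_logits(student_logits, labels)
    # Distillation term: match the teacher's per-class probabilities
    # (soft targets obtained by applying a sigmoid to the stored logits).
    kd_loss = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits))
    return lam * label_loss + (1.0 - lam) * kd_loss
```

Because the teacher logits are stored ahead of time, the expensive transformer never runs during student training — only the lightweight CNN forward/backward pass remains in the loop.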

Cited by 2 publications (3 citation statements) · References 14 publications

“…In conclusion, the Transformer model had the following advantages over the convolutional models: firstly, it constructed long-distance feature relationships, and the Transformer used the attention mechanism to obtain the contextual information of the feature map, which made up for the slow process of expanding from the local features to the global features through layer-by-layer down sampling operations [60, 61, 62]. Secondly, it had the capability of multimodal fusion.…”
Section: Related Work (mentioning)
confidence: 99%

“…We extract sound tags to provide more context. We use an audio tagging model (Schmid et al., 2022) to classify the entire audio stream. We select the top 3 predicted tags that have a higher confidence value than the threshold (0.3).…”
Section: Visual Descriptions and Utterances (Chronologically) (mentioning)
confidence: 99%
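
The tag-selection rule quoted above (top 3 tags, confidence threshold 0.3) is simple enough to sketch directly. A minimal illustration, assuming the tagging model emits per-class sigmoid probabilities; the function name and tensor shapes are hypothetical and not taken from the cited pipeline:

```python
import torch

def select_tags(probs, class_names, k=3, threshold=0.3):
    """Keep the k highest-scoring tags whose confidence exceeds the threshold.

    probs:       1-D tensor of per-class probabilities for the whole audio stream
    class_names: list mapping class index -> human-readable tag
    """
    scores, indices = torch.topk(probs, k)
    return [(class_names[i], score.item())
            for score, i in zip(scores, indices)
            if score.item() > threshold]
```

With k=3 and threshold=0.3 this reproduces the described behavior: at most three tags survive, and any low-confidence predictions among them are dropped.
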
“…Video-to-Text Prompting. During the prompting stage, we use BLIP-2 (Li et al., 2023a), InternVideo (Wang et al., 2022a), Whisper, ChatGPT (OpenAI, 2023), and an audio-tagging model from Schmid et al. (2022). We use the COCO-pretrained BLIP-2 model with nucleus sampling.…”
Section: A Experimental Details (mentioning)
confidence: 99%