2023
DOI: 10.1109/taslp.2022.3221007
BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations

Abstract: Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. For recognizing sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations. For serving the diverse needs of tasks such as recognition of emotions or music genres, representations should prov…
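As context for the abstract: BYOL-A builds on the BYOL objective, in which an online network predicts a target network's projection of a differently augmented view of the same input, minimizing a normalized MSE equal to 2 minus twice the cosine similarity. A minimal sketch of that loss (function name and vectors are illustrative, not from the paper):

```python
import numpy as np

def byol_loss(pred, target):
    """Normalized MSE used by BYOL: 2 - 2 * cosine_similarity.

    `pred` is the online network's prediction for one augmented view;
    `target` is the (stop-gradient) target projection of the other view.
    """
    p = pred / np.linalg.norm(pred)
    z = target / np.linalg.norm(target)
    return 2.0 - 2.0 * float(np.dot(p, z))

# Identical representations give zero loss; orthogonal ones give the
# maximum of 2, so minimizing the loss pulls the two views together.
v = np.array([1.0, 2.0, 3.0])
assert abs(byol_loss(v, v)) < 1e-9
assert abs(byol_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0])) - 2.0) < 1e-9
```

Because only agreement between views is rewarded, the learned features become invariant to whatever the augmentations perturb (e.g. pitch or timbre shifts), matching the robustness hypothesis stated above.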

Cited by 23 publications (26 citation statements)
References 64 publications
“…1) Linear Evaluation Results: Table IV shows the linear evaluation results on six tasks. For a fair comparison, we compare with other methods that also use Audioset for pretraining and have also reported the linear evaluation results in their papers, including TRILL [45], COLA [3], BYOL-A [5], BYOL-A-v2 [11], SF NFNET-F0 [46] and M2D [15]. The proposed ATST-Clip is developed based on BYOL-A and BYOL-A-V2, using a transformer encoder and a new view creation strategy.…”
Section: B. Results on Clip-level Downstream Tasks
confidence: 99%
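The linear evaluation protocol mentioned in the statement above freezes the pre-trained encoder and fits only a linear classifier on its embeddings. A toy sketch, assuming invented stand-in features (two synthetic clusters in place of real encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen embeddings: two well-separated Gaussian clusters
# stand in for features a pre-trained audio encoder would produce.
X0 = rng.normal(loc=-2.0, size=(50, 8))
X1 = rng.normal(loc=+2.0, size=(50, 8))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# Linear evaluation: train only a linear layer (here via least squares
# on +/-1 targets) on top of the frozen features; the encoder is untouched.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])      # append a bias column
w, *_ = np.linalg.lstsq(Xb, 2.0 * y - 1.0, rcond=None)
pred = (Xb @ w > 0).astype(int)
accuracy = (pred == y).mean()
print(accuracy)  # well-separated clusters -> near-perfect linear accuracy
```

Because the classifier is linear, the score directly measures how linearly separable the frozen representations leave each downstream task, which is why papers such as those compared here report it as a standard benchmark.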
“…C. Results on Frame-level Downstream Task - Sound Event Detection. 1) Comparison Methods: We compare with six SSL pretrained models: BYOL-A-v2 [11], SSAST [6], MAE-AST [7], Audio-MAE [9], BEATs [10] and M2D [15]. Sound event detection requires performing frame-level multi-class classification.…”
Section: B. Results on Clip-level Downstream Tasks
confidence: 99%
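Unlike clip-level evaluation, the frame-level task above asks for one class decision per time frame. A sketch of the output side of such a head, with invented shapes and random logits standing in for a real encoder's frame features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical frame-level logits from an encoder plus linear head:
# one row per time frame, one column per event class.
num_frames, num_classes = 100, 10
logits = rng.normal(size=(num_frames, num_classes))

# Sound event detection typically scores each class independently per
# frame (a sigmoid per class), then thresholds to get an activity mask.
probs = 1.0 / (1.0 + np.exp(-logits))   # per-class sigmoid
active = probs > 0.5                     # boolean event-activity mask

print(active.shape)  # one on/off decision per frame and class
```

This per-frame structure is why the statement contrasts sound event detection with the clip-level tasks, where a single label summarizes the whole recording.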
“…The eGeMAPS is a minimalistic set of acoustic parameters. Finally, we experiment with 4 different types of deep audio embeddings, i.e. VGGish [26], YAMNet, OpenL3 [27], and BYOL-A [28], which are state-of-the-art general audio features pretrained on large audio collections and successfully used for a number of downstream tasks. Characteristics of different audio embeddings are provided in Table III.…”
Section: Feature Extraction and Fusion
confidence: 99%