2021 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn52387.2021.9534474
BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

Abstract: Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. For recognizing sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations. For serving the diverse needs of tasks such as recognition of emotions or music genres, representations should prov…

Cited by 99 publications (88 citation statements); References 40 publications.
“…On the other hand, hybrid representation models, which aim to take advantage of both data-driven and handcrafted features, yielded significantly better performance. The two proposed hybrid models, one using BYOL-A's CNN encoder [19] and one with CvT encoding [37], consistently outperformed their corresponding BYOL-S models. This result suggests that adding the DSP-based supervision to the self-supervised representation learning framework helped to improve its generalization capacity in cognitive/physical load detection tasks.…”
Section: Results
confidence: 90%
“…As in [18], we use the features from layer 19, which are more effective than those of the model's other layers. BYOL-A - Unlike contrastive learning frameworks, Bootstrap Your Own Latent for Audio (BYOL-A) [19] generates audio representations using two augmented views of a single audio sample, inspired by the success of BYOL [36] for image representation. To obtain audio representations, the log-mel spectrogram (LMS) of an input audio sample is first fed to a data augmentation module, yielding two randomly augmented copies of the input LMS (Figure 1).…”
Section: Data-driven Models
confidence: 99%
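The two-view setup quoted above can be sketched minimally: one input log-mel spectrogram is passed through a random augmentation module twice, producing two distinct views of the same sample. This is an illustrative stand-in only — the actual BYOL-A augmentation module differs, and the `random_augment`/`two_views` helpers below are hypothetical names, with a simple random gain shift plus noise standing in for the paper's augmentations.

```python
import numpy as np

def random_augment(lms, rng):
    """Apply a simple random perturbation to a log-mel spectrogram.
    Hypothetical stand-in for BYOL-A's augmentation module: here,
    just a random gain offset plus Gaussian noise."""
    gain = rng.uniform(-6.0, 6.0)            # random gain offset
    noise = rng.normal(0.0, 0.5, lms.shape)  # additive noise
    return lms + gain + noise

def two_views(lms, seed=0):
    """Produce two independently augmented copies of the same input,
    as in the BYOL-style two-view setup described above."""
    rng = np.random.default_rng(seed)
    return random_augment(lms, rng), random_augment(lms, rng)

lms = np.zeros((64, 96))                     # (mel bins, time frames)
v1, v2 = two_views(lms)
assert v1.shape == v2.shape == lms.shape     # same shape as the input
assert not np.allclose(v1, v2)               # but the two views differ
```

Each view is then encoded separately, and the learning objective pulls the two resulting representations together.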
“…Self-supervised learning in speech and audio aims at learning representations that capture high-level information from acoustic signals, which can then be used in diverse sets of downstream tasks. Model weights learned through self-supervision are either used as feature extractors under the linear evaluation protocol [27], [33] or used together with transfer learning for end-to-end fine-tuning with an added prediction head for the downstream task [3], [7], [21]. Features learned through self-supervised speech representation learning have already proven to outperform other low-level features such as filter-banks and mel-frequency cepstral coefficients (MFCCs).…”
Section: Introduction
confidence: 99%
“…Moreover, [3] show that the quality of mined negative samples can affect the performance. On the other hand, the method proposed by [33] uses a momentum encoder, where a moving-average network is used to produce prediction targets for optimizing the MSE loss between two batches of augmented samples of the same audio segments. However, the symmetry-breaking network design has been found crucial for this approach.…”
Section: Introduction
confidence: 99%
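The momentum-encoder mechanism quoted above can be sketched in a few lines: the target network's weights are an exponential moving average of the online network's weights, and the training signal is the MSE between the online network's predictions and the target network's outputs for two augmented batches. The helpers below (`ema_update`, `mse_loss`) are hypothetical minimal forms, not the method's actual implementation.

```python
import numpy as np

def ema_update(target_w, online_w, tau=0.99):
    """Momentum (moving-average) update of the target network weights:
    target <- tau * target + (1 - tau) * online. The target network is
    never updated by gradients, only by this averaging step."""
    return tau * target_w + (1.0 - tau) * online_w

def mse_loss(pred, target):
    """MSE between the online branch's predictions and the target
    branch's outputs for the two augmented views."""
    return float(np.mean((pred - target) ** 2))

online_w = np.ones(4)
target_w = np.zeros(4)
for _ in range(3):
    target_w = ema_update(target_w, online_w, tau=0.9)
# after a few steps the target weights have drifted part-way
# toward the online weights, but still lag behind them
assert np.all(target_w > 0.0) and np.all(target_w < 1.0)
```

The slow-moving target is what provides stable prediction targets; the "symmetry-breaking" design mentioned above refers to the extra predictor head on the online branch only, which prevents the two branches from collapsing to a trivial constant representation.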