BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations

Niizumi, Daisuke; Takeuchi, Daisuke; Ohishi, Yasunori; Harada, Noboru; Kashino, Kunio

doi:10.1109/taslp.2022.3221007

Cited by 23 publications

(26 citation statements)

References 64 publications

“…1) Linear Evaluation Results: Table IV shows the linear evaluation results on six tasks. For a fair comparison, we compare with other methods that also use Audioset for pretraining and have also reported the linear evaluation results in their papers, including TRILL [45], COLA [3], BYOL-A [5], BYOL-A-v2 [11], SF NFNET-F0 [46] and M2D [15]. The proposed ATST-Clip is developed based on BYOL-A and BYOL-A-V2, using a transformer encoder and a new view creation strategy.…”

Section: B Results On Clip-level Downstream Tasks (mentioning)

confidence: 99%

“…C. Results on Frame-level Downstream Task - Sound Event Detection 1) Comparison Methods: We compare with six SSL pretrained models: BYOL-A-v2 [11], SSAST [6], MAE-AST [7], Audio-MAE [9], BEATs [10] and M2D [15]. Sound event detection requires to perform frame-level multi-class classification.…”

Section: B Results On Clip-level Downstream Tasks (mentioning)

confidence: 99%

“…ATST-Clip requires two different randomly cropped segments and the two segments have only a certain portion of overlap, while ATST-Frame asks for a frame-to-frame correspondence between the two views. Besides, ATST-Clip and other clip-level contrastive audio pre-training methods [31] [5] [11] largely leverage the RRC augmentation [5] to achieve a good performance, but RRC will distort the frame-to-frame correspondence.…”

Section: E Combine Atst-clip and Atst-frame (mentioning)

confidence: 99%
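
The cropping idea in the statement above can be illustrated with a minimal sketch. It assumes a clip is indexed by frame number and that both views are uniform random crops of equal length; the function names `sample_two_segments` and `overlap_frames` are hypothetical, and the citing paper's actual sampling rule is not given in the quoted text.

```python
import random


def sample_two_segments(num_frames, seg_len, rng):
    """Pick two random crops of seg_len frames from a clip of num_frames frames.

    Returns two (start, end) pairs with end exclusive. This is an
    illustrative assumption, not the sampling rule of any specific paper.
    """
    if seg_len > num_frames:
        raise ValueError("segment is longer than the clip")
    max_start = num_frames - seg_len
    start_a = rng.randint(0, max_start)  # randint is inclusive on both ends
    start_b = rng.randint(0, max_start)
    return (start_a, start_a + seg_len), (start_b, start_b + seg_len)


def overlap_frames(seg_a, seg_b):
    """Number of frames shared by two (start, end) segments."""
    return max(0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))


rng = random.Random(0)
seg_a, seg_b = sample_two_segments(num_frames=1000, seg_len=600, rng=rng)
shared = overlap_frames(seg_a, seg_b)
print(seg_a, seg_b, shared)
```

Unless both crops happen to start at the same frame, frame i of one view and frame i of the other come from different positions in the clip, which is why a clip-level crop of this kind does not preserve frame-to-frame correspondence.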

“…In terms of the training criterion, a portion of previous methods focus on learning global representation of an audio clip by using clip-level training criteria [3] [5] [11], while others propose learning local frame-wise or patch-wise representations by using frame-level [6] [7] or patch-level criteria [6] [7] [14] [15] [9] [10]. Most of the clip-level methods allow for extracting frame-wise representations from the intermediate output.…”

Section: Introduction (mentioning)

confidence: 99%

ATST: Audio Representation Learning with Teacher-Student Transformer

Li¹, Li²

2022

Interspeech 2022

In recent years, self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. The ultimate goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, generally including clip-level and frame-level tasks. Clip-level tasks classify the scene or sound of an entire audio clip, e.g. audio tagging and instrument recognition, while frame-level tasks detect event-level timestamps within an audio clip, e.g. sound event detection and speaker diarization. Prior studies primarily evaluate on clip-level downstream tasks. Frame-level tasks are important for fine-grained acoustic scene/event understanding, and are generally more challenging than clip-level tasks. In order to tackle both clip-level and frame-level tasks, this paper proposes two self-supervised audio representation learning methods: ATST-Clip and ATST-Frame, responsible for learning clip-level and frame-level representations, respectively. ATST stands for Audio Teacher-Student Transformer, which means both methods use a transformer encoder and a teacher-student training scheme. Within the teacher-student training scheme, the key to learning meaningful representations is to create two different views (a positive pair) of an audio clip that balance the difficulty of the pre-training task well. We have carefully designed the view creation strategy for ATST-Clip and ATST-Frame. Specifically, ATST-Clip uses segment-wise data augmentation, and ATST-Frame integrates frame-wise data augmentation and masking. Experimental results show that our ATST-Frame model obtains state-of-the-art (SOTA) performance on most of the clip-level and frame-level downstream tasks. In particular, it outperforms other models by a large margin on the frame-level sound event detection task. In addition, the performance can be further improved by combining the two models through knowledge distillation.
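
The abstract does not say how the teacher is obtained from the student. One common choice, used by BYOL, is to keep the teacher as an exponential moving average (EMA) of the student's weights; the sketch below assumes that scheme purely for illustration. The function name `ema_update`, the flat list of weights, and the decay value are all hypothetical.

```python
def ema_update(teacher, student, decay):
    """Move each teacher weight a fraction (1 - decay) toward the student weight.

    Assumption: the teacher is an EMA of the student, as in BYOL. The quoted
    abstract only says "teacher-student" and does not confirm this detail.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same size")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy weights: the student stays fixed here so the teacher's drift is easy to follow.
teacher = [0.0, 0.0]
student = [1.0, -2.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
print(teacher)
```

In practice the decay is usually much closer to 1 than the 0.5 used here, so the teacher changes slowly and provides stable targets for the student.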

“…The eGeMAPS is a minimalistic set of acoustic Finally, we experiment with 4 different types of deep audio embeddings, i.e. VGGish [26], YAMNet, OpenL3 [27], and BYOL-A [28], which are state-of-the-art general audio features pretrained on large audio collections that are successfully used for a number of downstream tasks. Characteristics of different audio embeddings are provided in Table III.…”

Section: Feature Extraction and Fusion (mentioning)

confidence: 99%

Digital Voice-Based Biomarker for Monitoring Respiratory Quality of Life: Findings from the Colive Voice Study

Despotovic, Elbéji, Fünfgeld et al. 2023

Preprint

Regular monitoring of respiratory quality of life (RQoL) is essential in respiratory healthcare, facilitating prompt diagnosis and tailored treatment for chronic respiratory diseases. Voice alterations resulting from respiratory conditions create unique audio signatures that can potentially be utilized for disease screening or monitoring. Analyzing data from 1908 participants from the Colive Voice study, which collects standardized voice recordings alongside comprehensive demographic, epidemiological, and patient-reported outcome data, we evaluated various strategies to estimate RQoL from voice, including handcrafted acoustic features, standard acoustic feature sets, and advanced deep audio embeddings derived from pretrained convolutional neural networks. We compared models using clinical features alone, voice features alone, and a combination of both. The multimodal model combining clinical and voice features demonstrated the best performance, achieving an accuracy of 70.34% and an area under the receiver operating characteristic curve (AUROC) of 0.77, an improvement of 5% in accuracy and 7% in AUROC compared to the model utilizing voice features alone. Incorporating vocal biomarkers significantly enhanced the predictive capacity of clinical variables across all acoustic feature types, with a net classification improvement (NRI) of up to 0.19. Our digital voice-based biomarker is capable of accurately predicting RQoL, either as an alternative to or in conjunction with clinical measures, and could be used to facilitate rapid screening and remote monitoring of respiratory health status.
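
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how accuracy and AUROC can be computed for a binary classifier. The labels and scores are made-up toy values, not data from the study, and the helper names `accuracy` and `auroc` are hypothetical.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auroc(labels, scores):
    """Probability that a random positive scores above a random negative.

    Ties count as half a win. This pairwise form is quadratic in the number
    of examples, which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (invented values, for illustration only).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(accuracy(labels, scores), auroc(labels, scores))
```

Accuracy depends on the chosen decision threshold, whereas AUROC summarizes ranking quality across all thresholds, which is why studies often report both.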