Audio ALBERT: A Lite BERT for Self-Supervised Learning of Audio Representation

Chi, Po-Han; Chung, Pei-Hung; Wu, Tsung-Han; Hsieh, Chun-Cheng; Chen, Yen‐Hao; Li, Shang-Wen; Lee, Hung-yi

doi:10.1109/slt48900.2021.9383575

Cited by 126 publications

(54 citation statements)

References 15 publications

“…The typical usage of this dataset is for ASR (Huang et al, 2020; Zhang et al, 2020). It could also be used for self-supervised training (Chi et al, 2020; Liu et al, 2020), and transfer to the downstream task like phoneme classification, speaker recognition, and sentiment classification.…”

Section: Speech Dataset (mentioning)

confidence: 99%

Exploring Deep Transfer Learning Techniques for Alzheimer's Dementia Detection

Zhu, Liang, Batsis et al. 2021. Front. Comput. Sci.

Examination of speech datasets for detecting dementia, collected via various speech tasks, has revealed links between speech and cognitive abilities. However, the speech dataset available for this research is extremely limited because the collection process of speech and baseline data from patients with dementia in clinical settings is expensive. In this paper, we study the spontaneous speech dataset from a recent ADReSS challenge, a Cookie Theft Picture (CTP) dataset with balanced groups of participants in age, gender, and cognitive status. We explore state-of-the-art deep transfer learning techniques from image, audio, speech, and language domains. We envision that one advantage of transfer learning is to eliminate the design of handcrafted features based on the tasks and datasets. Transfer learning further mitigates the limited dementia-relevant speech data problem by inheriting knowledge from similar but much larger datasets. Specifically, we built a variety of transfer learning models using commonly employed MobileNet (image), YAMNet (audio), Mockingjay (speech), and BERT (text) models. Results indicated that the transfer learning models of text data showed significantly better performance than those of audio data. Performance gains of the text models may be due to the high similarity between the pre-training text dataset and the CTP text dataset. Our multi-modal transfer learning introduced a slight improvement in accuracy, demonstrating that audio and text data provide limited complementary information. Multi-task transfer learning resulted in limited improvements in classification and a negative impact in regression. 
By analyzing the meaning behind the Alzheimer's disease (AD)/non-AD labels and Mini-Mental State Examination (MMSE) scores, we observed that the inconsistency between labels and scores could limit the performance of the multi-task learning, especially when the outputs of the single-task models are highly consistent with the corresponding labels/scores. In sum, we conducted a large comparative analysis of varying transfer learning models focusing less on model customization but more on pre-trained models and pre-training datasets. We revealed insightful relations among models, data types, and data labels in this research area.

“…The motivation for pretraining data with MulT is to capture and model temporal dependencies so we also want the base features to be temporally independent. Thus, even though features extracted from pretrained Speech Transformers such as [3,15,16] are powerful, they are not suitable to be base features for MulT.…”

Section: Feature Selection (mentioning)

confidence: 99%

A Pre-trained Audio-Visual Transformer for Emotion Recognition

Tran, Soleymani 2022. Preprint

In this paper, we introduce a pre-trained audio-visual Transformer trained on more than 500k utterances from nearly 4000 celebrities from the VoxCeleb2 dataset for human behavior understanding. The model aims to capture and extract useful information from the interactions between human facial and auditory behaviors, with application in emotion recognition. We evaluate the model performance on two datasets, namely CREMA-D (emotion classification) and MSP-IMPROV (continuous emotion regression). Experimental results show that fine-tuning the pre-trained model helps improve emotion classification accuracy by 5-7% and Concordance Correlation Coefficients (CCC) in continuous emotion recognition by 0.03-0.09 compared to the same model trained from scratch. We also demonstrate the robustness of fine-tuning the pre-trained model in a low-resource setting. With only 10% of the original training set provided, fine-tuning the pre-trained model can lead to at least 10% better emotion recognition accuracy and a CCC score improvement by at least 0.1 for continuous emotion recognition.

“…Such representations, computed by neural models trained on huge amounts of unlabeled data, have shown their effectiveness on some tasks under certain conditions, for instance in ASR [45], [46], or speech translation [47]. Recently Wav2Vec [48], Mockingjay [46] and Audio ALBERT [49] were introduced in ASR and speaker identification as one of the first pre-trained approaches to extract context dependent features from raw signals for ASR tasks but they have not been used for SER yet. Very recently a BERT-like model for French has been developed [50].…”

Section: Pre-trained Features For NLP (mentioning)

confidence: 99%

Mutual impact of acoustic and linguistic representations for continuous emotion recognition in call-center conversations

Tahon, Macary, Estève et al. 2021. Preprint

The goal of our research is to automatically retrieve the satisfaction and the frustration in real-life call-center conversations. This study focuses on an industrial application in which the customer satisfaction is continuously tracked down to improve customer services. To compensate for the lack of large annotated emotional databases, we explore the use of pre-trained speech representations as a form of transfer learning towards the AlloSat corpus. Moreover, several studies have pointed out that emotion can be detected not only in speech but also in facial traits, in biological responses, or in textual information. In the context of telephone conversations, we can break down the audio information into acoustic and linguistic components by using the speech signal and its transcription. Our experiments confirm the large gain in performance obtained with the use of pre-trained features. Surprisingly, we found that the linguistic content is clearly the major contributor for the prediction of satisfaction and best generalizes to unseen data. Our experiments point to a definitive advantage of using CamemBERT representations; however, the benefit of the fusion of acoustic and linguistic modalities is not as obvious. With models learnt on individual annotations, we found that fusion approaches are more robust to the subjectivity of the annotation task. This study also tackles the problem of performance variability and intends to estimate this variability from different views: weights initialization, confidence intervals and annotation subjectivity. A deep analysis of the linguistic content investigates interpretable factors able to explain the high contribution of the linguistic modality for this task.
