SUPERB: Speech Processing Universal PERformance Benchmark

Yang, Shu-Wen; Chi, Po-Han; Chuang, Yung-Sung; Lai, Cheng-I; Lakhotia, Kushal; Lin, Yist Y.; Liu, Andy T.; Shi, Jiatong; Chang, Xuankai; Lin, Gwo‐Fong; Huang, Tsun‐Sheng; Tseng, Wei-Cheng; Lee, Ko-tik; Liu, Da-Rong; Huang, Zili; Dong, Shuyan; Li, Shang-Wen; Watanabe, Shinji; Mohamed, Abdelrahman; Lee, Hung-yi

doi:10.21437/interspeech.2021-1775

Cited by 374 publications

(183 citation statements)

References 0 publications


“…Model L FT-SR FT-D UAR WAR
Yang et al [29] w2v2-L 65.6
10 Pepino et al [30] w2v2-b 67.2
11 Wang et al [23] hubert-L 67.6
12 Yang et al [29] hubert-L 67.6
13 Chen and Rudnicky [31] w2v2-b 69.9
14 Makiuchi et al [32] w2v2-L 70.7
15 Wang et al [23] w2v2-b 73.8
16 Chen and Rudnicky [31] w2v2-b 74.3
17 Wang et al [23] hubert-b 76.6
18 Wang et al [23] w2v2-L 76.8
19 Wang et al [23] w2v2-b 77.0
20 Wang et al [23] w2v2-L 77.5
21 Wang et al [23] hubert-L 79.0
22 Wang et al [23] hubert-L 79.6
call (WAR) on the four emotional categories of anger (1103 utterances), happiness (1636), sadness (1084), and neutral (1708), which is the typical categorical SER formulation for IEMOCAP. Since we are dealing with an unbalanced class problem, UAR and WAR can diverge.…”

Section: Work (mentioning)

confidence: 99%
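The statement above notes that UAR (unweighted average recall) and WAR (weighted average recall) can diverge when classes are unbalanced. The sketch below illustrates why, using the four IEMOCAP class sizes quoted in the statement. The per-class counts of correct predictions are hypothetical values chosen only for illustration and do not come from any of the cited systems.

```python
def uar_war(correct, totals):
    """Return (UAR, WAR) from per-class correct counts and class sizes.

    UAR averages per-class recall, so every class counts equally.
    WAR weights each class recall by its size, which equals overall accuracy.
    """
    recalls = [correct[c] / totals[c] for c in totals]
    uar = sum(recalls) / len(recalls)
    war = sum(correct[c] for c in totals) / sum(totals.values())
    return uar, war


# Class sizes as quoted in the citation statement.
totals = {"anger": 1103, "happiness": 1636, "sadness": 1084, "neutral": 1708}

# Hypothetical correct counts: the larger classes are recognised better.
correct = {"anger": 552, "happiness": 1309, "sadness": 542, "neutral": 1537}

uar, war = uar_war(correct, totals)
print(f"UAR = {uar:.3f}, WAR = {war:.3f}")
```

Because the hypothetical classifier does better on the two largest classes, WAR comes out higher than UAR. If the smaller classes were recognised better, the ordering would flip.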

Dawn of the transformer era in speech emotion recognition: closing the valence gap

Wagner, Triantafyllopoulos, Wierstorf et al. 2022

Preprint

Recent advances in transformer-based architectures which are pre-trained in self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have shown limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Furthermore, our investigations reveal that transformer-based architectures are more robust to small perturbations compared to a CNN-based baseline and fair with respect to biological sex groups, but not towards individual speakers. Finally, we are the first to show that their extraordinary success on valence is based on implicit linguistic information learnt during finetuning of the transformer layers, which explains why they perform on-par with recent multimodal approaches that explicitly utilise textual information. Our findings collectively paint the following picture: transformer-based architectures constitute the new state-of-the-art in SER, but further advances are needed to mitigate remaining robustness and individual speaker issues. To make our findings reproducible, we release the best performing model to the community.
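The abstract reports valence performance as a concordance correlation coefficient (CCC). As a minimal sketch, assuming the standard definition of CCC with population (biased) variances, the metric can be computed as follows. The function name and the sample values are illustrative and are not taken from the paper.

```python
def ccc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)
    Population variances and covariance (divide by n) are assumed here.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    return 2 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)


# Illustrative values only.
gold = [0.1, 0.4, 0.5, 0.9]
shifted = [g + 0.2 for g in gold]  # perfectly correlated but biased predictions

print(ccc(gold, gold))
print(ccc(shifted, gold))
```

Unlike Pearson correlation, CCC penalises a constant offset between predictions and labels, which is why the shifted predictions score below a perfect match.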


“…Recent work shows [27] that speech pre-trained models can solve full stack speech processing tasks, because the model utilizes bottom layers to learn speaker related information and top layers to encode content related information. We evaluate the proposed method on the SUPERB benchmark [19], … 3 compares ILS-SSL BASE and HuBERT BASE on SUPERB, indicating the ILS-SSL is better than HuBERT on content and semantic related tasks, while the performance degradation is observed for the speaker related tasks.…”

3 https://dl.fbaipublicfiles.com/hubert/hubert large ll60k.pt and https://dl.fbaipublicfiles.com/hubert/hubert large ll60k finetune ls960.pt
4 https://github.com/CorentinJ/librispeech-alignments
5 The results for HuBERT evaluation is slightly different from the reported number in their paper since we use different phonetic force-alignment tool.

Section: Evaluation On Non-ASR Tasks (mentioning)

confidence: 94%

“…In HuBERT [13], the authors run k-means clustering on representations of each of 12 blocks of a BASE model and show that cluster assignments from blocks 5-12 have higher mutual information score with force-aligned phonetic transcripts. Furthermore, [19] collect representations from different HuBERT layers and weighted sum them for various downstream tasks where the weights can be learned. The model tends to assign larger weights to top layers for phoneme recognition task, while assign larger weights to bottom layers for speaker-related tasks.…”

Section: Introduction (mentioning)

confidence: 99%
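The statement above describes taking representations from different HuBERT layers and combining them with a learned weighted sum. The sketch below shows only the forward computation, assuming the layer weights are normalised with a softmax over one learnable scalar per layer. That normalisation, the function names, and the toy features are assumptions for illustration, not details confirmed by the statement.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_feats, layer_logits):
    """Combine one feature vector per layer into a single vector.

    layer_feats: list of L vectors with the same dimension (one per layer).
    layer_logits: list of L scalars; in training these would be learned
    jointly with the downstream model (assumption: softmax-normalised).
    """
    weights = softmax(layer_logits)
    dim = len(layer_feats[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_feats))
        for d in range(dim)
    ]


# Toy example: three layers, two-dimensional features.
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

uniform = weighted_layer_sum(feats, [0.0, 0.0, 0.0])   # plain average of layers
top_heavy = weighted_layer_sum(feats, [0.0, 0.0, 5.0]) # mostly the last layer
print(uniform, top_heavy)
```

Inspecting the learned weights after training is what lets one say which layers a task relies on, as in the observation that phoneme recognition favours top layers and speaker tasks favour bottom layers.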

“…We find that ILS-SSL significantly improves the phonetic information learning for the bottom layers of the model. We also evaluate our model on the SUPERB benchmark [19] which includes ten different downstream tasks in four aspects of speech: content, speaker, semantics, and paralinguistics. The results also indicate our model is good at content and semantic related tasks, and not good at speaker related tasks.…”

Section: Introduction (mentioning)

confidence: 99%


Self-Supervised Learning for speech recognition with Intermediate layer supervision

Wang, Wu, Chen et al. 2021

Preprint

Recently, pioneer work finds that speech pre-trained models can solve full stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information. Since the network capacity is limited, we believe the speech recognition performance could be further improved if the model is dedicated to audio content information learning. To this end, we propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL), which forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly, which achieves a 23.5%/11.6% relative word error rate reduction in the w/o language model setting for base/large models. Detailed analysis shows the bottom layers of our model have a better correlation with phonetic units, which is consistent with our intuition and explains the success of our method for ASR. We will release our code and model at https://github.com/microsoft/UniSpeech.


“…Motivated by (i) its proven application to the learning of general neural representations for a range of different tasks [11][12][13][14][15][16][17][18][19], (ii) evidence that fine-tuning with modest quantities of labelled data leads to state-of-the-art results, (iii) encouraging, previously reported results for anti-spoofing [20,21] and (iv) the appeal of one-class classification approaches [22], we have explored the use of self-supervised learning to improve generalisation. Our hypothesis is that better representations trained on diverse speech data, even those learned for other tasks and initially using only bona fide data (hence one-class), may help to reduce over-fitting and hence improve reliability and domain-robustness, particularly in the face of previously unseen spoofing attacks.…”

Section: Introduction (mentioning)

confidence: 99%

Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

Tak, Todisco, Wang et al. 2022

Preprint

The performance of spoofing countermeasure systems depends fundamentally upon the use of sufficiently representative training data. With this usually being limited, current solutions typically lack generalisation to attacks encountered in the wild. Strategies to improve reliability in the face of uncontrolled, unpredictable attacks are hence needed. We report in this paper our efforts to use self-supervised learning in the form of a wav2vec 2.0 front-end with fine tuning. Despite initial base representations being learned using only bona fide data and no spoofed data, we obtain the lowest equal error rates reported in the literature for both the ASVspoof 2021 Logical Access and Deepfake databases. When combined with data augmentation, these results correspond to an improvement of almost 90% relative to our baseline system.
