Interspeech 2021
DOI: 10.21437/interspeech.2021-1775

SUPERB: Speech Processing Universal PERformance Benchmark

Cited by 374 publications (183 citation statements). References: 0 publications.

“…Comparison with prior IEMOCAP SER results (model, backbone, score):

    Yang et al. [29]          w2v2-L    65.6
    Pepino et al. [30]        w2v2-b    67.2
    Wang et al. [23]          hubert-L  67.6
    Yang et al. [29]          hubert-L  67.6
    Chen and Rudnicky [31]    w2v2-b    69.9
    Makiuchi et al. [32]      w2v2-L    70.7
    Wang et al. [23]          w2v2-b    73.8
    Chen and Rudnicky [31]    w2v2-b    74.3
    Wang et al. [23]          hubert-b  76.6
    Wang et al. [23]          w2v2-L    76.8
    Wang et al. [23]          w2v2-b    77.0
    Wang et al. [23]          w2v2-L    77.5
    Wang et al. [23]          hubert-L  79.0
    Wang et al. [23]          hubert-L  79.6

Unweighted average recall (UAR) and weighted average recall (WAR) are reported on the four emotional categories of anger (1103 utterances), happiness (1636), sadness (1084), and neutral (1708), which is the typical categorical SER formulation for IEMOCAP. Since we are dealing with an unbalanced class problem, UAR and WAR can diverge.…”
Section: Work (mentioning)
confidence: 99%
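To make the UAR/WAR distinction concrete, the following is a minimal sketch of both metrics; the helper name and arrays are illustrative, not from the cited work:

```python
import numpy as np

def uar_war(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    """UAR averages per-class recall with equal class weight; WAR weights
    each class's recall by its frequency (equivalent to overall accuracy)."""
    classes = np.unique(y_true)
    # Recall for class c: fraction of c's samples that were predicted as c.
    recalls = np.array([(y_pred[y_true == c] == c).mean() for c in classes])
    counts = np.array([(y_true == c).sum() for c in classes])
    return recalls.mean(), (recalls * counts).sum() / counts.sum()
```

On a skewed label distribution such as IEMOCAP's four-class setup, a classifier biased toward the majority class can post a high WAR while its UAR stays low, which is why the two numbers can diverge.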
“…Recent work [27] shows that speech pre-trained models can solve full-stack speech processing tasks, because the model uses its bottom layers to learn speaker-related information and its top layers to encode content-related information. We evaluate the proposed method on the SUPERB benchmark [19]. Table 3 compares ILS-SSL BASE and HuBERT BASE on SUPERB, indicating that ILS-SSL is better than HuBERT on content- and semantics-related tasks, while performance degradation is observed for the speaker-related tasks. (Footnote 3: https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt and https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k_finetune_ls960.pt; footnote 4: https://github.com/CorentinJ/librispeech-alignments; footnote 5: the HuBERT evaluation results differ slightly from the numbers reported in their paper since we use a different phonetic force-alignment tool.)…”
Section: Evaluation on Non-ASR Tasks (mentioning)
confidence: 94%
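Footnote 5 in the statement above notes that phone-level evaluation numbers depend on the force-alignment tool; analyses of this kind (including the HuBERT analysis quoted next) typically score discrete units by their mutual information with frame-level phone labels. A minimal sketch, assuming scikit-learn/scipy and hypothetical toy alignments:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

# Hypothetical frame-level data: one force-aligned phone label and one
# k-means cluster id per feature frame of the corpus.
phones = np.array(["AH", "AH", "T", "T", "T", "S", "S", "AH"])
clusters = np.array([3, 3, 7, 7, 1, 5, 5, 3])

mi = mutual_info_score(phones, clusters)            # MI in nats
_, counts = np.unique(phones, return_counts=True)
pnmi = mi / entropy(counts / counts.sum())          # normalize by phone entropy
print(f"MI = {mi:.3f} nats, phone-normalized MI = {pnmi:.3f}")
```

Normalizing by the phone entropy bounds the score in [0, 1], so cluster inventories from different layers or of different sizes can be compared on the same scale.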
“…In HuBERT [13], the authors run k-means clustering on the representations from each of the 12 blocks of a BASE model and show that the cluster assignments from blocks 5-12 have a higher mutual information score with force-aligned phonetic transcripts. Furthermore, [19] collect representations from different HuBERT layers and compute their weighted sum for various downstream tasks, where the weights can be learned. The model tends to assign larger weights to top layers for the phoneme recognition task, while assigning larger weights to bottom layers for speaker-related tasks.…”
Section: Introduction (mentioning)
confidence: 99%
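The learnable layer-weighting described above is compact enough to sketch directly; the module below is a minimal PyTorch illustration of SUPERB-style probing (class and variable names are ours, not from [19]):

```python
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    """Softmax-normalized learnable weighted sum over the hidden states of a
    frozen upstream model: the SSL encoder stays fixed, and only these scalar
    weights (plus the downstream head) are trained per task."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.raw_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one (batch, time, dim) tensor per transformer block.
        stacked = torch.stack(hidden_states, dim=0)   # (L, B, T, D)
        w = torch.softmax(self.raw_weights, dim=0)    # (L,)
        return torch.einsum("l,lbtd->btd", w, stacked)
```

After training, inspecting torch.softmax(raw_weights, dim=0) yields exactly the kind of layer-importance profile described in the statement above: most mass on upper layers for phoneme recognition and on lower layers for speaker-related tasks.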
“…Motivated by (i) its proven application to the learning of general neural representations for a range of different tasks [11][12][13][14][15][16][17][18][19], (ii) evidence that fine-tuning with modest quantities of labelled data leads to state-of-the-art results, (iii) encouraging, previously reported results for anti-spoofing [20,21], and (iv) the appeal of one-class classification approaches [22], we have explored the use of self-supervised learning to improve generalisation. Our hypothesis is that better representations trained on diverse speech data, even those learned for other tasks and initially using only bona fide data (hence one-class), may help to reduce over-fitting and hence improve reliability and domain-robustness, particularly in the face of previously unseen spoofing attacks.…”
Section: Introduction (mentioning)
confidence: 99%