Generalization Ability of MOS Prediction Networks

Cooper, Erica; Huang, Wen-Chin; Toda, Tomoki; Yamagishi, Junichi

doi:10.1109/icassp43922.2022.9746395

Cited by 55 publications

(38 citation statements)

References 24 publications


“…In the next experiment, we used two state-of-the-art speech assessment models as our baselines: (1) MOSNet: a model that is based on a CNN-BLSTM architecture for predicting MOS scores [63]; (2) MOS-SSL: a model that uses features from fine-tuned wav2vec 2.0 to predict MOS scores [56]. Both models were trained on the TMHINT-QI dataset with a single-task criterion to predict the quality or intelligibility score separately.…”

Section: Model (mentioning)

confidence: 99%

“…Instead of directly using the outputs, we use the embeddings of these SSL models as the SSL features. For more details, please refer to [52] and [56]. Additionally, MOSA-Net adopts a multi-task learning criterion that simultaneously predicts multiple objective assessment metrics, including speech quality, intelligibility, and distortion scores.…”

Section: Introduction (mentioning)

confidence: 99%


Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features

Zezario, Chen, et al. 2023

IEEE/ACM Trans. Audio Speech Lang. Process.


Non-intrusive speech assessment metrics have garnered significant attention in recent years, and several deep learning-based models have been developed accordingly. Although these models are more flexible than conventional speech assessment metrics, most of them are designed to estimate a specific evaluation score, whereas speech assessment generally involves multiple facets. Herein, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. More specifically, MOSA-Net is designed to estimate the speech quality, intelligibility, and distortion assessment scores of an input test speech signal. It comprises a convolutional neural network and bidirectional long short-term memory (CRNN) architecture for representation extraction, and a multiplicative attention layer and a fully connected layer for each assessment metric. In addition, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learned (SSL) models are used as inputs to combine rich acoustic information from different speech representations to obtain more accurate assessments. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in perceptual evaluation of speech quality (PESQ) prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC by 0.021 (0.985 vs 0.964 in seen noise environments) and 0.047 (0.836 vs 0.789 in unseen noise environments) in short-time objective intelligibility (STOI) prediction, compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction.
Moreover, MOSA-Net, originally trained to assess objective scores, can be used as a pre-trained model to be effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC by 0.018 (0.805 vs 0.787) in mean opinion score (MOS) prediction, compared to MOS-SSL, a strong single-task model for MOS prediction. In light of the confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and


“…Wav2vec 2.0 [1] is a self-supervised framework for speech representation which has been used for a large variety of different speech-related tasks [5,22,23]. One of the main advantages of the wav2vec approach is that a generic pretrained model can be fine-tuned for a specific purpose using only a small amount of labeled data.…”

Section: Model For Audio-based Detection (mentioning)

confidence: 99%

Detection of Prosodic Boundaries in Speech Using Wav2Vec 2.0

Kunešová, Řezáčková 2022

Text, Speech, and Dialogue


Prosodic boundaries in speech are of great relevance to both speech synthesis and audio annotation. In this paper, we apply the wav2vec 2.0 framework to the task of detecting these boundaries in speech signal, using only acoustic information. We test the approach on a set of recordings of Czech broadcast news, labeled by phonetic experts, and compare it to an existing text-based predictor, which uses the transcripts of the same data. Despite using a relatively small amount of labeled data, the wav2vec2 model achieves an accuracy of 94% and F1 measure of 83% on within-sentence prosodic boundaries (or 95% and 89% on all prosodic boundaries), outperforming the text-based approach. However, by combining the outputs of the two different models we can improve the results even further.


“…wav2vec 2.0 was shown to obtain good baseline results for this challenge in SSL-MOS [14]. The speech waveform is input to wav2vec 2.0.…”

Section: Architectures (mentioning)

confidence: 99%

“…The advancements in transformer-based pretrained models such as wav2vec 2.0 [12] and HuBERT [13] enable researchers to explore another semi-supervised method to take advantage of the large amount of speech data that exists without subjective labels. SSL-MOS [14] was one of the baselines provided by the challenge organizers which uses wav2vec 2.0 with a minimal extra layer with promising results.…”

Section: Introduction (mentioning)

confidence: 99%
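
The citation statement above describes SSL-MOS as wav2vec 2.0 plus "a minimal extra layer". As a rough illustration only, the sketch below assumes that layer is mean pooling over time followed by a single linear projection; the excerpt does not confirm this. The function name `predict_mos`, the embedding size, and the random array standing in for encoder outputs are all hypothetical, and no real wav2vec 2.0 model is loaded.

```python
import numpy as np


def predict_mos(frame_embeddings, weight, bias):
    """Mean-pool (T, D) frame embeddings over time, then apply one linear layer.

    Assumption: this pooling-plus-linear head is one plausible reading of
    "a minimal extra layer"; it is not taken from the cited papers.
    """
    pooled = frame_embeddings.mean(axis=0)   # shape (D,)
    return float(pooled @ weight + bias)     # scalar MOS estimate


# Hypothetical stand-in for SSL encoder outputs: 50 frames of 768-dim vectors.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 768))

# An untrained head: zero weights, bias at the middle of a 1-5 MOS scale.
weight = np.zeros(768)
bias = 3.0
score = predict_mos(frames, weight, bias)

# A tiny case that can be checked by hand: every frame is [1, 1, 1].
toy_score = predict_mos(np.ones((4, 3)), np.array([1.0, 2.0, 3.0]), 0.0)
```

In a fine-tuning setup the encoder and this head would be trained jointly on labeled MOS data; the sketch only shows the forward pass of the head.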

Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset

Chinen, Skoglund, Reddy, et al. 2022

Interspeech 2022


Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study looks at the amount of variance that can be explained in subjective ratings of speech quality from metadata and the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features that included rater groups and system identifiers and obtained competitive metrics including a Spearman rank correlation coefficient (SRCC) of 0.934 and MSE of 0.088 at the system-level, and 0.877 and 0.198 at the utterance-level. Using data and metadata that the test restricted or blinded further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level prediction as a result of the wide variation in the number of utterances used for each system on the validation and test datasets. We conclude that, in general, conditions should have enough utterances in the test set to bound the sample mean error, and be relatively balanced in utterance count between systems, otherwise the utterance-level metrics may be more reliable and interpretable.
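
To make the distinction between utterance-level and system-level metrics concrete, here is a minimal sketch that computes SRCC and MSE both ways. It assumes system-level scores are per-system means of utterance scores, which is the common convention but is not spelled out in the abstract. The data are invented toy values, and the helper names (`rankdata`, `srcc`, `mse`, `system_means`) are hypothetical.

```python
import numpy as np


def rankdata(x):
    """1-based ranks; tied values share the average of their positions."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    sorted_x = x[order]
    ranks = np.empty(len(x), dtype=float)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sorted_x[j + 1] == sorted_x[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def srcc(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(rankdata(a), rankdata(b))[0, 1])


def mse(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))


def system_means(system_ids, values):
    """Average utterance scores per system (assumed aggregation rule)."""
    groups = {}
    for s, v in zip(system_ids, values):
        groups.setdefault(s, []).append(v)
    return {s: float(np.mean(v)) for s, v in groups.items()}


# Toy data: system A has three utterances, B and C have one each.
system_ids = ["A", "A", "A", "B", "C"]
true_mos = [4.0, 4.5, 3.5, 2.0, 3.0]
pred_mos = [3.8, 4.1, 3.9, 2.5, 2.4]

utt = {"srcc": srcc(true_mos, pred_mos), "mse": mse(true_mos, pred_mos)}

t, p = system_means(system_ids, true_mos), system_means(system_ids, pred_mos)
systems = sorted(t)
sys_res = {
    "srcc": srcc([t[s] for s in systems], [p[s] for s in systems]),
    "mse": mse([t[s] for s in systems], [p[s] for s in systems]),
    "n_utts": {s: system_ids.count(s) for s in systems},
}
```

In this toy case systems B and C are each represented by a single utterance, so their system means carry no averaging at all; reporting `n_utts` alongside the system-level numbers is one simple way to expose the kind of imbalance the abstract warns about.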
