UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022

Saeki, Takaaki; Xin, Detai; Nakata, Wataru; Koriyama, Tomoki; Takamichi, Shinnosuke; Saruwatari, Hiroshi

doi:10.48550/arxiv.2204.02152

Cited by 2 publications

(9 citation statements)

References 21 publications


“…We also use data augmentation and multi-task training described 2.2. Finally, we adapted contrastive loss [4] to boost ranking performance.…”

Section: Improved SSL Baseline (mentioning)

confidence: 99%

“…The MOS trainable metric mimics scores collected by human annotation studies which brings challenges in modeling score variance. Some listeners are more strict than others [2,4], and even a single listener adapts its judgments based on the quality of previously judged recordings.…”

Section: Explaining Noise in MOS Annotations (mentioning)

confidence: 99%

“…The works [2,4] showed that it might be effective to use Listener Dependent (LD) modeling. LD modeling classifies listener explicitly to explain the variance by conditioning it either on listener ID.…”

Section: Listener Dependent (LD) Modeling (mentioning)

confidence: 99%

“…The challenge consists of two tracks. The size and diversity of the main track data allows building robust MOS predictor [3,4,5,6] and the OOD track is meant for testing MOS predictor under strong domain shift with little data for adaptation.…”

Section: Introduction (mentioning)

confidence: 99%

“…In our paper, we present fine-tuned methods for achieving state-of-the-art results for MOS prediction on the VoiceMOS dataset including SSL baseline [3], contrastive loss [4], multi task learning using [7], and augmentation [8,9]. Our improved baseline achieved fourth place and third place on Main and OOD track respectively by using a pretrained SSL speech…”

(Footnote fused into the excerpt: This work was supported by the Charles University GAUK grant number 40222, and the ERC grant number 101039303.)

Section: Introduction (mentioning)

confidence: 99%


MooseNet: A trainable metric for synthesized speech with plda backend

Plátek, Dušek

2023

Preprint


We present MooseNet, a trainable speech metrics that predict listeners' Mean Opinion Score (MOS). We report improvements to the challenge baselines using easy-to-use modeling techniques which also scales for larger self-supervised learning (SSL) model. We present two models. The first model is a Neural Network (NN). As a second model, we propose a PLDA generative model on top layers of the first NN model, which improves the pure NN model. Ensembles from our two models achieve the top 3 or 4 VoiceMOS leaderboard place on all system and utterance level metrics for both main and OOD tracks.
