VLSP 2021 - TTS Challenge: Vietnamese Spontaneous Speech Synthesis

This paper describes our speech synthesis system for the Vietnamese Text-To-Speech track of the 2021 VLSP evaluation campaign. The goal of the challenge is to build a synthetic voice from a provided corpus of spontaneous Vietnamese speech. We propose an implementation of the FastSpeech2 model for spontaneous speech, together with a dedicated strategy for handling spontaneous datasets in the TTS system. The system generates mel-spectrograms from input text and then synthesizes speech from those mel-spectrograms using a separately trained vocoder. In the evaluation, our team achieved a mean opinion score (MOS) of 3.943 on the in-domain test and 3.3 on the out-of-domain test, and an SUS score of 85.00%, which indicates the effectiveness of the proposed system.
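As a reminder of what a MOS figure summarizes, the sketch below averages listener ratings on the usual 1 to 5 scale and attaches a normal-approximation 95% confidence interval. The function name and the ratings are hypothetical illustrations; the abstract does not say how the challenge organizers aggregated their scores.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (mean, 95% CI half-width) for a list of 1-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.fmean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical ratings for illustration only, not data from the challenge.
ratings = [4, 4, 5, 3, 4, 4, 5, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.3f} +/- {ci:.3f}")
```

With only a handful of ratings the interval is wide, which is why differences between systems are best read alongside the number of listeners and test utterances.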

TTS - VLSP 2021: The Thunder Text-To-Speech System

Ánh, Thành, Linh 2022

“…The TTS shared task [14] organized at the eighth workshop of the Association for Vietnamese Language and Speech Processing (VLSP) requires participants to create a Vietnamese TTS system able to synthesize natural sounding audios while having trained on a spontaneous and noisy dataset. To be precise, this year's dataset for TTS uses speech crawled from videos of a female Hanoi YouTuber named "Giang oi".…”

Section: Introduction; citation type: mentioning; confidence: 99%

TTS - VLSP 2021: The NAVI’s Text-To-Speech System for Vietnamese

Nguyen, Quoc, et al. 2022

The Association for Vietnamese Language and Speech Processing (VLSP) has organized a series of workshops intended to bring together researchers and professionals working in NLP and to consolidate research on the Vietnamese language. One of the shared tasks held at the eighth workshop is TTS [14], using a dataset that consists only of spontaneous audio. This poses a challenge for current TTS models, since they perform well only when producing reading-style speech (e.g., audiobooks). In addition, the quality of the audio in the dataset has a large impact on model performance. Specifically, samples with noisy backgrounds or with multiple voices speaking at the same time degrade the performance of our model. In this paper, we describe our approach to this problem: we first preprocess the training data, then use it to train a FastSpeech2 [10] acoustic model with some replacements in the external aligner model, and finally use a HiFiGAN [4] vocoder to construct the waveform. According to the official evaluation of the TTS task in the VLSP 2021 competition, our approach achieves an in-domain MOS of 3.729, an out-of-domain MOS of 3.557, and an SUS score of 79.70%. Audio samples are available at https://navi-tts.github.io/.

“…Dataset of the competition [14] exploited the voice source from a female youtuber. The challenges of using spontaneous speech are i) poor quality (e.g., inconsistent speaking rate)…”

Section: Introduction; citation type: mentioning; confidence: 99%

TTS - VLSP 2021: Development of Smartcall Vietnamese Text-to-Speech

Bao, Hoai, Hoc, et al. 2022

Recent advances in deep learning facilitate the development of end-to-end Vietnamese text-to-speech (TTS) systems with high intelligibility and naturalness when a clean training corpus is available. Given the rich supply of audio recordings on the Internet, TTS has excellent potential for growth if it can take advantage of this data source. However, the quality of such data is often insufficient for training TTS systems, e.g., because the audio is noisy. In this paper, we propose an approach that preprocesses noisy found data from the Internet and trains a high-quality TTS model on the processed data. The VLSP-provided training data was thoroughly preprocessed using 1) voice activity detection, 2) automatic speech recognition-based prosodic punctuation insertion, and 3) Spleeter, a source separation tool, for separating voice from background music. Moreover, we utilize a state-of-the-art TTS system based on the Conditional Variational Autoencoder with Adversarial Learning model. Our experiment showed that the proposed TTS system trained on the preprocessed data achieved a good result on the provided noisy dataset.
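To make the first preprocessing step concrete, here is a minimal energy-threshold sketch of voice activity detection. It is an illustration under stated assumptions, not the Smartcall system's method: the abstract does not say which detector was used, and the function names, frame length, and threshold below are all hypothetical.

```python
import math

def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's RMS."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies

def speech_segments(samples, frame_len=160, threshold=0.02):
    """Return (start_sample, end_sample) spans whose frame RMS exceeds threshold."""
    segments = []
    seg_start = None
    for i, rms in enumerate(frame_energies(samples, frame_len)):
        if rms > threshold and seg_start is None:
            seg_start = i * frame_len          # a loud frame opens a segment
        elif rms <= threshold and seg_start is not None:
            segments.append((seg_start, i * frame_len))  # a quiet frame closes it
            seg_start = None
    if seg_start is not None:
        # Close a segment that runs to the last complete frame.
        segments.append((seg_start, (len(samples) // frame_len) * frame_len))
    return segments

# Synthetic signal: silence, a 440 Hz tone standing in for speech, silence.
sr = 16000
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(3200)]
signal = silence + tone + silence
segments = speech_segments(signal)
print(segments)
```

A fixed energy threshold like this one cannot tell speech from loud background music, which is one reason a pipeline for found data would pair detection with a source separation step such as the one the abstract lists.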