2019
DOI: 10.48550/arxiv.1904.11469
Preprint

The Zero Resource Speech Challenge 2019: TTS without T

Abstract: We present the Zero Resource Speech Challenge 2019, which proposes to build a speech synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without text). We provide raw audio for a target voice in an unknown language (the Voice dataset), but no alignment, text or labels. Participants must discover subword units in an unsupervised way (using the Unit Discovery dataset) and align them to the voice recordings in a way that works best for the purpose of synthesizing novel utterances…

Cited by 14 publications (24 citation statements)
References 22 publications
“…Each smaller sequence started and ended with a maximum of 20 ms of non-speech. To make results more comparable to [19], we also train our word segmentation system on the English training set from the ZeroSpeech 2019 Challenge [27] and test on Buckeye. ZeroSpeech English set contains around 15 hours of speech from over 100 speakers.…”
Section: Datasets and Evaluation Metrics (mentioning)
confidence: 99%
“…Unit discovery An emerging trend in neural architectures, especially as applied to the speech signal, is the use of mechanisms to enable them to induce discrete, symbol-like internal representations, motivated by concepts such as phonemes and morphemes. Recent editions of the ZeroSpeech challenge (Dunbar et al, 2019) on unit discovery have featured many such approaches. In the visually-grounded setting, (Harwath et al, 2020) adopt the vector-quantization (VQ) approach proposed by van den Oord et al (2017), inserting VQ layers at various points in the speech encoder.…”
Section: Variants and Applications (mentioning)
confidence: 99%
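The vector-quantization bottleneck mentioned in this excerpt snaps each continuous encoder frame to its nearest entry in a learned codebook, so the network emits a sequence of discrete unit ids. Below is a minimal NumPy sketch of that lookup step only (no training losses, no straight-through gradient); the class and variable names are illustrative assumptions, not the implementation used by Harwath et al. (2020) or van den Oord et al. (2017).

import numpy as np

class VectorQuantizer:
    def __init__(self, num_codes: int = 256, dim: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Codebook of discrete "unit" embeddings, one row per code.
        self.codebook = rng.normal(size=(num_codes, dim))

    def quantize(self, frames: np.ndarray):
        """Map each encoder frame (T, dim) to its nearest codebook entry.

        Returns the discrete code indices (the induced symbol sequence)
        and the quantized continuous frames.
        """
        # Squared Euclidean distance from every frame to every code.
        dists = ((frames[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        codes = dists.argmin(axis=1)          # (T,) discrete unit ids
        quantized = self.codebook[codes]      # (T, dim) snapped frames
        return codes, quantized

# Usage: 100 encoder frames of dimension 64 become a sequence of unit ids.
vq = VectorQuantizer()
frames = np.random.default_rng(1).normal(size=(100, 64))
unit_ids, snapped = vq.quantize(frames)
print(unit_ids[:10], snapped.shape)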
“…One feature of the ABX metric is that it is based on a set of tightly controlled stimuli. Alishahi et al (2017) used synthetic audio; in other work such stimuli have been extracted from utterances using aligned phonemic transcriptions (Dunbar et al, 2019;Anonymous, 2021).…”
Section: ABX (mentioning)
confidence: 99%
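The ABX test referenced here compares three stimuli: A and X come from the same phonetic category, B from a contrasting one, and the representation is scored on whether X lies closer to A than to B. The sketch below uses frame-averaged embeddings and cosine distance as an assumed stand-in for the challenge's DTW-based frame alignment; all function names are hypothetical.

import numpy as np

def embed(stimulus: np.ndarray) -> np.ndarray:
    # Collapse a (frames, dim) representation to a single vector; a real
    # evaluation would align frames (e.g. with DTW) instead of averaging.
    return stimulus.mean(axis=0)

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def abx_error_rate(triplets) -> float:
    """triplets: iterable of (A, B, X) arrays, where A and X share a category."""
    errors, total = 0, 0
    for a, b, x in triplets:
        d_ax = cosine_distance(embed(x), embed(a))
        d_bx = cosine_distance(embed(x), embed(b))
        errors += int(d_ax >= d_bx)   # wrong if X is not closer to A than to B
        total += 1
    return errors / max(total, 1)

# Toy check: A and X drawn close together, B drawn elsewhere.
rng = np.random.default_rng(0)
a = rng.normal(size=(20, 16)); x = a + 0.01 * rng.normal(size=a.shape)
b = rng.normal(size=(20, 16)) + 3.0
print(abx_error_rate([(a, b, x)]))   # expected: 0.0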
“…Each smaller sequence started and ended with a maximum of 20 ms of non-speech. To make results more comparable to [37], we also trained our word segmentation system on the English training set from the ZeroSpeech 2019 Challenge [7] and test on Buckeye. ZeroSpeech English set contains around 15 hours of speech from over 100 speakers.…”
Section: Experiments, A. Datasets and Evaluation Metrics (mentioning)
confidence: 99%
“…However, for most spoken languages worldwide, e.g., regional languages, vast amounts of labeled data are not available. Zero Resource speech processing aims to develop alternate techniques that can learn directly from data without any or minimal manual transcriptions [5]-[7]. It has several applications, including preserving endangered languages, building speech interfaces in low-resource languages, and developing predictive models for understanding language evolution [8].…”
Section: Introduction (mentioning)
confidence: 99%